About this model
Gemma 3 27B is the largest model in Google's third-generation open Gemma family, offered here as a privacy-focused build that runs inside a Trusted Execution Environment (TEE), with hardware attestation evidence available so users can independently verify the runtime. The underlying model is a decoder-only transformer paired with a SigLIP vision encoder, so it can analyze images alongside text. It also supports 140+ languages and follows instructions with structured outputs and function calling.
Compared with its predecessor, Gemma 2, the jump is substantial. Gemma 3 adds the multimodal vision-language understanding that Gemma 2 lacked, expands the context window from 8K to 128K tokens, and adopts a Gemini-style tokenizer for stronger multilingual coverage. Hugging Face's release notes describe an interleaved local-to-global attention design that reduces KV-cache memory during long-context inference relative to earlier Gemma designs.
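To make the KV-cache point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that local layers cache at most a fixed sliding window of tokens while global layers cache the full context. The function name kv_cache_bytes and every number in the example (60 layers, 16 KV heads, head size 128, a 1024-token window, 5 local layers per global layer, 2 bytes per value) are illustrative assumptions, not figures taken from this page or confirmed for Gemma 3 27B.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, local_per_global=0, window=None):
    """Estimate KV-cache size in bytes for a single sequence.

    local_per_global == 0 means every layer attends globally.
    Otherwise, one layer in each group of (local_per_global + 1) is global
    and the remaining layers cache at most `window` tokens.
    """
    # Each layer stores a key and a value vector per KV head for every cached token.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value

    if local_per_global == 0:
        n_global, n_local = n_layers, 0
    else:
        n_global = n_layers // (local_per_global + 1)
        n_local = n_layers - n_global

    # Local layers never hold more than `window` tokens.
    local_len = context_len if window is None else min(context_len, window)
    return per_token_per_layer * (n_global * context_len + n_local * local_len)


# Hypothetical configuration, chosen only to illustrate the effect.
cfg = dict(context_len=128 * 1024, n_layers=60, n_kv_heads=16, head_dim=128)

all_global = kv_cache_bytes(**cfg)
interleaved = kv_cache_bytes(**cfg, local_per_global=5, window=1024)

print(f"all-global layers : {all_global / 2**30:.1f} GiB")
print(f"interleaved layers: {interleaved / 2**30:.1f} GiB")
print(f"ratio             : {interleaved / all_global:.2f}")
```

Under these assumptions the saving comes almost entirely from the local layers, whose cache stops growing once the context exceeds the window; the global layers still scale linearly with context length.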
On Google-reported evaluations in the model card, the 27B instruction-tuned model scores 67.5 on MMLU-Pro, 89.0 on MATH, 42.4 on GPQA Diamond, and 64.9 on MMMU. Within this catalog, it is part of a broader Google lineup that includes the newer Gemma 4 31B Instruct and the standard Google Gemma 3 27B Instruct release, alongside siblings such as Gemini 3.5 Flash. The Gemma license governs usage, and quantization-aware trained (QAT) variants make local deployment on a single GPU feasible.
This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.
Research & Papers
28 reference papers linked from the HuggingFace model card.
HellaSwag: Can a Machine Really Finish Your Sentence? (2019)
Rowan Zellers, Ari Holtzman, Yonatan Bisk et al.
Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT,…
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions (2019)
Christopher Clark, Kenton Lee, Ming-Wei Chang et al.
In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often…
PIQA: Reasoning about Physical Commonsense in Natural Language (2019)
Yonatan Bisk, Rowan Zellers, Ronan Le Bras et al.
To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question…
SocialIQA: Commonsense Reasoning about Social Interactions (2019)
Maarten Sap, Hannah Rashkin, Derek Chen et al.
We introduce Social IQa, the first largescale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell…
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension (2017)
Mandar Joshi, Eunsol Choi, Daniel S. Weld et al.
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that…
On the Measure of Intelligence (2019)
François Chollet
To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as…
WinoGrande: An Adversarial Winograd Schema Challenge at Scale (2019)
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula et al.
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word…
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (2019)
Dheeru Dua, Yizhong Wang, Pradeep Dasigi et al.
Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new…
Measuring Massive Multitask Language Understanding (2020)
Dan Hendrycks, Collin Burns, Steven Basart et al.
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving…
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models (2023)
Wanjun Zhong, Ruixiang Cui, Yiduo Guo et al.
Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately…
Measuring Mathematical Problem Solving With the MATH Dataset (2021)
Dan Hendrycks, Collin Burns, Saurav Kadavath et al.
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each…
Training Verifiers to Solve Math Word Problems (2021)
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian et al.
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality…
GPQA: A Graduate-Level Google-Proof Q&A Benchmark (2023)
David Rein, Betty Li Hou, Asa Cooper Stickland et al.
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach…
Program Synthesis with Large Language Models (2021)
Jacob Austin, Augustus Odena, Maxwell Nye et al.
This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in…
Evaluating Large Language Models Trained on Code (2021)
Mark Chen, Jerry Tworek, Heewoo Jun et al.
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional…
Language Models are Multilingual Chain-of-Thought Reasoners (2022)
Freda Shi, Mirac Suzgun, Markus Freitag et al.
We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically…
The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation (2021)
Naman Goyal, Cynthia Gao, Vishrav Chaudhary et al.
One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low…
On the Cross-lingual Transferability of Monolingual Representations (2019)
Mikel Artetxe, Sebastian Ruder, Dani Yogatama
State-of-the-art unsupervised multilingual models (e.g., multilingual BERT) have been shown to generalize in a zero-shot cross-lingual setting. This generalization ability has been attributed to the use of a shared subword vocabulary and joint training across multiple languages…
WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects (2025)
Daniel Deutsch, Eleftheria Briakou, Isaac Caswell et al.
As large language models (LLM) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24…
ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer (2025)
Omer Goldman, Uri Shaham, Dan Malkin et al.
To achieve equitable performance across languages, large language models (LLMs) must be able to abstract knowledge beyond the language in which it was learnt. However, the current literature lacks reliable ways to measure LLMs' capability of such cross-lingual knowledge…
IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages (2024)
Harman Singh, Nitish Gupta, Shikhar Bharadwaj et al.
As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 Billion people. To facilitate research on multilingual LLM…
InfographicVQA (2021)
Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito et al.
Infographics are documents designed to effectively communicate information using a combination of textual, graphical and visual elements. In this work, we explore the automatic understanding of infographic images by using Visual Question Answering technique. To this end, we…
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (2023)
Xiang Yue, Yuansheng Ni, Kai Zhang et al.
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and…
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning (2022)
Ahmed Masry, Do Xuan Long, Jia Qing Tan et al.
Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing…
BLINK: Multimodal Large Language Models Can See but Not Perceive (2024)
Xingyu Fu, Yushi Hu, Bangzheng Li et al.
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence,…
TallyQA: Answering Complex Counting Questions (2018)
Manoj Acharya, Kushal Kafle, Christopher Kanan
Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do…
SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition (2019)
Kaiyu Yang, Olga Russakovsky, Jia Deng
Understanding the spatial relations between objects in images is a surprisingly challenging task. A chair may be "behind" a person even if it appears to the left of the person in the image (depending on which way the person is facing). Two students that appear close to each…
Gemini: A Family of Highly Capable Multimodal Models (2023)
Gemini Team, Rohan Anil, Sebastian Borgeaud et al.
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to…
Data sources: Venice API · HuggingFace · Wikipedia · arXiv