Google · 💬 Text Generation

Gemma 3 27B 🔒 Private

Web Search · E2EE · Private

Quick reference

Gemma 3 27B — TLDR

🔒 Runs in a Trusted Execution Environment with hardware attestation.
🧠 Google's 27B open-weight multimodal vision-language model.
👁️ Adds native image understanding via a SigLIP vision encoder.
🌐 Understands 140+ languages using a Gemini-derived tokenizer.
📏 Natively supports 128k context; Venice exposes a 40,000-token window.
🆕 Big jump from Gemma 2: multimodality, longer context, RoPE rescaling.
🔧 Adds function calling and structured outputs.
🔒 Attestation evidence available for independent verification.
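
As a rough illustration of what function calling and structured outputs involve on the client side, the sketch below describes a tool in the JSON-schema style common to OpenAI-compatible chat APIs and checks a model's tool-call arguments against it. The tool name `get_weather`, its parameters and the sample arguments are hypothetical, and whether Venice accepts exactly this shape is an assumption to confirm against its API documentation.

```python
import json

# Hypothetical tool description (an assumption, not taken from Venice docs).
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def validate_arguments(raw_arguments: str, tool: dict) -> dict:
    """Parse a tool call's JSON arguments and check them against the schema.

    Deliberately minimal: it covers required keys, unknown keys, string
    types and enums only. A real client would use a full JSON Schema validator.
    """
    schema = tool["function"]["parameters"]
    args = json.loads(raw_arguments)  # raises ValueError on malformed JSON
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    for key in schema.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            raise ValueError(f"unexpected argument: {key}")
        if spec["type"] == "string" and not isinstance(value, str):
            raise ValueError(f"{key} must be a string")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{key} must be one of {spec['enum']}")
    return args
```

Validating on the client is a defensive choice: a model can return malformed or incomplete arguments even when it has been given a schema.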

💰 Pricing

$0.140 / $0.500

per 1M tokens · input / output
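
To make the pricing concrete, here is a small sketch that turns the listed rates into a per-request estimate. It assumes the rates are US dollars per one million tokens and that input and output are billed separately; the function name and the example token counts are illustrative only.

```python
from decimal import Decimal

# Rates as listed on this page, assumed to be USD per 1,000,000 tokens.
INPUT_USD_PER_MILLION = Decimal("0.140")
OUTPUT_USD_PER_MILLION = Decimal("0.500")


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one request from its token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / Decimal(1_000_000)


# Example: a 30,000-token prompt with a 2,000-token reply.
print(estimate_cost_usd(30_000, 2_000))
```

`Decimal` is used instead of floats so that small per-request amounts add up without binary rounding noise.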

📏 Context

40K tokens

📅 On Venice since

Mar 18, 2026

124 days ago

Provider

Google

Google is an American multinational technology corporation and one of the world's most valuable brands. A subsidiary of parent company Alphabet Inc., Google operates across search, cloud computing, consumer electronics, and artificial intelligence. Its…


30 models on Venice

11 video · 10 text · 3 image · 3 inpaint · 1 music · 1 embedding · 1 tts

Since Oct 15, 2024


About this model

Gemma 3 27B is the largest model in Google's Gemma 3 open-weight family: a vision-language model with roughly 27 billion parameters that takes text and images as input and produces text. In this Venice deployment it runs inside a Trusted Execution Environment (TEE) with hardware attestation, so users can independently verify that prompts are processed in an encrypted, isolated enclave. The TEE is a privacy layer around the deployment, not a change to the underlying weights. Venice exposes a 40,000-token context window, although the base model natively supports up to 128k tokens.
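
The 40,000-token window matters when sizing requests. The sketch below shows one way to budget it, assuming the prompt and the reply share a single window (typical for chat APIs, but not confirmed by this page). Token counts are taken as given; how they are measured depends on the tokenizer and is outside this example. All names are illustrative.

```python
VENICE_CONTEXT_TOKENS = 40_000  # context window Venice exposes, per this page


def remaining_output_budget(prompt_tokens: int,
                            context_window: int = VENICE_CONTEXT_TOKENS) -> int:
    """Tokens left for the reply once the prompt is counted.

    Assumes the prompt and the reply share one window (an assumption).
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(0, context_window - prompt_tokens)


def fits_in_window(prompt_tokens: int, max_output_tokens: int,
                   context_window: int = VENICE_CONTEXT_TOKENS) -> bool:
    """True when the prompt plus the requested reply length fit the window."""
    if prompt_tokens < 0 or max_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_output_tokens <= context_window
```

A document that would fit the model's native 128k context may therefore need to be chunked or summarized before it fits this deployment.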

Compared with its predecessor, Gemma 2, the generational gains are substantial. Google added multimodality: the 4B, 12B and 27B models use a custom SigLIP vision encoder that lets them interpret images and short video. Context length grew from Gemma 2's 8k tokens to 128k, helped by RoPE rescaling that raises the base frequency of the global attention layers from 10k to 1M. A new Gemini-style tokenizer and a revised data mixture improved coverage across 140+ languages. Google also reports better math, coding and instruction-following, along with architectural changes that cut KV-cache memory during long-context inference.
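
To see why raising the RoPE base frequency helps with long context, the sketch below uses the standard rotary-embedding formula, in which dimension pair i rotates at rate base^(-2i/d) per position, and compares the longest wavelength at base 10k and base 1M. The head dimension of 128 is an assumption chosen for illustration, and this is not Google's implementation.

```python
import math


def rope_inverse_frequencies(head_dim: int, base: float) -> list[float]:
    """Rotation rate per position for each pair: base ** (-2i / head_dim)."""
    if head_dim <= 0 or head_dim % 2:
        raise ValueError("head_dim must be a positive even number")
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions needed for the slowest-rotating pair to complete one turn."""
    slowest = rope_inverse_frequencies(head_dim, base)[-1]
    return 2 * math.pi / slowest


HEAD_DIM = 128  # assumed head dimension, for illustration only

old = longest_wavelength(HEAD_DIM, 10_000.0)     # base 10k
new = longest_wavelength(HEAD_DIM, 1_000_000.0)  # base 1M
print(f"base 10k: {old:,.0f} positions; base 1M: {new:,.0f} positions")
```

Under these assumptions, the slowest pair completes a full turn in fewer than 128k positions at base 10k, so distant positions can look alike. At base 1M its wavelength exceeds 128k positions, which keeps positions distinguishable across the full window.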

According to the benchmarks reported on the Hugging Face model card, the 27B instruction-tuned model scores 67.5 on MMLU-Pro and 89.0 on MATH, alongside multimodal results such as 64.9 on MMMU. Within Google's broader lineup this model has since been succeeded by newer generations, including Gemma 4 31B Instruct and Gemma 4 26B A4B Uncensored, which offer further architectural updates. Gemma 3 27B remains a capable, resource-efficient open option for chat, document analysis and vision-language tasks.


Sources

google/gemma-3-27b-it · Hugging Face (huggingface.co)

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Research & Papers

28 reference papers linked from the Hugging Face model card.

arXiv:1905.07830 · May 2019

HellaSwag: Can a Machine Really Finish Your Sentence? (2019)

Rowan Zellers, Ari Holtzman, Yonatan Bisk et al.

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT,…

arXiv:1905.10044 · May 2019

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions (2019)

Christopher Clark, Kenton Lee, Ming-Wei Chang et al.

In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often…

arXiv:1911.11641 · Nov 2019

PIQA: Reasoning about Physical Commonsense in Natural Language (2019)

Yonatan Bisk, Rowan Zellers, Ronan Le Bras et al.

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question…

arXiv:1904.09728 · Apr 2019

SocialIQA: Commonsense Reasoning about Social Interactions (2019)

Maarten Sap, Hannah Rashkin, Derek Chen et al.

We introduce Social IQa, the first largescale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell…

arXiv:1705.03551 · May 2017

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension (2017)

Mandar Joshi, Eunsol Choi, Daniel S. Weld et al.

We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that…

arXiv:1911.01547 · Nov 2019

On the Measure of Intelligence (2019)

François Chollet

To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as…

arXiv:1907.10641 · Jul 2019

WinoGrande: An Adversarial Winograd Schema Challenge at Scale (2019)

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula et al.

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word…

arXiv:1903.00161 · Mar 2019

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (2019)

Dheeru Dua, Yizhong Wang, Pradeep Dasigi et al.

Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new…

arXiv:2009.03300 · Sep 2020

Measuring Massive Multitask Language Understanding (2020)

Dan Hendrycks, Collin Burns, Steven Basart et al.

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving…

arXiv:2304.06364 · Apr 2023

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models (2023)

Wanjun Zhong, Ruixiang Cui, Yiduo Guo et al.

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately…

arXiv:2103.03874 · Mar 2021

Measuring Mathematical Problem Solving With the MATH Dataset (2021)

Dan Hendrycks, Collin Burns, Saurav Kadavath et al.

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each…

arXiv:2110.14168 · Oct 2021

Training Verifiers to Solve Math Word Problems (2021)

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian et al.

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality…

arXiv:2311.12022 · Nov 2023

GPQA: A Graduate-Level Google-Proof Q&A Benchmark (2023)

David Rein, Betty Li Hou, Asa Cooper Stickland et al.

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach…

arXiv:2108.07732 · Aug 2021

Program Synthesis with Large Language Models (2021)

Jacob Austin, Augustus Odena, Maxwell Nye et al.

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in…

arXiv:2107.03374 · Jul 2021

Evaluating Large Language Models Trained on Code (2021)

Mark Chen, Jerry Tworek, Heewoo Jun et al.

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional…

arXiv:2210.03057 · Oct 2022

Language Models are Multilingual Chain-of-Thought Reasoners (2022)

Freda Shi, Mirac Suzgun, Markus Freitag et al.

We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically…

arXiv:2106.03193 · Jun 2021

The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation (2021)

Naman Goyal, Cynthia Gao, Vishrav Chaudhary et al.

One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low…

arXiv:1910.11856 · Oct 2019

On the Cross-lingual Transferability of Monolingual Representations (2019)

Mikel Artetxe, Sebastian Ruder, Dani Yogatama

State-of-the-art unsupervised multilingual models (e.g., multilingual BERT) have been shown to generalize in a zero-shot cross-lingual setting. This generalization ability has been attributed to the use of a shared subword vocabulary and joint training across multiple languages…

arXiv:2502.12404 · Feb 2025

WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects (2025)

Daniel Deutsch, Eleftheria Briakou, Isaac Caswell et al.

As large language models (LLM) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24…

arXiv:2502.21228 · Feb 2025

ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer (2025)

Omer Goldman, Uri Shaham, Dan Malkin et al.

To achieve equitable performance across languages, large language models (LLMs) must be able to abstract knowledge beyond the language in which it was learnt. However, the current literature lacks reliable ways to measure LLMs' capability of such cross-lingual knowledge…

arXiv:2404.16816 · Apr 2024

IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages (2024)

Harman Singh, Nitish Gupta, Shikhar Bharadwaj et al.

As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 Billion people. To facilitate research on multilingual LLM…

arXiv:2104.12756 · Apr 2021

InfographicVQA (2021)

Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito et al.

Infographics are documents designed to effectively communicate information using a combination of textual, graphical and visual elements. In this work, we explore the automatic understanding of infographic images by using Visual Question Answering technique. To this end, we…

arXiv:2311.16502 · Nov 2023

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (2023)

Xiang Yue, Yuansheng Ni, Kai Zhang et al.

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and…

arXiv:2203.10244 · Mar 2022

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning (2022)

Ahmed Masry, Do Xuan Long, Jia Qing Tan et al.

Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing…

arXiv:2404.12390 · Apr 2024

BLINK: Multimodal Large Language Models Can See but Not Perceive (2024)

Xingyu Fu, Yushi Hu, Bangzheng Li et al.

We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence,…

arXiv:1810.12440 · Oct 2018

TallyQA: Answering Complex Counting Questions (2018)

Manoj Acharya, Kushal Kafle, Christopher Kanan

Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do…

arXiv:1908.02660 · Aug 2019

SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition (2019)

Kaiyu Yang, Olga Russakovsky, Jia Deng

Understanding the spatial relations between objects in images is a surprisingly challenging task. A chair may be "behind" a person even if it appears to the left of the person in the image (depending on which way the person is facing). Two students that appear close to each…

arXiv:2312.11805 · Dec 2023

Gemini: A Family of Highly Capable Multimodal Models (2023)

Gemini Team, Rohan Anil, Sebastian Borgeaud et al.

This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to…

Data sources: Venice API · HuggingFace · Wikipedia · arXiv — enrichment updated 5d ago