GoogleGoogle·💬 Text Generation

Google Gemma 3 27B Instruct

VisionFunction CallingWeb Searchfp8private
🧠 Try in Intelligence →Try on Venice.ai ↗
Quick reference
Google Gemma 3 27B Instruct — TLDR
  • 🆕 Adds multimodality: vision-language input, text output
  • 📏 128K-token context window, output in 140+ languages
  • 🧠 Improved math, reasoning, coding over Gemma 2
  • 🔧 Supports structured outputs and function calling
  • 👁️ Integrated SigLIP-based frozen vision encoder
  • 🏢 Google's open-weight model, successor to Gemma 2
  • 📚 Trained on TPUs with distillation and RL feedback
  • ⚡ Designed to run on a single GPU/TPU
💰 Pricing
$0.120 / $0.200
per 1M · input / output
📏 Context
198K tokens
📅 On Venice since
Nov 4, 2025
212 days ago
Provider

Google is an American multinational technology corporation and one of the world's most valuable brands. A subsidiary of parent company Alphabet Inc., Google operates across search, cloud computing, consumer electronics, and artificial intelligence. Its…

Read full profile →
25 models on Venice
10 text · 8 video · 2 image · 2 inpaint · 1 music · 1 embedding · 1 tts
Since Oct 15, 2024

About this model

Gemma 3 27B Instruct is the largest instruction-tuned model in Google's Gemma 3 open-weight family, released alongside smaller siblings. Positioned as the direct successor to Gemma 2, it introduces multimodality for the first time in the line: an integrated SigLIP-based vision encoder lets it accept image inputs and produce text outputs, while earlier Gemma generations were text-only. Google describes Gemma 3 as its most capable model that can run on a single GPU or TPU.

Compared with its predecessor, Gemma 3 expands the context window to 128K tokens, adopts a tokenizer with stronger multilingual coverage across 140-plus languages, and adds structured outputs and function calling. Google attributes improved math, coding, and instruction-following to a training pipeline using distillation plus reinforcement learning from human, machine, and execution feedback, trained on TPUs.

The model handles tasks such as analyzing images, answering visual questions, comparing images, and reading text within an image. Google also notes that quantization-aware training lets the 27B model run on more modest hardware.

Within the broader Gemma line, this release was later succeeded by Google Gemma 4 31B Instruct and the mixture-style Google Gemma 4 26B A4B Instruct, which continue the family's open-weight approach.

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Research & Papers

28 reference papers linked from the HuggingFace model card.

arXiv1905.07830May 2019

HellaSwag: Can a Machine Really Finish Your Sentence?(2019)

Rowan Zellers, Ari Holtzman, Yonatan Bisk et al.

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT,…

arXiv1905.10044May 2019

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions(2019)

Christopher Clark, Kenton Lee, Ming-Wei Chang et al.

In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often…

arXiv1911.11641Nov 2019

PIQA: Reasoning about Physical Commonsense in Natural Language(2019)

Yonatan Bisk, Rowan Zellers, Ronan Le Bras et al.

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question…

arXiv1904.09728Apr 2019

SocialIQA: Commonsense Reasoning about Social Interactions(2019)

Maarten Sap, Hannah Rashkin, Derek Chen et al.

We introduce Social IQa, the first largescale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell…

arXiv1705.03551May 2017

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension(2017)

Mandar Joshi, Eunsol Choi, Daniel S. Weld et al.

We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that…

arXiv1911.01547Nov 2019

On the Measure of Intelligence(2019)

François Chollet

To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as…

arXiv1907.10641Jul 2019

WinoGrande: An Adversarial Winograd Schema Challenge at Scale(2019)

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula et al.

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word…

arXiv1903.00161Mar 2019

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs(2019)

Dheeru Dua, Yizhong Wang, Pradeep Dasigi et al.

Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new…

arXiv2009.03300Sep 2020

Measuring Massive Multitask Language Understanding(2020)

Dan Hendrycks, Collin Burns, Steven Basart et al.

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving…

arXiv2304.06364Apr 2023

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models(2023)

Wanjun Zhong, Ruixiang Cui, Yiduo Guo et al.

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately…

arXiv2103.03874Mar 2021

Measuring Mathematical Problem Solving With the MATH Dataset(2021)

Dan Hendrycks, Collin Burns, Saurav Kadavath et al.

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each…

arXiv2110.14168Oct 2021

Training Verifiers to Solve Math Word Problems(2021)

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian et al.

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality…

arXiv2311.12022Nov 2023

GPQA: A Graduate-Level Google-Proof Q&A Benchmark(2023)

David Rein, Betty Li Hou, Asa Cooper Stickland et al.

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach…

arXiv2108.07732Aug 2021

Program Synthesis with Large Language Models(2021)

Jacob Austin, Augustus Odena, Maxwell Nye et al.

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in…

arXiv2107.03374Jul 2021

Evaluating Large Language Models Trained on Code(2021)

Mark Chen, Jerry Tworek, Heewoo Jun et al.

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional…

arXiv2210.03057Oct 2022

Language Models are Multilingual Chain-of-Thought Reasoners(2022)

Freda Shi, Mirac Suzgun, Markus Freitag et al.

We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically…

arXiv2106.03193Jun 2021

The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation(2021)

Naman Goyal, Cynthia Gao, Vishrav Chaudhary et al.

One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low…

arXiv1910.11856Oct 2019

On the Cross-lingual Transferability of Monolingual Representations(2019)

Mikel Artetxe, Sebastian Ruder, Dani Yogatama

State-of-the-art unsupervised multilingual models (e.g., multilingual BERT) have been shown to generalize in a zero-shot cross-lingual setting. This generalization ability has been attributed to the use of a shared subword vocabulary and joint training across multiple languages…

arXiv2502.12404Feb 2025

WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects(2025)

Daniel Deutsch, Eleftheria Briakou, Isaac Caswell et al.

As large language models (LLM) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24…

arXiv2502.21228Feb 2025

ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer(2025)

Omer Goldman, Uri Shaham, Dan Malkin et al.

To achieve equitable performance across languages, large language models (LLMs) must be able to abstract knowledge beyond the language in which it was learnt. However, the current literature lacks reliable ways to measure LLMs' capability of such cross-lingual knowledge…

arXiv2404.16816Apr 2024

IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages(2024)

Harman Singh, Nitish Gupta, Shikhar Bharadwaj et al.

As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 Billion people. To facilitate research on multilingual LLM…

arXiv2104.12756Apr 2021

InfographicVQA(2021)

Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito et al.

Infographics are documents designed to effectively communicate information using a combination of textual, graphical and visual elements. In this work, we explore the automatic understanding of infographic images by using Visual Question Answering technique.To this end, we…

arXiv2311.16502Nov 2023

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI(2023)

Xiang Yue, Yuansheng Ni, Kai Zhang et al.

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and…

arXiv2203.10244Mar 2022

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning(2022)

Ahmed Masry, Do Xuan Long, Jia Qing Tan et al.

Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing…

arXiv2404.12390Apr 2024

BLINK: Multimodal Large Language Models Can See but Not Perceive(2024)

Xingyu Fu, Yushi Hu, Bangzheng Li et al.

We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence,…

arXiv1810.12440Oct 2018

TallyQA: Answering Complex Counting Questions(2018)

Manoj Acharya, Kushal Kafle, Christopher Kanan

Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do…

arXiv1908.02660Aug 2019

SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition(2019)

Kaiyu Yang, Olga Russakovsky, Jia Deng

Understanding the spatial relations between objects in images is a surprisingly challenging task. A chair may be "behind" a person even if it appears to the left of the person in the image (depending on which way the person is facing). Two students that appear close to each…

arXiv2312.11805Dec 2023

Gemini: A Family of Highly Capable Multimodal Models(2023)

Gemini Team, Rohan Anil, Sebastian Borgeaud et al.

This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to…

Data sources: Venice API · HuggingFace · Wikipedia · arXiv — enrichment updated 1d ago