Meta·💬 Text Generation

Llama 3.2 3B

Function Calling · Web Search · fp16 · private

Quick reference

Llama 3.2 3B — TLDR

- 📏 Compact 3B text model with 128K-token context window.
- ⚡ Built for low-latency, on-device, edge inference.
- 🔧 Created via structured pruning and distillation from Llama 3.1 8B.
- 🧠 Pre-trained with logits from Llama 3.1 8B and 70B teachers.
- 🎯 Supports function-calling/tool use and web search workloads.
- 🌐 Multilingual: English, German, French, Hindi and more.
- 🏢 Released by Meta on September 25, 2024.
- 🔒 Open weights under the Llama 3.2 community license.

💰 Pricing

$0.150 / $0.600

per 1M · input / output
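As a rough illustration of how the per-token rates above turn into a per-request charge, here is a minimal sketch. The function name `estimate_cost_usd` and the example token counts are hypothetical; only the two rates come from this page, and actual billing may round or meter differently.

```python
# Rates copied from the pricing card above (USD per 1M tokens).
INPUT_USD_PER_MILLION = 0.150
OUTPUT_USD_PER_MILLION = 0.600


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the charge in USD for one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total_micro_usd = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total_micro_usd / 1_000_000


if __name__ == "__main__":
    # Example: a 100K-token prompt with a 2K-token reply.
    print(f"${estimate_cost_usd(100_000, 2_000):.4f}")
```

Because output tokens cost four times as much as input tokens at these rates, long completions dominate the bill more quickly than long prompts.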

📏 Context

128K tokens

📅 On Venice since

Oct 3, 2024

608 days ago

About this model

Llama 3.2 3B is one of Meta's two lightweight, text-only Llama 3.2 models, alongside a 1B sibling. It is designed to fit on select edge and mobile devices and to run tasks such as summarization, instruction following, rewriting, and knowledge retrieval locally. It has a 128K-token context window and uses grouped-query attention, in which groups of query heads share key/value heads, shrinking the KV cache and making inference more efficient. Meta partnered with hardware vendors including Qualcomm, MediaTek, and Arm to optimize it for modern mobile SoCs.
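To make the memory effect of grouped-query attention concrete, the sketch below estimates KV-cache size at full context. The architecture numbers (28 layers, 24 query heads, 8 key/value heads, head dimension 128) are assumptions not stated on this page and should be checked against the model's published configuration; the 2-byte value size corresponds to the fp16 tag above.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed architecture values (verify against the model config).
N_LAYERS = 28
N_QUERY_HEADS = 24
N_KV_HEADS = 8
HEAD_DIM = 128
CONTEXT = 128 * 1024  # 128K tokens

GIB = 1024 ** 3
gqa = kv_cache_bytes(CONTEXT, N_LAYERS, N_KV_HEADS, HEAD_DIM)
# Hypothetical baseline: every query head keeps its own key/value head.
mha = kv_cache_bytes(CONTEXT, N_LAYERS, N_QUERY_HEADS, HEAD_DIM)

if __name__ == "__main__":
    print(f"GQA cache: {gqa / GIB:.1f} GiB")
    print(f"MHA cache: {mha / GIB:.1f} GiB")
    print(f"Reduction: {mha / gqa:.0f}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads, which is the main reason grouped-query attention matters for long contexts on memory-constrained devices.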

Unlike earlier full-size Llama models, the 3B was not trained from scratch. It was produced through structured one-shot pruning of Llama 3.1 8B, followed by knowledge distillation in which logits from the larger Llama 3.1 8B and 70B models served as token-level training targets. Meta describes the 1B and 3B as the first highly capable lightweight Llama models that fit on devices efficiently.
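The idea of using teacher logits as token-level targets can be sketched as a KL-divergence loss between teacher and student distributions at each token position. This is a generic textbook formulation in plain Python, not Meta's published recipe; the function names and the `temperature` parameter are illustrative assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) across token positions.

    Both arguments are lists of per-token logit vectors of equal shape.
    """
    if len(student_logits) != len(teacher_logits) or not student_logits:
        raise ValueError("need the same non-zero number of token positions")
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(t_row, temperature)  # teacher distribution
        log_q = log_softmax(s_row, temperature)  # student distribution
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(student_logits)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    student = [[1.5, 0.7, -0.5], [0.0, 0.0, 2.0]]
    print(distillation_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it pushes the smaller model toward the teacher's full output distribution rather than a single correct token.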

Within this catalog's Llama family, the 3B sits at the efficiency end of the spectrum. The much larger Llama 3.3 70B targets higher-quality reasoning and production hosting, while Hermes 3 Llama 3.1 405b is a community fine-tune built atop the 405B base.

The model's weights are openly released under Meta's Llama 3.2 community license, which permits commercial use subject to the license's terms. That makes it a candidate for embedded systems and agentic applications where compute and memory are tightly constrained.

🤗 View model card on HuggingFace ↗ · View source on GitHub ↗

Sources

Introducing Meta Llama 3: The most capable openly available LLM to date (ai.meta.com)

llama-3.2-3b-instruct Model by Meta (build.nvidia.com)

meta-llama/Llama-3.2-3B-Instruct · Hugging Face (huggingface.co)

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Research & Papers

2 reference papers linked from the HuggingFace model card.

arXiv 2204.05149 · Apr 2022

The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink (2022)

David Patterson, Joseph Gonzalez, Urs Hölzle et al.

Machine Learning (ML) workloads have rapidly grown in importance, but raised concerns about their carbon footprint. Four best practices can reduce ML training energy by up to 100x and CO2 emissions up to 1000x. By following best practices, overall ML energy use (across research,…

arXiv 2405.16406 · May 2024

SpinQuant: LLM quantization with learned rotations (2024)

Zechun Liu, Changsheng Zhao, Igor Fedorov et al.

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or…

Data sources: Venice API · HuggingFace · Wikipedia · arXiv — enrichment updated 1d ago