About this model
Llama 3.2 3B is one of Meta's two lightweight, text-only Llama 3.2 models, alongside a 1B sibling. It is designed to run locally on select edge and mobile devices for tasks such as summarization, instruction following, rewriting, and knowledge retrieval. It has a 128K-token context window and uses grouped-query attention for efficient inference. Meta partnered with hardware vendors including Qualcomm, MediaTek, and Arm to optimize it for modern mobile SoCs.
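To make the link between grouped-query attention and the long context window concrete, the sketch below estimates key/value cache size and shows how query heads share KV heads. The layer count, head counts, and head dimension are assumptions for illustration and should be checked against the published model configuration. The helper names `kv_cache_bytes` and `kv_head_for_query_head` are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Rough KV cache size for one sequence.

    The factor 2 counts one key tensor and one value tensor per layer.
    bytes_per_value=2 assumes 16-bit storage (an assumption, not a requirement).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from under GQA."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Assumed architecture values for illustration only.
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 28, 24, 8, 128
CONTEXT = 128 * 1024  # 128K tokens

gqa_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Hypothetical baseline: every query head keeps its own key/value head.
mha_bytes = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, CONTEXT)

print(f"GQA cache: {gqa_bytes / 2**30:.1f} GiB")
print(f"Full multi-head cache: {mha_bytes / 2**30:.1f} GiB")
```

Under these assumed values, sharing each KV head across three query heads cuts the cache at full context to one third of the multi-head baseline, which is the kind of saving that matters on memory-constrained devices.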
Unlike earlier full-size Llama models, the 3B was not trained from scratch. It was produced by structured pruning, applied in a single shot, from Llama 3.1 8B, followed by knowledge distillation in which logits from the larger Llama 3.1 8B and 70B models served as token-level training targets. Meta describes the resulting 1B and 3B as the first highly capable lightweight Llama models that fit on devices efficiently.
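The idea of using teacher logits as token-level targets can be sketched as a per-token KL divergence between the teacher and student output distributions. This is a generic distillation loss in plain Python, not Meta's published training recipe. The function names, the temperature parameter, and the toy logits are assumptions for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-probabilities for one token position."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]


def token_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary at one token position."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


def distillation_loss(teacher_seq, student_seq, temperature=1.0):
    """Mean per-token KL across a sequence of logit vectors."""
    if len(teacher_seq) != len(student_seq):
        raise ValueError("teacher and student sequences must have equal length")
    losses = [token_kl(t, s, temperature) for t, s in zip(teacher_seq, student_seq)]
    return sum(losses) / len(losses)


# Toy example: 2 token positions, vocabulary of 3 (values are made up).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.5, 2.0]]
loss = distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two diverge. In practice a term like this is often combined with the ordinary next-token cross-entropy loss.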
Within this catalog's Llama family, the 3B sits at the efficiency end of the spectrum. The much larger Llama 3.3 70B targets higher-quality reasoning and production hosting, while Hermes 3 Llama 3.1 405B is a community fine-tune built on the Llama 3.1 405B base.
The model is openly released under Meta's Llama 3.2 community license, which permits commercial use subject to its terms. That makes it a candidate for embedded systems and agentic applications where compute and memory are tightly constrained.
This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.
Research & Papers
Two reference papers are linked from the Hugging Face model card.
The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink (2022)
David Patterson, Joseph Gonzalez, Urs Hölzle et al.
Machine Learning (ML) workloads have rapidly grown in importance, but raised concerns about their carbon footprint. Four best practices can reduce ML training energy by up to 100x and CO2 emissions up to 1000x. By following best practices, overall ML energy use (across research,…
SpinQuant: LLM quantization with learned rotations (2024)
Zechun Liu, Changsheng Zhao, Igor Fedorov et al.
Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or…
Data sources: Venice API · HuggingFace · Wikipedia · arXiv