
Nemotron Embed VL 1B v2

private


Quick reference

Nemotron Embed VL 1B v2 — TLDR

🧠 Multimodal bi-encoder embedding model for document retrieval and RAG.
👁️ Embeds text, document images, or combined image-plus-text inputs.
🔧 Built on SigLIP 2 400M vision encoder plus Llama 3.2 1B.
📏 Roughly 1.23B parameters; evaluated up to 10,240-token context.
🎯 Outputs fixed-size dense vectors for nearest-neighbor search.
🏢 Part of NVIDIA NeMo Retriever microservice collection, commercially licensed.
📚 Evaluated across BEIR, MIRACL, MLQA and MLDR retrieval benchmarks.
🌐 Supports images containing text, tables, charts, and infographics.

💰 Pricing

$0.013

per 1M tokens
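As a rough illustration of what the listed rate means in practice, the sketch below converts a token count into an estimated cost. The function name and the corpus size are hypothetical, and the sketch assumes billing is strictly linear in tokens. How image inputs are counted toward billed tokens is not stated on this page.

```python
# Listed rate on this page: $0.013 per 1M tokens.
PRICE_PER_MILLION_TOKENS_USD = 0.013


def embedding_cost_usd(total_tokens: int) -> float:
    """Estimate the cost in USD of embedding `total_tokens` tokens.

    Assumes purely linear per-token billing (an assumption, not a
    documented billing rule).
    """
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


# Hypothetical corpus: 50,000 chunks averaging 400 tokens each.
corpus_tokens = 50_000 * 400
print(f"{corpus_tokens:,} tokens -> ${embedding_cost_usd(corpus_tokens):.2f}")
```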

📅 On Venice since

Apr 17, 2026


Provider

Nvidia

Nvidia Corporation is an American technology company founded in 1993 by Jensen Huang, Chris Malachowsky, and Curtis Priem, headquartered in Santa Clara, California. Long recognized as the dominant force in graphics processing units, Nvidia has expanded into a…


5 models on Venice

3 text · 1 embedding · 1 ASR

Since Oct 10, 2025



About this model

Nemotron Embed VL 1B v2 (officially llama-nemotron-embed-vl-1b-v2) is NVIDIA's multimodal embedding model built for question-answering retrieval over large corpora. It is a bi-encoder: it vectorizes "documents" supplied as text, document images, or combined image-plus-text, then matches them against a text query in a shared embedding space, which is the foundation of a retrieval-augmented-generation pipeline. Architecturally, it combines a SigLIP 2 400M vision encoder with a Llama 3.2 1B language model, for roughly 1.23B parameters in total. It was evaluated at context lengths up to 10,240 tokens, with each image split into tiles that consume about 256 visual tokens apiece.
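The matching step of a bi-encoder can be sketched in a few lines: documents and the query are embedded separately, and ranking reduces to a similarity comparison between vectors. The example below is a minimal sketch that uses cosine similarity over made-up 4-dimensional vectors. The document names, the vectors, and the choice of cosine similarity are illustrative assumptions, not output or documented behavior of the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, best match first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical, hand-written vectors standing in for real embeddings.
# In a real pipeline these would come from the embedding model, with
# page images and text chunks living in the same vector space.
doc_vecs = {
    "invoice_page.png": [0.9, 0.1, 0.0, 0.2],
    "quarterly_chart.png": [0.1, 0.8, 0.3, 0.0],
    "faq_text": [0.0, 0.2, 0.9, 0.1],
}
query_vec = [0.2, 0.9, 0.2, 0.0]  # stand-in for an embedded text query

ranking = rank_documents(query_vec, doc_vecs, top_k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because document vectors are computed once at indexing time, only the query needs to be embedded at search time. At larger scale, the brute-force loop would typically be replaced by an approximate nearest-neighbor index.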

The "VL" variant extends the text-only llama-nemotron-embed-1b-v2, which targets multilingual and cross-lingual retrieval with Matryoshka dimensions up to 2048, by adding native vision so that document pages need not be OCR-converted before indexing. Both share the same Llama 3.2 1B backbone, but the VL model preserves layout, tables, and charts directly from page images.
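Matryoshka-style embeddings are trained so that a leading slice of the vector remains usable on its own, trading some accuracy for less storage. A minimal sketch of how such a vector is commonly shortened follows: keep the first `dim` components, then re-normalize to unit length. The helper name and the toy vector are hypothetical. This page attributes Matryoshka dimensions to the text-only model, so whether the VL model supports the same truncation is an assumption to verify against the model card.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length.

    Only meaningful for models trained with a Matryoshka-style objective;
    truncating an ordinary embedding this way usually degrades quality.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return list(head)
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a full-size embedding.
full = [3.0, 4.0, 12.0, 0.0]
short = truncate_embedding(full, 2)
print(short)
```

Queries and documents need to be truncated to the same dimension before they are compared.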

According to NVIDIA's accompanying technical paper, the single-vector design requires substantially less storage than late-interaction alternatives sharing the same backbone: about 3.8 GB versus thousands of gigabytes. The paper reports that the model reaches an NDCG@10 of 48.69 on visual document retrieval, rising to 54.40 when the companion llama-nemotron-rerank-vl-1b-v2 cross-encoder reranks the top results.
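For readers unfamiliar with the metric, NDCG@10 scores how well the top ten results are ordered relative to an ideal ordering. Benchmarks usually report it multiplied by 100. The sketch below uses the common linear-gain formulation with hypothetical relevance labels. The paper's exact evaluation setup is not described on this page, so treat this as a general illustration of the metric rather than a reproduction of the reported numbers.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top result divides by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    `ranked_relevances` lists the relevance label of each judged document
    in the order the system returned it. For simplicity the ideal ordering
    is built from this same list, which assumes it covers every relevant
    document for the query.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical labels: the only relevant document was returned second.
score = ndcg_at_k([0, 1, 0, 0])
print(f"NDCG@10 = {score:.4f}")
```

A reranker improves this score when it moves relevant documents that the retriever already found closer to the top of the list.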

The model ships as a NeMo Retriever NIM and is part of NVIDIA's broader Nemotron ecosystem alongside text models like Nemotron Cascade 2 30B A3B and NVIDIA Nemotron 3 Nano 30B.


Sources

llama-nemotron-embed-vl-1b-v2 Model by NVIDIA (build.nvidia.com)

nvidia / llama-nemotron-embed-vl-1b-v2 (docs.api.nvidia.com)

Embedding Model Fine-Tuning Recipe - Nemotron (docs.nvidia.com)

Nemotron ColEmbed V2: Top-Performing Late Interaction embedding models for Visual Document Retrieval (arxiv.org)

nvidia/llama-nemotron-embed-vl-1b-v2 · Hugging Face (huggingface.co)

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Research & Papers

Primary reference paper for this model family, sourced from the HuggingFace model card.

arXiv 2501.14818 · Jan 2025

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models (2025)

Zhiqi Li, Guo Chen, Shilong Liu et al.

Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data…

Data sources: Venice API · HuggingFace · Wikipedia · arXiv