About this model
Nemotron Embed VL 1B v2 (officially llama-nemotron-embed-vl-1b-v2) is NVIDIA's multimodal embedding model built for question-answering retrieval over large corpora. It is a bi-encoder that vectorizes "documents" supplied as text, document images, or combined image-plus-text, then matches them against a text query in a shared embedding space — the foundation of a retrieval-augmented-generation pipeline. Architecturally it combines a SigLIP 2 400M vision encoder with a Llama 3.2 1B language model, totaling roughly 1.23B parameters, and was evaluated at context lengths up to 10,240 tokens with each image split into tiles consuming about 256 visual tokens apiece.
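The matching step in a shared embedding space can be sketched as follows. This is a minimal illustration, not NVIDIA's implementation: the three-value vectors are made-up placeholders standing in for model outputs, the names cosine and rank_documents are hypothetical, and cosine similarity is assumed as the scoring function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder embeddings (assumed values; a real model emits far larger vectors).
# Each document modality ends up as one vector in the same space as the query.
doc_vecs = {
    "page_image": [0.9, 0.1, 0.0],
    "text_chunk": [0.2, 0.8, 0.1],
    "image_plus_text": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # placeholder text-query embedding

ranking = rank_documents(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because documents are embedded independently of the query, their vectors can be computed once at indexing time and only the query needs to be embedded at search time.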
The "VL" variant extends the text-only llama-nemotron-embed-1b-v2, which targets multilingual and cross-lingual retrieval with Matryoshka dimensions up to 2048, by adding native vision so that document pages need not be OCR-converted before indexing. The two models share the Llama 3.2 1B backbone, but the VL model works directly on page images, preserving layout, tables, and charts.
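Matryoshka-style embeddings are typically consumed by keeping only the leading dimensions of a vector and re-normalizing the result. A minimal sketch of that idea follows. The helper truncate_embedding is hypothetical, the eight-value vector is a made-up stand-in for a 2048-dimensional output, and which truncation sizes either model actually supports should be checked against its model card.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` values and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Made-up full-size embedding (assumption: stands in for a 2048-d vector).
full = [3.0, 4.0, 0.5, -0.5, 0.1, 0.2, -0.1, 0.0]

short = truncate_embedding(full, 2)  # smaller vector, cheaper to store and compare
print(short)
```

Shorter vectors trade some retrieval quality for a smaller index, so the truncation size is usually chosen by measuring accuracy on a held-out query set.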
According to NVIDIA's accompanying technical paper, the single-vector design requires substantially less storage than late-interaction alternatives built on the same backbone: about 3.8 GB versus thousands of gigabytes. The paper reports an NDCG@10 of 48.69 on visual document retrieval, rising to 54.40 when the companion cross-encoder, llama-nemotron-rerank-vl-1b-v2, reranks the top results.
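Both quantities in the paragraph above can be made concrete with a small sketch. The first part estimates raw vector storage for one vector per page versus one vector per token. The second computes NDCG@10 in its linear-gain form. Everything numeric here is an assumption for illustration (corpus size, 1,024 token vectors per page, 2048 dimensions, 2 bytes per value), and it is not a reproduction of the paper's figures or of its exact evaluation protocol. The function names are hypothetical.

```python
import math

def index_size_gb(num_docs, vectors_per_doc, dim, bytes_per_value):
    """Raw vector storage in decimal gigabytes, ignoring index overhead."""
    return num_docs * vectors_per_doc * dim * bytes_per_value / 1e9

# Assumed corpus and vector shapes, chosen only to show the scaling.
NUM_PAGES = 1_000_000
single_vector = index_size_gb(NUM_PAGES, 1, 2048, 2)         # one vector per page
late_interaction = index_size_gb(NUM_PAGES, 1024, 2048, 2)   # one vector per token

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal ordering of the same judgments.

    Simplification: the ideal ranking is built only from the relevance
    labels passed in, not from every relevant document in the corpus.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(f"single-vector index: {single_vector:.3f} GB")
print(f"late-interaction index: {late_interaction:.3f} GB")
print(f"NDCG@10 for [1, 0, 1]: {ndcg_at_k([1, 0, 1]):.4f}")
```

The storage gap grows linearly with the number of vectors kept per page, which is why a single-vector index stays small. Some toolkits use an exponential gain (2^rel - 1) in DCG instead of the linear gain shown, and benchmark tables often report NDCG on a 0 to 100 scale rather than 0 to 1.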
The model ships as a NeMo Retriever NIM and is part of NVIDIA's broader Nemotron ecosystem alongside text models like Nemotron Cascade 2 30B A3B and NVIDIA Nemotron 3 Nano 30B.
This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies; verify critical details against the sources listed below.
Research & Papers
Primary reference paper for this model family, sourced from the HuggingFace model card.
Data sources: Venice API · HuggingFace · Wikipedia · arXiv