About this model
Multilingual E5 Large Instruct is a text-embedding model published by intfloat (researcher Liang Wang and collaborators) and documented in the "Multilingual E5 Text Embeddings" technical report. It uses an XLM-RoBERTa-large backbone with 24 layers and about 560 million parameters, producing 1024-dimensional dense vectors, and supports roughly 100 languages for tasks such as multilingual retrieval, semantic similarity, clustering, and classification.
The defining change relative to the base multilingual-e5-large model is instruction tuning. According to the model card, each query is paired with a one-sentence natural-language instruction describing the task, letting one model adapt its embeddings to different scenarios without retraining; instructions are added only to the query side, not to documents. This contrasts with the earlier E5 approach of fixed "query:" and "passage:" prefixes used to distinguish input types.
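As a rough illustration of the query-side instruction format, the sketch below builds model inputs in plain Python. It assumes the `Instruct: ...\nQuery: ...` template shown on the HuggingFace model card; the helper names `format_query` and `build_inputs` and the sample task string are hypothetical, so confirm the exact template against the card before relying on it.

```python
from typing import List


def format_query(task_description: str, query: str) -> str:
    """Prefix a query with a one-sentence task instruction.

    Assumed template (from the model card): "Instruct: <task>\nQuery: <query>".
    """
    return f"Instruct: {task_description}\nQuery: {query}"


def build_inputs(task_description: str, queries: List[str], documents: List[str]) -> List[str]:
    """Return the texts to embed: instructed queries first, then raw documents.

    Documents are passed through unchanged, because the instruction is
    applied only on the query side.
    """
    formatted_queries = [format_query(task_description, q) for q in queries]
    return formatted_queries + list(documents)


# Hypothetical task description and sample texts, for illustration only.
TASK = "Given a web search query, retrieve relevant passages that answer the query"
inputs = build_inputs(
    TASK,
    queries=["how much protein should a female eat"],
    documents=["Protein requirements vary with age, body weight, and activity level."],
)
```

The resulting list can be passed to whichever embedding client is in use; changing only `TASK` is what adapts the same model to a different scenario.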
The authors evaluate the model family on the MTEB benchmark suite in the technical report; specific scores should be checked against that report directly.
Practically, the model is compact at around 0.56 GB and is distributed openly under the MIT license, making it straightforward to self-host or serve through third-party inference APIs. Its cosine-similarity scores tend to cluster between roughly 0.7 and 1.0 because training used a low InfoNCE temperature, so the relative ordering of scores is more informative than their absolute values.
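To see why a low InfoNCE temperature goes together with a narrow score band, the sketch below applies a temperature-scaled softmax to a few similarity scores. The temperature of 0.01 is the value the model card FAQ reports, treated here as an assumption; the scores are made up for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax_with_temperature(scores: Sequence[float], temperature: float) -> List[float]:
    """Softmax over scores / temperature, as used inside an InfoNCE-style loss."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def rank_by_score(scores: Sequence[float]) -> List[int]:
    """Indices of candidates sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up similarities that all sit inside the 0.7-1.0 band.
scores = [0.90, 0.85, 0.80]

# With an assumed temperature of 0.01, a 0.05 gap already yields a
# near one-hot distribution, so training never needs large gaps.
sharp = softmax_with_temperature(scores, temperature=0.01)

# With temperature 1.0 the same gaps barely separate the candidates.
flat = softmax_with_temperature(scores, temperature=1.0)

ranking = rank_by_score(scores)
```

In practice this suggests ranking candidates by score, as `rank_by_score` does, and calibrating any absolute threshold on in-domain data rather than reusing a cutoff tuned for another embedding model.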
This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.
Research & Papers
4 reference papers linked from the HuggingFace model card.
Multilingual E5 Text Embeddings: A Technical Report (2024)
Liang Wang, Nan Yang, Xiaolong Huang et al.
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between the inference…
Improving Text Embeddings with Large Language Models (2023)
Liang Wang, Nan Yang, Xiaolong Huang et al.
In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text…
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (2021)
Nandan Thakur, Nils Reimers, Andreas Rücklé et al.
Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly…
MTEB: Massive Text Embedding Benchmark (2022)
Niklas Muennighoff, Nouamane Tazi, Loïc Magne et al.
Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like…
Data sources: Venice API · HuggingFace · Wikipedia · arXiv