MiMo-V2.5
About this model
MiMo-V2.5 is Xiaomi's native omnimodal model, designed to understand text, images, video, and audio within a single unified architecture. It uses a sparse Mixture-of-Experts backbone with 310B total parameters and roughly 15B active per token, and supports a context window of up to 1 million tokens. Xiaomi released the model in 2026 and open-sourced the weights and tokenizer, along with a separate Base checkpoint, under the MIT license on Hugging Face.
Beyond perception, the model is oriented toward agentic and developer workflows. Its documented capabilities include reasoning, function calling, web search, and coding, and it accepts audio natively as an input alongside text, images, and video. The weights are distributed in FP8 quantization, which lowers the memory footprint of serving the large MoE network.
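To give a rough sense of what FP8 distribution means for serving, the sketch below estimates weight storage from the parameter counts quoted above. It is a back-of-envelope calculation, not a measured figure. It assumes every weight is stored at one byte (FP8) or two bytes (BF16), which may not hold if some layers are kept at higher precision, and it ignores the KV cache, activations, and runtime overhead. The function names are illustrative only.

```python
# Back-of-envelope estimate: weights only, no KV cache, activations, or overhead.
# Assumption: all weights are stored at a uniform width per dtype.
TOTAL_PARAMS = 310e9   # total parameters, from the description above
ACTIVE_PARAMS = 15e9   # approximate active parameters per token

BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}  # storage width per weight


def weight_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def active_fraction(active: float, total: float) -> float:
    """Share of the total parameters that is used for a single token."""
    return active / total


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_gb(TOTAL_PARAMS, dtype):.0f} GB of weights")
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active share per token: ~{share:.1%}")
```

Under these assumptions the weights come to roughly 310 GB in FP8 versus 620 GB in BF16, and about 4.8% of the parameters are active for any one token. The sparse routing reduces compute per token, but all experts still have to be resident in memory, which is why the storage format matters for deployment.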
MiMo-V2.5 belongs to Xiaomi's broader MiMo series of open models, and an accompanying Base variant is published for further fine-tuning and research. As an omnimodal release with a long-context MoE design, it extends the family's focus toward unified multimodal understanding and agentic tool use rather than text-only generation. The catalog lists no sibling models for this entry, so the description draws on the model's own card and configuration rather than on head-to-head comparisons within the family.
This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies; verify critical details against the listed data sources.
Data sources: Venice API · HuggingFace · Wikipedia — enrichment updated 1d ago