Nvidia · 💬 Text Generation

NVIDIA Nemotron 3 Ultra

Reasoning · Function Calling · Web Search · fp8 · private

Quick reference

NVIDIA Nemotron 3 Ultra 550B — TLDR

🧠 Frontier open reasoning model built for long-running agents.
📏 550B total parameters, 55B active per token.
🔧 Hybrid Mamba-Transformer Mixture-of-Experts architecture.
📚 Supports up to 1M-token context for sustained agent sessions.
⚡ NVIDIA reports up to 5x faster inference, ~30% lower agentic cost.
🎯 Toggleable reasoning modes with an optional token budget.
🌐 Multilingual, with broad programming-language coverage.
🏢 Open weights for commercial use; NVFP4, FP8 and BF16 variants.

💰 Pricing

$0.625 / $3.13 per 1M tokens (input / output)
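As a rough illustration of how the listed rates translate into a per-request cost, the sketch below multiplies token counts by the per-million prices. Only the two rates come from this page; the helper name `estimate_cost_usd` and the example token counts are made up for illustration, and actual billing may round or meter differently.

```python
from decimal import Decimal

# Rates listed on this page, in USD per 1M tokens.
INPUT_USD_PER_M = Decimal("0.625")
OUTPUT_USD_PER_M = Decimal("3.13")
ONE_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate request cost from token counts (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        Decimal(input_tokens) * INPUT_USD_PER_M
        + Decimal(output_tokens) * OUTPUT_USD_PER_M
    )
    return total / ONE_MILLION


if __name__ == "__main__":
    # Illustrative request: a 200,000-token prompt and a 4,000-token reply.
    print(estimate_cost_usd(200_000, 4_000))
```

Decimal arithmetic is used here so that small per-request amounts do not pick up binary floating-point noise when summed over many calls.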

📏 Context

256K tokens

📅 On Venice since

Jun 4, 2026

46 days ago

Provider

Nvidia

Nvidia Corporation is an American technology company founded in 1993 by Jensen Huang, Chris Malachowsky, and Curtis Priem, headquartered in Santa Clara, California. Long recognized as the dominant force in graphics processing units, Nvidia has expanded into a…

5 models on Venice

3 text · 1 embedding · 1 asr

Since Oct 10, 2025

About this model

NVIDIA Nemotron 3 Ultra is the flagship, frontier-scale tier of NVIDIA's Nemotron 3 open-model family, released in 2026 and built specifically for orchestration, coding agents, deep research, and complex enterprise workflows. It is a 550B-parameter Mixture-of-Experts model that activates only 55B parameters per token, keeping throughput high even at long context lengths.

Architecturally, it pairs a hybrid Mamba-Transformer backbone with Mixture-of-Experts routing, and ships with toggleable reasoning modes plus an optional token budget for controlling reasoning effort. NVIDIA's technical blog describes the model as engineered for faster, more efficient reasoning in long-running agents, and reports up to 5x faster inference and up to about 30% lower cost for agentic workloads versus prior approaches. Weights are published in NVFP4, FP8 and BF16 formats for commercial use.
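To give a sense of scale for the three weight formats, the sketch below estimates raw weight size from the parameter counts on this page. It assumes nominal widths of 4, 8 and 16 bits per parameter and ignores quantization scaling factors, any layers kept at higher precision, and file metadata, so real checkpoint sizes will differ. The helper name is illustrative.

```python
# Nominal storage width per parameter, in bits. Assumption: scaling factors,
# higher-precision layers, and file metadata are ignored.
BITS_PER_PARAM = {"NVFP4": 4, "FP8": 8, "BF16": 16}

TOTAL_PARAMS = 550_000_000_000   # total parameters, from this page
ACTIVE_PARAMS = 55_000_000_000   # active parameters per token, from this page


def weight_gigabytes(num_params: int, fmt: str) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9


if __name__ == "__main__":
    print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
    for fmt in BITS_PER_PARAM:
        total = weight_gigabytes(TOTAL_PARAMS, fmt)
        active = weight_gigabytes(ACTIVE_PARAMS, fmt)
        print(f"{fmt}: ~{total:,.0f} GB total, ~{active:,.1f} GB active per token")
```

In a Mixture-of-Experts model, all experts still have to be stored even though only a fraction is used per token, so the total figure drives memory capacity while the active figure is closer to what drives per-token compute.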

Within the family, Ultra is the heavyweight reasoning tier above smaller, higher-volume siblings like NVIDIA Nemotron 3 Nano 30B and Nemotron Cascade 2 30B A3B, and complements the retrieval-focused Nemotron Embed VL 1B v2. NVIDIA positions Ultra for the harder calls in an agent pipeline, with smaller models handling routine execution. The published context window reaches up to 1M tokens; this catalog entry lists a 256K window with FP8 quantization.
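Because the published and listed context windows differ, a caller may want to check a request against the smaller limit before sending it. The sketch below is one minimal way to do that. It assumes "256K" and "1M" mean 256,000 and 1,000,000 tokens, and that the reserved output shares the window with the prompt; neither detail is confirmed on this page, and the function name is hypothetical.

```python
# Context limits mentioned on this page. Assumption: "256K" and "1M" are
# treated as 256,000 and 1,000,000 tokens; the exact limits may differ.
CATALOG_CONTEXT_TOKENS = 256_000
PUBLISHED_CONTEXT_TOKENS = 1_000_000


def fits_context(
    prompt_tokens: int,
    max_output_tokens: int,
    context_limit: int = CATALOG_CONTEXT_TOKENS,
) -> bool:
    """Return True if the prompt plus reserved output fits in the window.

    Assumption: prompt and output tokens count against the same limit.
    """
    if prompt_tokens < 0 or max_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_output_tokens <= context_limit


if __name__ == "__main__":
    # A 250,000-token prompt with 8,000 tokens reserved for the reply.
    print(fits_context(250_000, 8_000))
    print(fits_context(250_000, 8_000, PUBLISHED_CONTEXT_TOKENS))
```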

Sources

NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 (build.nvidia.com)

NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents | NVIDIA Technical Blog (developer.nvidia.com)

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face (huggingface.co)

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Data sources: Venice API · HuggingFace · Wikipedia — enrichment updated 3d ago