AlibabaAlibaba·💬 Text Generation

Qwen3 VL 30B A3B

VisionFunction CallingWeb SearchE2EEprivate
🧠 Try in Intelligence →Try on Venice.ai ↗
Quick reference
Qwen3 VL 30B A3B — TLDR
  • - 👁️ Multimodal model unifying text, image, and video understanding.
  • - 🧠 Mixture-of-experts: 30B total parameters, roughly 3B active.
  • - 📏 128K-token context window for long documents and media.
  • - 🔒 Runs in a Trusted Execution Environment with hardware attestation.
  • - 🔧 Supports function calling and web search.
  • - 🏢 Built by Alibaba's Qwen team; Apache-2.0 licensed.
  • - ⚡ Sparse activation keeps inference efficient versus dense peers.
💰 Pricing
$0.250 / $0.900
per 1M · input / output
📏 Context
128K tokens
📅 On Venice since
Mar 18, 2026
77 days ago
Provider

Alibaba Group is a Chinese multinational technology company founded in 1999 and headquartered in Hangzhou, Zhejiang. Originally built around e-commerce and cloud computing, Alibaba has become one of the most prolific contributors to open-weight AI research,…

Read full profile →
46 models on Venice
17 text · 16 video · 5 image · 4 inpaint · 2 embedding · 2 tts
Since Jan 11, 2025

About this model

Qwen3 VL 30B A3B is Alibaba's vision-language model from the Qwen3-VL series, offered here inside a Trusted Execution Environment so the deployment can be independently verified via hardware attestation. It accepts images, text, and video, outputting text for tasks like multi-image reasoning, document understanding, and grounded multimodal dialogue. The "A3B" denotes a Mixture-of-Experts design where only about 3B of the 30B parameters activate per token, favoring efficient inference.

Within the Qwen3-VL family, the larger sibling Qwen3 VL 235B scales the same architecture to 235B total parameters with more active experts, while this 30B variant targets lighter deployment with a 128K context window. Compared with the text-only Qwen3 30B A3B, which shares the same MoE backbone and active-parameter budget, this VL edition adds a vision encoder and video processing, letting users configure separate pixel budgets for image and video inputs.

The model is distributed openly on Hugging Face under Apache-2.0, and runs on common serving stacks such as vLLM. Here it is paired with capabilities including function calling, web search, and end-to-end-encrypted, attestable execution. For specific benchmark figures, consult Alibaba's official Qwen3-VL materials, since independently verified scores for this exact configuration are not reproduced above.

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Research & Papers

4 reference papers linked from the HuggingFace model card.

arXiv2505.09388May 2025

Qwen3 Technical Report(2025)

An Yang, Anfeng Li, Baosong Yang et al.

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert…

arXiv2502.13923Feb 2025

Qwen2.5-VL Technical Report(2025)

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the…

arXiv2409.12191Sep 2024

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution(2024)

Peng Wang, Shuai Bai, Sinan Tan et al.

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process…

arXiv2308.12966Aug 2023

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond(2023)

Jinze Bai, Shuai Bai, Shusheng Yang et al.

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual…

Data sources: Venice API · HuggingFace · Wikipedia · arXiv — enrichment updated 1d ago