Alibaba·💬 Text Generation

Qwen3 VL 30B A3B🔒Private

VisionFunction CallingWeb SearchE2EEprivate

🧠 Try in Intelligence →

Try on Venice.ai ↗

Quick reference

Qwen3 VL 30B A3B — TLDR

- 🧠 Mixture-of-experts vision-language model with about 3B active parameters.
- 👁️ Unifies text generation with image and video understanding.
- 📏 Serves a 128K-token context window for long multimodal inputs.
- 🔒 Runs inside a Trusted Execution Environment with hardware attestation.
- 🔧 Supports function calling and web search.
- 🌐 Apache-2.0 licensed, released by Alibaba's Qwen team.
- 🏢 Part of the broad Qwen3 multimodal lineup from Alibaba Cloud.

💰 Pricing

$0.250 / $0.900

per 1M · input / output

📏 Context

128K tokens

📅 On Venice since

Mar 18, 2026

123 days ago

Provider

Alibaba

Alibaba Group is a Chinese multinational technology company founded in 1999 and headquartered in Hangzhou, Zhejiang. Originally built around e-commerce and cloud computing, Alibaba has become one of the most prolific contributors to open-weight AI research,…

Read full profile →

51 models on Venice

20 video · 18 text · 5 image · 4 inpaint · 2 embedding · 2 tts

Since Jan 11, 2025

Wikipedia ↗Official site ↗

See 50 other models from Alibaba →

About this model

Qwen3 VL 30B A3B is the multimodal branch of Alibaba's Qwen3 family, pairing a language backbone with a vision encoder so a single model can read text, images, and video. Its name reflects a mixture-of-experts design with roughly 3B parameters active out of 30B total, a layout the Qwen team uses across the Qwen3 series to keep inference efficient relative to capacity. This Venice deployment adds a privacy layer: the model runs inside a Trusted Execution Environment and exposes hardware attestation evidence for independent verification.

Compared with its same-family relatives, this checkpoint sits below the larger Qwen3 VL 235B, which applies the same vision-language approach at far greater scale. It also extends the text-only Qwen3 30B A3B: where that sibling handles language alone, the VL variant grafts on visual perception while retaining the comparable 30B-A3B architecture and a 128K context window.

Per the catalog configuration, the model supports vision input, function calling, and web search, making it suitable for document understanding, multi-image tasks, and tool-using agent workflows. As with any vision-language system, real-world accuracy on your specific images and documents is best confirmed through direct testing.

🤗View model card on HuggingFace ↗View source on GitHub ↗

Sources

Qwen3: Think Deeper, Act Faster | Qwenqwenlm.github.io ↗

Qwen/Qwen3-Omni-30B-A3B-Instruct · Hugging Facehuggingface.co ↗

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Research & Papers

4 reference papers linked from the HuggingFace model card.

arXiv2505.09388May 2025

Qwen3 Technical Report(2025)

An Yang, Anfeng Li, Baosong Yang et al.

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert…

arXiv2502.13923Feb 2025

Qwen2.5-VL Technical Report(2025)

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the…

arXiv2409.12191Sep 2024

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution(2024)

Peng Wang, Shuai Bai, Sinan Tan et al.

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process…

arXiv2308.12966Aug 2023

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond(2023)

Jinze Bai, Shuai Bai, Shusheng Yang et al.

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual…

Data sources: Venice API · HuggingFace · Wikipedia · arXiv — enrichment updated 4d ago