AlibabaAlibaba·💬 Text Generation

Qwen3 VL 235B

VisionFunction CallingWeb Searchfp8private
🧠 Try in Intelligence →Try on Venice.ai ↗
Quick reference
Qwen3 VL 235B — TLDR
  • - 🧠 Mixture-of-experts vision-language model: 235B total, 22B active parameters.
  • - 👁️ Vision, OCR, and multimodal reasoning across images and video.
  • - 📏 Native 256K context, reportedly expandable to 1M tokens.
  • - 🔧 Supports function calling and web search; Apache-2.0 licensed.
  • - ⚡ Served here in FP8 quantization for efficient deployment.
  • - 🆕 Qwen's most capable VL generation, per Alibaba's model card.
  • - 🎯 Visual agent skills: GUI operation, spatial grounding, visual coding.
  • - 🏢 Built by Alibaba's Qwen team.
💰 Pricing
$0.250 / $1.50
per 1M · input / output
📏 Context
256K tokens
📅 On Venice since
Jan 16, 2026
139 days ago
Provider

Alibaba Group is a Chinese multinational technology company founded in 1999 and headquartered in Hangzhou, Zhejiang. Originally built around e-commerce and cloud computing, Alibaba has become one of the most prolific contributors to open-weight AI research,…

Read full profile →
46 models on Venice
17 text · 16 video · 5 image · 4 inpaint · 2 embedding · 2 tts
Since Jan 11, 2025

About this model

Qwen3 VL 235B is the flagship vision-language model in Alibaba's Qwen3-VL line, using a mixture-of-experts design with 235 billion total parameters and roughly 22 billion activated per token. It combines text generation with visual perception, OCR, document parsing, and video understanding, and exposes function calling and web search. On this catalog it runs in FP8 quantization, with a 256K-token context window that Qwen documents as natively trained and extendable toward 1M tokens.

Alibaba positions Qwen3-VL as the most powerful vision-language model in the Qwen series to date, describing comprehensive upgrades over earlier generations: superior text understanding, deeper visual perception and reasoning, extended context length, stronger spatial and video-dynamics comprehension, and improved agent interaction. The official model card also cites broadened OCR coverage spanning 32 languages and robustness to low light, blur, and tilt, alongside 2D and 3D visual grounding for spatial reasoning.

The family ships in both dense and MoE architectures and in Instruct plus reasoning-focused Thinking editions, scaling from edge to cloud. Compared with the lighter MoE sibling Qwen3 VL 30B A3B, this 235B variant offers substantially more total capacity for demanding multimodal and document workloads.

Qwen3-VL targets agentic visual tasks, including operating PC and mobile interfaces and generating Draw.io, HTML, CSS, and JavaScript from images or videos. Released under Apache-2.0, the open weights are widely distributed through Hugging Face, making the model accessible for self-hosting and integration.

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Research & Papers

4 reference papers linked from the HuggingFace model card.

arXiv2505.09388May 2025

Qwen3 Technical Report(2025)

An Yang, Anfeng Li, Baosong Yang et al.

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert…

arXiv2502.13923Feb 2025

Qwen2.5-VL Technical Report(2025)

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the…

arXiv2409.12191Sep 2024

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution(2024)

Peng Wang, Shuai Bai, Sinan Tan et al.

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process…

arXiv2308.12966Aug 2023

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond(2023)

Jinze Bai, Shuai Bai, Shusheng Yang et al.

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual…

Data sources: Venice API · HuggingFace · Wikipedia · arXiv — enrichment updated 1d ago