Z.ai·💬 Text Generation

GLM 4.7 Flash

ReasoningFunction CallingWeb Searchfp8private

🧠 Try in Intelligence →

Try on Venice.ai ↗

Quick reference

GLM 4.7 Flash — TLDR

🆕 Speed-optimized variant of GLM-4.7, released January 2026.
🧠 30B-A3B mixture-of-experts with roughly 3B active parameters.
📏 Roughly 128K-token context window for long inputs.
🔧 Supports function calling and web search.
⚡ Optimized for fast, low-latency inference.
🔒 Open weights under the permissive MIT license.
📚 Targets lightweight, cost-efficient deployment in the 30B class.
🌐 Runs on common inference stacks like vLLM and SGLang.

💰 Pricing

$0.125 / $0.500

per 1M · input / output

📏 Context

128K tokens

📅 On Venice since

Jan 29, 2026

171 days ago

Provider

Z.ai

Z.ai, formally Knowledge Atlas Technology Joint Stock Co., Ltd., is a Chinese technology company specializing in artificial intelligence. Previously known internationally as Zhipu AI, the company rebranded to Z.ai in 2025. Its core focus is the GLM family of…

Read full profile →

12 models on Venice

11 text · 1 image

Since Apr 1, 2024

Wikipedia ↗Official site ↗

See 11 other models from Z.ai →

About this model

GLM 4.7 Flash is Z.ai's speed-optimized, lightweight member of the GLM-4.7 generation, released in early 2026 alongside the full-size GLM 4.7. According to its Hugging Face model card, it is a 30B-A3B mixture-of-experts model—30 billion total parameters with roughly 3 billion active per token—positioned for strong balance of performance and efficiency in the 30B class. It ships as open weights under the MIT license and runs on common inference stacks such as vLLM and SGLang.

Compared with the full GLM 4.7, the Flash variant is built for faster, cheaper inference by activating fewer parameters per token, while retaining the generation's coding focus, function calling, and web-search support. The catalog lists a 128K-token context window and reasoning capability.

The wider GLM-4.7 line advanced over GLM 4.6; Z.ai's GLM-4.7 model card reports the full model reaching 42.8% on Humanity's Last Exam, a gain over GLM-4.6, with improved tool use.

This makes Flash well suited to high-volume, latency-sensitive tasks, with the full GLM 4.7 available for harder jobs.

🤗View model card on HuggingFace ↗View source on GitHub ↗

Sources

zai-org/GLM-4.7-Flash · Hugging Facehuggingface.co ↗

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Research & Papers

Primary reference paper for this model family, sourced from the HuggingFace model card.

arXiv2508.06471Aug 2025

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models(2025)

GLM-4. 5 Team, :, Aohan Zeng et al.

We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and…

Data sources: Venice API · HuggingFace · Wikipedia · arXiv — enrichment updated 4d ago