Meituan · 🎬 Video Generation

Longcat Full Quality

private

Quick reference

Longcat Full Quality — TLDR

- 🏢 Meituan's open-source LongCat-Video, image-to-video full-quality variant.
- 🆕 Single Diffusion Transformer handles text-to-video, image-to-video, video continuation.
- 📏 Targets coherent, minutes-long sequences with consistent subjects.
- 👁️ Outputs 720p at 30fps via coarse-to-fine generation.
- 🧠 13.6B-parameter dense architecture using Block Sparse Attention.
- ⚡ Distilled sibling variants are offered for faster inference.
- 🔒 Weights released under the permissive MIT License.
- 🎯 Pretrained on video continuation for temporal coherence.

💰 Pricing

$0.250 – $1.52

per generation

📅 On Venice since

Dec 4, 2025

227 days ago

Provider

Meituan

Meituan is a Chinese technology company founded in 2010 by Wang Xing and headquartered in Beijing. Best known for its massive local services platform — spanning on-demand food delivery, consumer reviews, hotel bookings, and instant retail — Meituan listed on…

4 models on Venice

4 video

About this model

Longcat Full Quality is the image-to-video member of Meituan's LongCat-Video family, an open-source foundational video generator first detailed in the LongCat-Video Technical Report. It is built on a Diffusion Transformer (DiT) framework and, unusually, uses a single unified model to serve text-to-video, image-to-video, and video-continuation tasks. Per Meituan's report, pretraining on the video-continuation objective is what lets it sustain quality and temporal coherence across minutes-long clips, and it produces 720p, 30fps output using a coarse-to-fine strategy along both temporal and spatial axes.
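
To give a sense of scale for "minutes-long" output at the stated 30fps, the frame counts can be worked out with simple arithmetic. This is only a back-of-the-envelope sketch: the function name and the sample durations are illustrative and do not come from Meituan's report.

```python
# Rough frame-count arithmetic for 30 fps output (illustrative only).
FPS = 30  # frame rate stated in the description above


def frame_count(duration_seconds: float, fps: int = FPS) -> int:
    """Return the number of frames in a clip of the given duration."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return round(duration_seconds * fps)


if __name__ == "__main__":
    # Sample durations are arbitrary examples, not limits of the model.
    for seconds in (5, 60, 120):
        print(f"{seconds:>4} s -> {frame_count(seconds)} frames")
```

A one-minute clip at this frame rate is already 1,800 frames, which is why holding subjects consistent over that span is the hard part the continuation pretraining is meant to address.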

Within the family, this "Full Quality" image-to-video model prioritizes output fidelity, while its companion Longcat Distilled applies step distillation for faster generation. The same split exists for the text-conditioned variants: a Longcat Full Quality text-to-video model and a Longcat Distilled text-to-video model. Choosing Full Quality means accepting longer inference time in exchange for higher visual quality than the distilled path.

The underlying system is a 13.6B-parameter dense model that, per Meituan's technical report, employs Block Sparse Attention. The weights are distributed under the MIT License, which permits broad commercial and research use.

In the image-to-video configuration, the model takes a still input image and generates video output with consistent subjects, optionally guided by a text prompt.
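
The input shape described above (one required still image, one optional text prompt) can be sketched as a small request-building helper. Everything named here is an assumption for illustration: the function `build_i2v_request`, the field names `model`, `image`, and `prompt`, and the model identifier string are hypothetical and are not taken from Venice's API or Meituan's code. Check the provider's documentation for the real request format.

```python
import base64
from typing import Optional

# Hypothetical identifier, used only to make the sketch concrete.
HYPOTHETICAL_MODEL_ID = "longcat-full-quality-i2v"


def build_i2v_request(image_bytes: bytes, prompt: Optional[str] = None) -> dict:
    """Build an illustrative image-to-video request body.

    The image is required; the text prompt is optional guidance.
    Field names are assumptions, not a documented schema.
    """
    if not image_bytes:
        raise ValueError("an input image is required for image-to-video")
    payload = {
        "model": HYPOTHETICAL_MODEL_ID,
        # Base64 keeps binary image data safe inside a JSON body.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    if prompt:
        payload["prompt"] = prompt
    return payload


if __name__ == "__main__":
    fake_image = b"\x89PNG placeholder bytes"  # stand-in for a real image file
    request_body = build_i2v_request(fake_image, prompt="slow pan across the scene")
    print(sorted(request_body.keys()))
```

The helper only assembles a dictionary and sends nothing over the network, so it can be adapted to whatever schema the actual endpoint expects.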

Sources

[2510.22200] LongCat-Video Technical Report (arxiv.org)

meituan-longcat/LongCat-Video-Avatar-1.5 · Hugging Face (huggingface.co)

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Research & Papers

Primary reference paper for this model family, sourced from the HuggingFace model card.

arXiv 2510.22200 · Oct 2025

LongCat-Video Technical Report (2025)

Meituan LongCat Team, Xunliang Cai, Qilong Huang et al.

Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video…

Data sources: Venice API · HuggingFace · Wikipedia · arXiv — enrichment updated 4d ago