Meituan · 🎬 Video Generation

Longcat Full Quality

private
Quick reference
Longcat Full Quality — TLDR
  • 🏢 Meituan's open-source LongCat-Video, image-to-video full-quality variant.
  • 🆕 Single Diffusion Transformer handles text-to-video, image-to-video, video continuation.
  • 📏 Targets coherent, minutes-long sequences with consistent subjects.
  • 👁️ Outputs 720p at 30fps via coarse-to-fine generation.
  • 🧠 13.6B-parameter dense architecture using Block Sparse Attention.
  • ⚡ Distilled sibling variants are offered for faster inference.
  • 🔒 Weights released under the permissive MIT License.
  • 🎯 Pretrained on video continuation for temporal coherence.
💰 Pricing
$0.250 – $1.52
per generation
📅 On Venice since
Dec 4, 2025
182 days ago
Provider

Meituan is a Chinese technology company founded in 2010 by Wang Xing and headquartered in Beijing. Best known for its massive local services platform — spanning on-demand food delivery, consumer reviews, hotel bookings, and instant retail — Meituan listed on…

4 models on Venice
4 video
Added Dec 4, 2025

About this model

Longcat Full Quality is the image-to-video member of Meituan's LongCat-Video family, an open-source foundational video generator first detailed in the LongCat-Video Technical Report. It is built on a Diffusion Transformer (DiT) framework and, unusually, uses a single unified model to serve text-to-video, image-to-video, and video-continuation tasks. Per Meituan's report, pretraining on the video-continuation objective is what lets it sustain quality and temporal coherence across minutes-long clips, and it produces 720p, 30fps output using a coarse-to-fine strategy along both temporal and spatial axes.
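The following back-of-envelope sketch makes the coarse-to-fine idea concrete. It compares how many pixels per second of video a coarse draft stage would produce against the final 720p, 30fps output. The coarse-stage resolution and frame rate (854x480 at 15fps) are hypothetical placeholders chosen for illustration, not figures taken from Meituan's report. The sketch counts pixels only and makes no claim about step counts or per-stage attention cost.

```python
# Back-of-envelope comparison of spatio-temporal volume for a hypothetical
# coarse draft stage versus the final 720p / 30fps output.
# ASSUMPTION: the COARSE settings are illustrative placeholders, not values
# from Meituan's technical report.

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    width: int
    height: int
    fps: int

    def pixels_per_second(self) -> int:
        # Pixels per frame multiplied by frames per second of video.
        return self.width * self.height * self.fps


FINAL = Stage(width=1280, height=720, fps=30)  # 720p at 30fps, as described above
COARSE = Stage(width=854, height=480, fps=15)  # hypothetical coarse stage


def volume_ratio(coarse: Stage, final: Stage) -> float:
    """Fraction of the final output's pixel volume that the coarse stage generates."""
    return coarse.pixels_per_second() / final.pixels_per_second()


if __name__ == "__main__":
    seconds = 60  # a one-minute clip
    print(f"final frames for {seconds}s: {FINAL.fps * seconds}")
    print(f"coarse/final pixel-volume ratio: {volume_ratio(COARSE, FINAL):.3f}")
```

Because dense attention cost grows faster than linearly with sequence length, a pixel-volume ratio like this one understates rather than overstates the compute a low-resolution draft would avoid.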

Within the family, this "Full Quality" image-to-video model prioritizes output fidelity, while its companion Longcat Distilled applies step distillation for faster generation. The text-conditioned variants follow the same split: the Longcat Full Quality text-to-video model and the Longcat Distilled text-to-video model. Choosing Full Quality trades longer inference time for higher visual quality than the distilled path provides.
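The speed side of this trade-off can be reasoned about with a simple cost model. The sketch below assumes wall-clock time is dominated by denoiser forward passes, so step distillation pays off by cutting the number of sampling steps. The step counts, passes per step, and per-pass latency are hypothetical values, not Meituan's published settings.

```python
# Rough cost model comparing a full-quality sampler with a step-distilled one.
# ASSUMPTIONS (hypothetical, not Meituan's published configuration):
#   - runtime is dominated by denoiser forward passes;
#   - passes_per_step covers any extra passes a sampler makes per step
#     (for example 2 if guidance needs a conditional and an unconditional pass);
#   - all numbers in the example are placeholders.


def forward_passes(steps: int, passes_per_step: int = 1) -> int:
    """Total denoiser evaluations for one generation."""
    if steps <= 0 or passes_per_step <= 0:
        raise ValueError("steps and passes_per_step must be positive")
    return steps * passes_per_step


def estimated_seconds(steps: int, passes_per_step: int, seconds_per_pass: float) -> float:
    """Estimated wall-clock time, ignoring fixed overheads such as decoding."""
    return forward_passes(steps, passes_per_step) * seconds_per_pass


if __name__ == "__main__":
    per_pass = 1.0  # hypothetical seconds per forward pass
    full = estimated_seconds(steps=50, passes_per_step=2, seconds_per_pass=per_pass)
    distilled = estimated_seconds(steps=16, passes_per_step=1, seconds_per_pass=per_pass)
    print(f"full: {full:.0f}s  distilled: {distilled:.0f}s  ratio: {full / distilled:.2f}")
```

The model deliberately says nothing about quality; it only explains why a distilled variant is faster, not how much fidelity it gives up.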

The underlying system is a 13.6B-parameter dense model that, per Meituan's technical report, employs Block Sparse Attention. The weights are distributed under the MIT License, which permits broad commercial and research use.
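In its generic form, block sparse attention splits the token sequence into fixed-size blocks and lets each query block attend to only a small subset of key blocks, which cuts the quadratic cost of dense attention over long video token sequences. The pure-Python sketch below uses one common selection rule: rank key blocks by the dot product of block-mean vectors and keep the top `top_k`. That rule and the function name `block_sparse_attention` are illustrative assumptions, not a description of Meituan's implementation.

```python
# Minimal block-sparse attention sketch in pure Python.
# ASSUMPTION: generic top-k block selection by block-mean similarity, chosen for
# illustration; this is not Meituan's Block Sparse Attention implementation.

import math
from typing import List, Tuple

Matrix = List[List[float]]


def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _mean(rows: Matrix) -> List[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def block_sparse_attention(
    q: Matrix, k: Matrix, v: Matrix, block: int, top_k: int
) -> Tuple[Matrix, List[List[int]]]:
    """Each query block attends only to its top_k key blocks.

    Returns (outputs, selected), where selected[i] lists the key-block
    indices kept for query block i.
    """
    n, d = len(q), len(q[0])
    if len(k) != n or len(v) != n or n % block != 0:
        raise ValueError("q, k, v must have equal length divisible by block")
    n_blocks = n // block
    top_k = min(top_k, n_blocks)
    blocks = [range(i * block, (i + 1) * block) for i in range(n_blocks)]
    k_means = [_mean([k[j] for j in b]) for b in blocks]
    scale = 1.0 / math.sqrt(d)

    outputs: Matrix = []
    selected: List[List[int]] = []
    for qb in blocks:
        q_mean = _mean([q[i] for i in qb])
        # Rank key blocks by similarity of block means; keep the best top_k.
        ranked = sorted(range(n_blocks), key=lambda j: _dot(q_mean, k_means[j]), reverse=True)
        keep = sorted(ranked[:top_k])
        selected.append(keep)
        cols = [j for kb in keep for j in blocks[kb]]
        for i in qb:
            # Ordinary scaled softmax attention, restricted to the kept keys.
            scores = [_dot(q[i], k[j]) * scale for j in cols]
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            outputs.append(
                [sum(w * v[j][c] for w, j in zip(weights, cols)) / z for c in range(len(v[0]))]
            )
    return outputs, selected
```

With `top_k` equal to the number of blocks, every query sees every key and the function reduces to ordinary dense softmax attention, which is a handy sanity check.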

In the image-to-video configuration, the model takes a still input image and generates video output with consistent subjects, optionally guided by a text prompt.
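One way a single model can serve all three tasks is to treat each one as continuation from some number of clean conditioning frames: zero for text-to-video, one for image-to-video, several for video continuation. The sketch below encodes that routing and lays out which frame slots are supplied and which are generated. Treat this framing as an assumption to check against the technical report; `GenerationRequest`, `task_for`, and `frame_plan` are hypothetical names, not part of any Meituan or Venice API.

```python
# Hypothetical sketch: routing text-to-video, image-to-video and continuation
# through one model by counting clean conditioning frames.
# ASSUMPTION: the task is determined by the number of conditioning frames;
# none of these names belong to a real Meituan or Venice API.

from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Task(Enum):
    TEXT_TO_VIDEO = "text-to-video"
    IMAGE_TO_VIDEO = "image-to-video"
    VIDEO_CONTINUATION = "video-continuation"


@dataclass(frozen=True)
class GenerationRequest:
    condition_frames: int  # clean frames supplied by the user (1 for a still image)
    new_frames: int  # frames the model must generate
    prompt: Optional[str] = None  # optional text guidance


def task_for(condition_frames: int) -> Task:
    """Map the number of conditioning frames to the task it represents."""
    if condition_frames < 0:
        raise ValueError("condition_frames must be non-negative")
    if condition_frames == 0:
        return Task.TEXT_TO_VIDEO
    if condition_frames == 1:
        return Task.IMAGE_TO_VIDEO
    return Task.VIDEO_CONTINUATION


def frame_plan(req: GenerationRequest) -> List[bool]:
    """One entry per frame slot: True if supplied (kept clean), False if generated."""
    if req.new_frames <= 0:
        raise ValueError("new_frames must be positive")
    return [True] * req.condition_frames + [False] * req.new_frames


if __name__ == "__main__":
    # Image-to-video: a single still image, optionally guided by a prompt.
    req = GenerationRequest(condition_frames=1, new_frames=8, prompt="a cat turns its head")
    print(task_for(req.condition_frames).value, frame_plan(req))
```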

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies; verify critical details against the sources listed below.

Research & Papers

Primary reference paper for this model family, sourced from the HuggingFace model card.

Data sources: Venice API · HuggingFace · Wikipedia · arXiv — enrichment updated 1d ago