About this model
Longcat Full Quality is the image-to-video member of Meituan's LongCat-Video family, an open-source foundational video generator first detailed in the LongCat-Video Technical Report. It is built on a Diffusion Transformer (DiT) framework and, unusually, uses a single unified model to serve text-to-video, image-to-video, and video-continuation tasks. Per Meituan's report, pretraining on the video-continuation objective is what lets it sustain quality and temporal coherence across minutes-long clips, and it produces 720p, 30fps output using a coarse-to-fine strategy along both temporal and spatial axes.
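To give a rough sense of what "minutes-long" means at the stated 720p, 30fps output, the arithmetic can be sketched as below. This is illustrative only: the 1280x720 frame size is the conventional meaning of 720p and is assumed here, the helper names are hypothetical, and the example durations are not taken from Meituan's report.

```python
# Illustrative arithmetic only; not part of any LongCat-Video API.
FPS = 30                    # frame rate stated in the description above
WIDTH, HEIGHT = 1280, 720   # assumption: conventional 720p frame size


def frame_count(duration_seconds: float, fps: int = FPS) -> int:
    """Number of frames in a clip of the given duration."""
    return round(duration_seconds * fps)


def raw_rgb_bytes(frames: int, width: int = WIDTH, height: int = HEIGHT) -> int:
    """Uncompressed 8-bit RGB size of a clip, in bytes (3 bytes per pixel)."""
    return frames * width * height * 3


if __name__ == "__main__":
    # Example durations are arbitrary, chosen only to show the scaling.
    for seconds in (5, 60, 120):
        n = frame_count(seconds)
        print(f"{seconds:>4}s -> {n} frames, {raw_rgb_bytes(n) / 1e9:.2f} GB raw RGB")
```

The frame count grows linearly with duration, which is one reason sustaining coherence over minutes is harder than over a few seconds.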
Within the family, this "Full Quality" image-to-video model prioritizes output fidelity, while its companion, Longcat Distilled, applies step distillation for faster generation. The same split exists for the text-conditioned variants: a Longcat Full Quality text-to-video model and a Longcat Distilled text-to-video model. Choosing Full Quality means accepting longer inference time than the distilled path in exchange for higher visual quality.
The underlying system is a 13.6B-parameter dense model that, per Meituan's technical report, employs Block Sparse Attention. The weights are distributed under the MIT License, which permits broad commercial and research use.
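As a back-of-the-envelope check on what 13.6B parameters implies for hardware, the weight footprint can be estimated from the parameter count alone. This is a minimal sketch under stated assumptions: the storage formats listed are common choices, not confirmed release formats, the helper name is hypothetical, and the estimate covers weights only (no activations, text encoder, or VAE).

```python
# Rough weight-memory estimate; weights only, excludes activations and
# any auxiliary components such as a text encoder or VAE.
PARAMS = 13_600_000_000  # 13.6B parameters, as stated above

# Assumption: common storage formats, not confirmed for the released weights.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_gigabytes(params: int, dtype: str) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_gigabytes(PARAMS, dtype):.1f} GB")
```

At 2 bytes per parameter the weights alone come to roughly 27.2 GB, so actual memory needs during generation will be higher.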
In the image-to-video configuration, the model takes a still input image and generates video output with consistent subjects, optionally guided by a text prompt.
This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.
Research & Papers
Primary reference paper for this model family, sourced from the HuggingFace model card.
Data sources: Venice API · HuggingFace · Wikipedia · arXiv