Wan (Tongyi Wanxiang) Development History: From Open-Source Video Gen to Wan 2.6
Wan (Tongyi Wanxiang) is Alibaba's open-source video generation model family from the Tongyi lab. It delivers text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) with multi-shot 1080p output, stable characters, and native audio sync. This article traces its development from Wan 2.1 through Wan 2.6—benchmarks, architecture shifts, and open-release milestones.
2025: Wan 2.1 and the Open-Source Breakthrough
In early 2025, Wan 2.1 was released as a fully open-source video generative model. In January 2025 it topped the VBench leaderboard, outperforming Sora, HunyuanVideo, and other leading video models. The technical paper, Wan: Open and Advanced Large-Scale Video Generative Models, was published in March 2025, and the code and weights were released on GitHub, Hugging Face, and ModelScope under the Apache 2.0 license.
| Date | Event |
|---|---|
| Jan 2025 | Wan 2.1 tops VBench; 1.3B and 14B models |
| Feb–Mar 2025 | Wan 2.1 open source; 8 tasks including T2V, I2V, video edit |
| Mar 2025 | Technical paper and full code/weights release |
Wan 2.1: Scale, Efficiency, and Tasks
Video generative architecture and 3D causal VAE
Model Sizes and Efficiency
Wan 2.1 ships at two scales: 1.3B for speed and low VRAM (about 8.19 GB), and 14B for best quality. The 1.3B model runs on consumer GPUs such as the RTX 4090, generating a 5-second 480p clip in roughly four minutes without quantization or other optimizations. The 14B model leads internal and external benchmarks against other open and commercial video models.
Technical Highlights
- 3D causal VAE: Cuts VRAM use by about 60% while keeping visual quality
- Bilingual captions: First open video model to generate Chinese and English on-screen text
- Eight tasks: Text-to-video, image-to-video, instruction-guided video editing, personalized generation, and more
| Model | Parameters | Focus |
|---|---|---|
| Wan 2.1 small | 1.3B | Efficiency; ~8.19 GB VRAM, runs on RTX 4090 |
| Wan 2.1 large | 14B | Quality; tops VBench and other benchmarks |
Wan 2.2, 2.5, and 2.6: MoE, Audio, and Multi-Shot
Native audio-visual sync and multi-modal video generation
Wan 2.2 (July 2025)
Wan 2.2 brought a mixture-of-experts (MoE) architecture to open video generation. A 5B-parameter variant runs smoothly on consumer cards such as the RTX 4090 at 720p@24fps, while a 27B MoE version targets professional-grade visuals.
Wan 2.5 (September 2025)
Wan 2.5 added native audio-visual sync in a single pipeline spanning vision, language, and sound. It supports 480p, 720p, and 1080p output, videos up to 10 seconds, and accepts any combination of text, image, and audio as input.
Wan 2.6 (December 2025)
Wan 2.6 focuses on multi-shot narrative and character consistency, supports up to 15-second videos, and further improves audio-visual synchronization.
| Release | Key feature | Max duration / resolution |
|---|---|---|
| Wan 2.2 | MoE for video (5B / 27B) | 720p@24fps on consumer GPU |
| Wan 2.5 | Native A/V sync, multi-modal input | Up to 10 s; 480p / 720p / 1080p |
| Wan 2.6 | Multi-shot, character consistency | Up to 15 s |
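To make the capability limits in the table concrete, here is a minimal sketch of how a client might validate and assemble a text-to-video request against them. The function, field names, and model id below are illustrative assumptions, not the real Wan API; only the limits themselves (480p/720p/1080p, 15-second cap, native audio) come from this article.

```python
import json

def build_t2v_request(prompt: str, duration_s: int = 10,
                      resolution: str = "1080p", audio: bool = True) -> dict:
    """Build a hypothetical Wan 2.6 text-to-video request payload."""
    supported = {"480p", "720p", "1080p"}   # resolutions listed for Wan 2.5+
    if resolution not in supported:
        raise ValueError(f"unsupported resolution: {resolution}")
    if not 1 <= duration_s <= 15:           # Wan 2.6 caps clips at 15 seconds
        raise ValueError("duration must be between 1 and 15 seconds")
    return {
        "model": "wan2.6-t2v",              # illustrative model id, not an official one
        "prompt": prompt,
        "duration": duration_s,
        "resolution": resolution,
        "audio": audio,                     # native audio-visual sync (Wan 2.5+)
    }

payload = build_t2v_request("a cat surfing at sunset", duration_s=15)
print(json.dumps(payload, indent=2))
```

Validating limits client-side like this fails fast before any generation time is spent; the actual parameter names would come from whichever Wan deployment or SDK you use.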
Summary: Why Wan Matters
Wan establishes Alibaba Tongyi as a major contributor to open video generation. Full code and weights (Apache 2.0), strong benchmarks (e.g. VBench), and a clear path from 2.1 to 2.6—with MoE, native audio, and longer multi-shot narratives—make it a go-to option for researchers and builders who need high-quality, affordable video generation.
Key Takeaways
- Wan 2.1 (2025): First full open-source release; 1.3B and 14B; topped VBench; 8 tasks including T2V, I2V, editing
- Wan 2.2: MoE for video; 5B runs on RTX 4090; 27B for pro quality
- Wan 2.5: Native audio-visual sync; 10 s; 480p–1080p; text/image/audio input
- Wan 2.6: Multi-shot narrative, character consistency, 15 s
- All code and weights open under Apache 2.0 on GitHub, Hugging Face, ModelScope
Try Wan on FuseAITools for text-to-video, image-to-video, and video-to-video generation up to 15 seconds with 1080p and native audio.