Wan (Tongyi Wanxiang) Development History: From Open-Source Video Gen to Wan 2.6
Wan (Tongyi Wanxiang) is Alibaba's open-source video generation model family from the Tongyi lab. It delivers text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) with multi-shot 1080p output, stable characters, and native audio sync. This article traces its development from Wan 2.1 through Wan 2.6—benchmarks, architecture shifts, and open-release milestones.
2025: Wan 2.1 and the Open-Source Breakthrough
In early 2025, Wan 2.1 was released as a fully open-source video generative model. In January 2025 it topped the VBench leaderboard, outperforming Sora, HunyuanVideo, and other leading video models. The technical paper, Wan: Open and Advanced Large-Scale Video Generative Models, was published in March 2025, and the code and weights were released on GitHub, Hugging Face, and ModelScope under the Apache 2.0 license.
| Date | Event |
|---|---|
| Jan 2025 | Wan 2.1 tops VBench; 1.3B and 14B models |
| Feb–Mar 2025 | Wan 2.1 open source; 8 tasks including T2V, I2V, video edit |
| Mar 2025 | Technical paper and full code/weights release |
Wan 2.1: Scale, Efficiency, and Tasks
Video generative architecture and 3D causal VAE
Model Sizes and Efficiency
Wan 2.1 ships at two scales: 1.3B for speed and low VRAM (about 8.19 GB), and 14B for best quality. The 1.3B model runs on consumer GPUs such as the RTX 4090, generating a 5-second 480p clip in roughly four minutes without quantization or other optimizations. The 14B model leads internal and external benchmarks against other open and commercial video models.
Technical Highlights
- 3D causal VAE: Cuts VRAM use by about 60% while keeping visual quality
- Bilingual captions: First open video model to generate Chinese and English on-screen text
- Eight tasks: Text-to-video, image-to-video, instruction-guided video editing, personalized generation, and more
| Model | Parameters | Focus |
|---|---|---|
| Wan 2.1 small | 1.3B | Efficiency; ~8.19 GB VRAM, runs on RTX 4090 |
| Wan 2.1 large | 14B | Quality; tops VBench and other benchmarks |
Wan 2.2, 2.5, and 2.6: MoE, Audio, and Multi-Shot
Native audio-visual sync and multi-modal video generation
Wan 2.2 (July 2025)
Wan 2.2 brought a mixture-of-experts (MoE) architecture to open video generation. A 5B-parameter variant runs smoothly on consumer cards such as the RTX 4090 at 720p@24fps, while a 27B MoE version targets professional-grade visuals.
Wan 2.5 (September 2025)
Wan 2.5 added native audio-visual sync in a single pipeline spanning vision, language, and sound. It supports 480p, 720p, and 1080p output, videos up to 10 seconds, and accepts any combination of text, image, and audio as input.
Wan 2.6 (December 2025)
Wan 2.6 focuses on multi-shot narrative and character consistency, supports up to 15-second videos, and further improves audio-visual synchronization.
| Release | Key feature | Max duration / resolution |
|---|---|---|
| Wan 2.2 | MoE for video (5B / 27B) | 720p@24fps on consumer GPU |
| Wan 2.5 | Native A/V sync, multi-modal input | Up to 10 s; 480p / 720p / 1080p |
| Wan 2.6 | Multi-shot, character consistency | Up to 15 s |
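To make the capability limits in the table concrete, here is a minimal sketch of how a client might validate and assemble a text-to-video request against them. The function, field names, and model id below are illustrative assumptions, not the real Wan API; only the limits themselves (480p/720p/1080p, 15-second cap, native audio) come from this article.

```python
import json

def build_t2v_request(prompt: str, duration_s: int = 10,
                      resolution: str = "1080p", audio: bool = True) -> dict:
    """Build a hypothetical Wan 2.6 text-to-video request payload."""
    supported = {"480p", "720p", "1080p"}   # resolutions listed for Wan 2.5+
    if resolution not in supported:
        raise ValueError(f"unsupported resolution: {resolution}")
    if not 1 <= duration_s <= 15:           # Wan 2.6 caps clips at 15 seconds
        raise ValueError("duration must be between 1 and 15 seconds")
    return {
        "model": "wan2.6-t2v",              # illustrative model id, not an official one
        "prompt": prompt,
        "duration": duration_s,
        "resolution": resolution,
        "audio": audio,                     # native audio-visual sync (Wan 2.5+)
    }

payload = build_t2v_request("a cat surfing at sunset", duration_s=15)
print(json.dumps(payload, indent=2))
```

Validating limits client-side like this fails fast before any generation time is spent; the actual parameter names would come from whichever Wan deployment or SDK you use.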
Summary: Why Wan Matters
Wan establishes Alibaba Tongyi as a major contributor to open video generation. Full code and weights (Apache 2.0), strong benchmarks (e.g. VBench), and a clear path from 2.1 to 2.6—with MoE, native audio, and longer multi-shot narratives—make it a go-to option for researchers and builders who need high-quality, affordable video generation.
Key Takeaways
- Wan 2.1 (2025): First full open-source release; 1.3B and 14B; topped VBench; 8 tasks including T2V, I2V, editing
- Wan 2.2: MoE for video; 5B runs on RTX 4090; 27B for pro quality
- Wan 2.5: Native audio-visual sync; 10 s; 480p–1080p; text/image/audio input
- Wan 2.6: Multi-shot narrative, character consistency, 15 s
- All code and weights open under Apache 2.0 on GitHub, Hugging Face, ModelScope
Try Wan on FuseAITools for text-to-video, image-to-video, and video-to-video generation up to 15 seconds with 1080p and native audio.