Wan (Tongyi Wanxiang) Development History: From Open-Source Video Gen to Wan 2.6

Wan (Tongyi Wanxiang) is Alibaba's open-source video generation model family from the Tongyi lab. It delivers text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) with multi-shot 1080p output, stable characters, and native audio sync. This article traces its development from Wan 2.1 through Wan 2.6—benchmarks, architecture shifts, and open-release milestones.

2025: Wan 2.1 and the Open-Source Breakthrough

In early 2025, Alibaba released Wan 2.1 as a fully open-source video generation model. In January 2025 it topped the VBench leaderboard, outperforming Sora, HunyuanVideo, and other leading video models. The technical paper, Wan: Open and Advanced Large-Scale Video Generative Models, was published in March 2025, and the code and weights were released on GitHub, Hugging Face, and ModelScope under the Apache 2.0 license.

Date         | Event
Jan 2025     | Wan 2.1 tops VBench; 1.3B and 14B models
Feb–Mar 2025 | Wan 2.1 open-sourced; 8 tasks including T2V, I2V, video editing
Mar 2025     | Technical paper and full code/weights release

Wan 2.1: Scale, Efficiency, and Tasks

[Figure: Wan video generative architecture with 3D causal VAE]

Model Sizes and Efficiency

Wan 2.1 offers two scales: 1.3B for speed and low VRAM (about 8.19 GB), and 14B for best quality. The 1.3B model runs on consumer GPUs such as the RTX 4090, generating a 5-second 480p clip in roughly four minutes without extra optimizations such as quantization. The 14B model leads internal and external benchmarks against other open and commercial video models.
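
As a quick illustration of how the open weights are typically consumed, here is a minimal text-to-video sketch using the Hugging Face diffusers integration (WanPipeline) with the Wan-AI/Wan2.1-T2V-1.3B-Diffusers checkpoint. Treat it as a starting point under those assumptions and check the official repositories for current usage:

```python
# Minimal Wan 2.1 text-to-video sketch via the diffusers integration.
# Assumes a recent diffusers release with WanPipeline and the 1.3B checkpoint.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# The Wan VAE is commonly loaded in float32 for stability; the rest in bf16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="A cat walks on the grass, realistic style",
    height=480, width=832,  # 480p-class output for the 1.3B model
    num_frames=81,          # 81 frames at 16 fps is about 5 seconds
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "wan_t2v.mp4", fps=16)
```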

Technical Highlights

  • 3D causal VAE: Cuts VRAM use by about 60% while keeping visual quality (see the sketch after the table below)
  • Bilingual captions: First open video model to generate Chinese and English on-screen text
  • Eight tasks: Text-to-video, image-to-video, instruction-guided video editing, personalized generation, and more
Model         | Parameters | Focus
Wan 2.1 small | 1.3B       | Efficiency; 5 s 480p clip in ~4 min on an RTX 4090
Wan 2.1 large | 14B        | Quality; tops VBench and other benchmarks
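
To make the "causal" part of the 3D causal VAE concrete, here is a toy PyTorch sketch of a causal 3D convolution: all temporal padding goes on the past side of the time axis, so the feature for frame t never depends on later frames. This is a conceptual illustration of the idea, not Wan's actual VAE code:

```python
# Toy causal 3D convolution: temporal padding only toward the past,
# so the output at frame t depends only on frames <= t.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.t_pad = kt - 1  # all temporal padding in front (the past)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=(0, kh // 2, kw // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        x = F.pad(x, (0, 0, 0, 0, self.t_pad, 0))  # pad order: W, H, T(front, back)
        return self.conv(x)

frames = torch.randn(1, 3, 8, 64, 64)  # 8 RGB frames at 64x64
print(CausalConv3d(3, 16)(frames).shape)  # torch.Size([1, 16, 8, 64, 64])
```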

Wan 2.2, 2.5, and 2.6: MoE, Audio, and Multi-Shot

[Figure: Wan 2.5 native audio-visual sync and multi-modal video generation]

Wan 2.2 (July 2025)

Wan 2.2 introduced MoE (mixture of experts) to open video generation. A 5B-parameter dense variant runs smoothly on consumer cards (e.g. the RTX 4090) at 720p@24fps, while a 27B MoE version (about 14B parameters active per denoising step) targets professional-grade visuals.
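
Wan 2.2's published design splits denoising between two experts by noise level: a high-noise expert for early steps and a low-noise expert for late steps, so only part of the total parameters is active at any step. The toy sketch below illustrates that routing idea with stand-in linear layers and an assumed switch point; it is not the actual Wan 2.2 code:

```python
# Toy sketch of timestep-routed two-expert MoE in the spirit of Wan 2.2.
# The boundary value and linear "experts" are illustrative assumptions.
import torch
import torch.nn as nn

class TwoExpertDenoiser(nn.Module):
    def __init__(self, dim: int, boundary_t: int = 500):
        super().__init__()
        self.boundary_t = boundary_t                  # assumed switch point
        self.high_noise_expert = nn.Linear(dim, dim)  # stand-in for a full DiT expert
        self.low_noise_expert = nn.Linear(dim, dim)   # stand-in for a full DiT expert

    def forward(self, latents: torch.Tensor, t: int) -> torch.Tensor:
        # Early (noisy) steps route to one expert, late (clean) steps to the
        # other, so only one expert's weights are active per denoising step.
        expert = self.high_noise_expert if t >= self.boundary_t else self.low_noise_expert
        return expert(latents)

model = TwoExpertDenoiser(dim=64)
x = torch.randn(1, 64)
print(model(x, t=900).shape, model(x, t=100).shape)  # both torch.Size([1, 64])
```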

Wan 2.5 (September 2025)

Wan 2.5 added native audio-visual synchronization in a single pipeline spanning vision, language, and sound. It supports 480p, 720p, and 1080p output, video lengths of up to 10 seconds, and any combination of text, image, and audio as input.

Wan 2.6 (December 2025)

Wan 2.6 focuses on multi-shot narrative and character consistency, supports up to 15-second videos, and further improves audio-visual synchronization.

Release | Key feature                        | Scale / max duration / resolution
Wan 2.2 | MoE for video                      | 5B / 27B; 720p@24fps on consumer GPUs
Wan 2.5 | Native A/V sync, multi-modal input | Up to 10 s; 480p / 720p / 1080p
Wan 2.6 | Multi-shot, character consistency  | Up to 15 s

Summary: Why Wan Matters

Wan establishes Alibaba Tongyi as a major contributor to open video generation. Full code and weights (Apache 2.0), strong benchmarks (e.g. VBench), and a clear path from 2.1 to 2.6—with MoE, native audio, and longer multi-shot narratives—make it a go-to option for researchers and builders who need high-quality, affordable video generation.

Key Takeaways

  • Wan 2.1 (2025): First full open-source release; 1.3B and 14B; topped VBench; 8 tasks including T2V, I2V, editing
  • Wan 2.2: MoE for video; 5B runs on RTX 4090; 27B for pro quality
  • Wan 2.5: Native audio-visual sync; 10 s; 480p–1080p; text/image/audio input
  • Wan 2.6: Multi-shot narrative, character consistency, 15 s
  • All code and weights open under Apache 2.0 on GitHub, Hugging Face, ModelScope

Try Wan on FuseAITools for text-to-video, image-to-video, and video-to-video generation up to 15 seconds with 1080p and native audio.

Disclaimer: Timeline and capabilities are based on public announcements and the paper Wan: Open and Advanced Large-Scale Video Generative Models; see Alibaba Tongyi and official repositories for authoritative details.

This article will be updated as new Wan versions and benchmarks are released.