Wan Video Model Family Comparison: How to Choose the Right Version Across Seven Core Workflows

Introduction: Alibaba Cloud's Wan Video Empire

As of 2026, the Wan family covers a full AI video matrix: text-to-video, image-driven animation, video editing, and multimodal reference fusion. Public performance discussions place Wan2.1 among the strongest video foundation models worldwide in practical evaluations.

For most users, though, the main question stays the same: what are the real differences between text-to-video, image-to-video, video-to-video, video-edit, and r2v, and which version should you try first?

This article maps all seven Wan variants based on complete API parameter structures and helps you make route-level decisions quickly.

Wan tool hub: /home/wan

I. Family Snapshot: One Table to Understand All Seven

No.	Model version	Core function	Input type	Duration	Unique advantage	Best scenario
1	2.6 text-to-video	Text to video	Prompt	5/10/15s	Up to 15s, detail-friendly	Creative generation, ad drafts
2	2.6 image-to-video	Image to video	One image + prompt	5/10/15s	Image-driven animation	Animate static visuals
3	2.6 video-to-video	Video to video	1-3 videos + prompt	5/10s	Multi-video blending	Style transfer, reenactment
4	2.7 text-to-video	Enhanced text to video	Prompt + negative prompt	2-15s	Negative prompt + rewrite	Precise generation control
5	2.7 image-to-video	Enhanced image to video	First/last frame + audio	2-15s	Keyframe and audio-driven	Storyboard-like animation
6	2.7 video-edit	Video editing	Video + ref image + prompt	2-10s	Local editing + ref control	Outfit and background swap
7	2.7 r2v	Reference-to-video	Image + video + audio	2-10s	Multimodal fusion	Complex character consistency

II. Seven Models Deep Dive

Model 1: 2.6 text-to-video (Standard)

This is the base Wan text-to-video model that converts prompts directly into short videos.

{
  "prompt": "Text prompt (max 5000 chars)",
  "duration": "5 / 10 / 15",
  "resolution": "720p / 1080p"
}

Highlights: up to 15 seconds; bilingual prompts; suitable for longer short-form concepts.
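The three fields above can be checked client-side before a request is sent. The following is a minimal sketch, not the official client: the function name `build_t2v_26_payload` is hypothetical, and serializing `duration` as a string is an assumption based on how the block above displays it.

```python
ALLOWED_DURATIONS = (5, 10, 15)            # seconds, from the parameter block above
ALLOWED_RESOLUTIONS = ("720p", "1080p")
MAX_PROMPT_CHARS = 5000


def build_t2v_26_payload(prompt: str, duration: int = 5, resolution: str = "720p") -> dict:
    """Return a request body for 2.6 text-to-video, or raise ValueError."""
    if not prompt or len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt must be 1-{MAX_PROMPT_CHARS} characters")
    if duration not in ALLOWED_DURATIONS:
        raise ValueError(f"duration must be one of {ALLOWED_DURATIONS}")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
    # Assumption: duration is sent as a string; adjust if the route expects a number.
    return {"prompt": prompt, "duration": str(duration), "resolution": resolution}
```

Calling `build_t2v_26_payload("A red kite over a beach", duration=10)` yields a three-field body, while `duration=7` is rejected locally instead of costing a failed request.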

Use cases: ad drafts, concept videos, ASMR-style visuals, product demos.

Model 2: 2.6 image-to-video

Turns one static image plus a prompt into animated output.

{
  "prompt": "Video prompt",
  "image_urls": ["Image URL (max 1)"],
  "duration": "5 / 10 / 15",
  "resolution": "720p / 1080p"
}

Limits: one image only; min 256x256; max 10MB; jpeg/png/webp.
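These image limits are easy to pre-check before upload. The sketch below assumes the caller already knows each image's dimensions, byte size, and format; `ImageInfo` and `image_problems` are hypothetical names, and reading "10MB" as 10 MiB is an assumption.

```python
from dataclasses import dataclass

MIN_SIDE_PX = 256
MAX_BYTES = 10 * 1024 * 1024               # assumption: "10MB" read as 10 MiB
ALLOWED_FORMATS = {"jpeg", "png", "webp"}


@dataclass
class ImageInfo:
    width: int
    height: int
    size_bytes: int
    fmt: str                               # lowercase format name, e.g. "png"


def image_problems(images: list[ImageInfo]) -> list[str]:
    """Return human-readable violations of the 2.6 image-to-video limits."""
    problems = []
    if len(images) != 1:
        problems.append("exactly one image is required")
    for img in images:
        if min(img.width, img.height) < MIN_SIDE_PX:
            problems.append(f"{img.width}x{img.height} is below 256x256")
        if img.size_bytes > MAX_BYTES:
            problems.append("file is larger than 10MB")
        if img.fmt.lower() not in ALLOWED_FORMATS:
            problems.append(f"format {img.fmt!r} is not jpeg/png/webp")
    return problems
```

An empty list means the image passes all four stated limits.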

Use cases: animate old photos, illustration motion, dynamic product display.

Model 3: 2.6 video-to-video

Generates a new video from source video inputs plus a prompt.

{
  "prompt": "Video prompt",
  "video_urls": ["Video URL (max 3)"],
  "duration": "5 / 10",
  "resolution": "720p / 1080p"
}

Highlights: up to 3 source videos; duration supports 5s or 10s only; mp4/mov/mkv; max 10MB each.
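The count, duration, and container limits above can be sketched as a small pre-flight check. `check_v2v_inputs` is a hypothetical helper, and inferring the container from the URL's file extension is an assumption; the 10MB size limit cannot be checked from a URL alone, so it is left out.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {".mp4", ".mov", ".mkv"}
ALLOWED_DURATIONS = (5, 10)                # 15s is not offered on this route


def check_v2v_inputs(video_urls: list[str], duration: int) -> list[str]:
    """Return violations of the 2.6 video-to-video input limits."""
    problems = []
    if not 1 <= len(video_urls) <= 3:
        problems.append("between 1 and 3 source videos are required")
    if duration not in ALLOWED_DURATIONS:
        problems.append("duration must be 5 or 10")
    for url in video_urls:
        # Assumption: the extension in the URL path reflects the container.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix not in ALLOWED_EXTENSIONS:
            problems.append(f"{url} is not mp4/mov/mkv")
    return problems
```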

Use cases: style transfer, motion reenactment, multi-video fusion.

Model 4: 2.7 text-to-video (Enhanced)

Upgraded version of 2.6 text-to-video with stronger control tools.

{
  "prompt": "Positive prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "audio_url": "Optional custom audio URL",
  "resolution": "720p / 1080p",
  "ratio": "16:9 / 9:16 / 1:1 / 4:3 / 3:4",
  "duration": 5,
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}

Core gains: negative prompt, aspect ratio control (ratio), prompt rewrite (prompt_extend), watermark control, and a reproducible seed.
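One way to keep the nine fields manageable is a small request object that validates and then serializes itself. This is a sketch under assumptions: the class name `T2V27Request` is hypothetical, `seed` is treated as an integer, unset optional fields are omitted from the body, and the 2-15 second range is taken from the snapshot table.

```python
from dataclasses import dataclass, asdict
from typing import Optional

RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")


@dataclass
class T2V27Request:
    prompt: str
    negative_prompt: Optional[str] = None
    audio_url: Optional[str] = None
    resolution: str = "720p"
    ratio: str = "16:9"
    duration: int = 5
    prompt_extend: bool = True
    watermark: bool = False
    seed: Optional[int] = None             # assumption: integer seed

    def to_payload(self) -> dict:
        """Validate the documented limits and drop unset optional fields."""
        if not self.prompt or len(self.prompt) > 5000:
            raise ValueError("prompt must be 1-5000 characters")
        if self.negative_prompt is not None and len(self.negative_prompt) > 500:
            raise ValueError("negative_prompt is limited to 500 characters")
        if self.resolution not in ("720p", "1080p"):
            raise ValueError("resolution must be 720p or 1080p")
        if self.ratio not in RATIOS:
            raise ValueError(f"ratio must be one of {RATIOS}")
        if not 2 <= self.duration <= 15:
            raise ValueError("duration must be between 2 and 15 seconds")
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Fixing `seed` and keeping every other field identical is what makes a run repeatable, so storing the whole payload next to each output is a cheap way to reproduce a result later.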

Use cases: professional production requiring fine-grained generation direction.

Model 5: 2.7 image-to-video (Enhanced)

Upgraded image animation model with keyframe and audio-driven options.

{
  "prompt": "Positive prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "first_frame_url": "First frame image URL",
  "last_frame_url": "Last frame image URL",
  "first_clip_url": "First clip URL (continuation)",
  "driving_audio_url": "Driving audio URL",
  "resolution": "720p / 1080p",
  "duration": 5,
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}

Core gains: first+last frame control, audio-driven motion, continuation workflow.
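Which of these controls a request actually uses depends on which optional fields are filled in. The sketch below only inspects a payload and reports the active controls; `active_controls` is a hypothetical helper, and whether the route accepts every combination of these fields is not stated here, so the function makes no claim about validity.

```python
def active_controls(payload: dict) -> list[str]:
    """List the 2.7 image-to-video controls that a payload switches on."""
    controls = []
    has_first = bool(payload.get("first_frame_url"))
    has_last = bool(payload.get("last_frame_url"))
    if has_first and has_last:
        controls.append("first+last frame")
    elif has_first:
        controls.append("first frame only")
    elif has_last:
        controls.append("last frame only")
    if payload.get("driving_audio_url"):
        controls.append("audio-driven motion")
    if payload.get("first_clip_url"):
        controls.append("continuation")
    return controls
```

For example, a payload with both frame URLs and a `driving_audio_url` reports `["first+last frame", "audio-driven motion"]`.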

Use cases: storyboard animation, audio-driven characters, clip continuation.

Model 6: 2.7 video-edit

Dedicated editing route for existing videos with optional visual references.

{
  "prompt": "Text prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "video_url": "Source video URL",
  "reference_image": "Reference image URL (optional)",
  "resolution": "720p / 1080p",
  "aspect_ratio": "16:9 / 9:16 / 1:1 / 4:3 / 3:4",
  "duration": 0,
  "audio_setting": "auto / origin",
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}

Key notes: video_url is required; duration 0 keeps the source length, while a value from 2 to 10 keeps only the first N seconds of the source.
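The duration rule can be worked through as a short function. `edited_length` is a hypothetical helper; capping the result at the source length when the source is shorter than N is an assumption, since the note above does not say what happens in that case.

```python
def edited_length(source_seconds: float, duration: int = 0) -> float:
    """Return the expected output length for a 2.7 video-edit request."""
    if duration == 0:
        return source_seconds              # 0 keeps the full source length
    if 2 <= duration <= 10:
        # Assumption: a source shorter than N seconds is not padded.
        return min(float(duration), source_seconds)
    raise ValueError("duration must be 0 or between 2 and 10")
```

With an 8-second source, `duration=0` gives 8 seconds and `duration=5` gives the first 5 seconds.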

Use cases: outfit replacement, background swap, local scene edits.

Model 7: 2.7 r2v

The most advanced Wan variant, built for multimodal reference fusion.

{
  "prompt": "Text prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "reference_image": ["Image URLs (max 5)"],
  "reference_video": ["Video URLs (max 5)"],
  "first_frame": "First frame image URL",
  "reference_voice": "Voice URL",
  "resolution": "720p / 1080p",
  "aspect_ratio": "16:9 / 9:16 / 1:1 / 4:3 / 3:4",
  "duration": 5,
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}

Limits: image and video references combined must not exceed 5; the reference voice must be 1-10s long and at most 15MB; when a first frame is supplied, it overrides aspect_ratio.
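Because the reference cap applies to images and videos together, it is worth checking the combined count rather than each list separately. A minimal sketch, with hypothetical helper names and "15MB" read as 15 MiB by assumption:

```python
from typing import Optional

MAX_REFS = 5
MAX_VOICE_BYTES = 15 * 1024 * 1024         # assumption: "15MB" read as 15 MiB


def r2v_problems(reference_image: list[str], reference_video: list[str],
                 voice_seconds: Optional[float] = None,
                 voice_bytes: Optional[int] = None) -> list[str]:
    """Return violations of the 2.7 r2v reference limits."""
    problems = []
    total = len(reference_image) + len(reference_video)
    if total > MAX_REFS:
        problems.append(f"{total} references given; images + videos may total at most {MAX_REFS}")
    if voice_seconds is not None and not 1 <= voice_seconds <= 10:
        problems.append("reference voice must be 1-10 seconds long")
    if voice_bytes is not None and voice_bytes > MAX_VOICE_BYTES:
        problems.append("reference voice exceeds 15MB")
    return problems


def effective_aspect_ratio(aspect_ratio: str, first_frame: Optional[str]) -> str:
    """A supplied first frame takes precedence over the aspect_ratio field."""
    return "from first frame" if first_frame else aspect_ratio
```

Three images plus two videos pass; three plus three fail, even though each list on its own stays under 5.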

Use cases: complex character control and multimodal consistency generation.

III. Seven-Model Comparison Summary

3.1 Core feature matrix

Model	Text to video	Image to video	Video editing	First/last frame	Audio driving	Reference fusion
2.6 text-to-video	✅	❌	❌	❌	❌	❌
2.6 image-to-video	❌	✅	❌	❌	❌	❌
2.6 video-to-video	❌	❌	✅	❌	❌	❌
2.7 text-to-video	✅	❌	❌	❌	✅	❌
2.7 image-to-video	❌	✅	❌	✅	✅	❌
2.7 video-edit	❌	❌	✅	❌	❌	✅
2.7 r2v	✅	✅	✅	✅	✅	✅

3.2 Parameter complexity matrix

Model	Parameter count	Learning curve	Best for
2.6 text-to-video	3	Easy	Beginners
2.6 image-to-video	4	Easy	Beginners
2.6 video-to-video	4	Easy	Beginners
2.7 text-to-video	9	Medium	Intermediate users
2.7 image-to-video	11	Medium	Intermediate users
2.7 video-edit	11	Medium	Intermediate users
2.7 r2v	12	Complex	Advanced users

IV. Model Selection Decision Tree

What is your task?
|
|-- Generate video from text
|   |-- Fast and simple needs -> 2.6 text-to-video
|   `-- Precise control needed -> 2.7 text-to-video
|
|-- Animate still images
|   |-- Basic animation -> 2.6 image-to-video
|   `-- Keyframe/audio control needed -> 2.7 image-to-video
|
|-- Edit existing videos
|   |-- Style transfer -> 2.6 video-to-video
|   `-- Local edits and replacement -> 2.7 video-edit
|
`-- Complex multimodal generation
    `-- Image + video + audio fusion -> 2.7 r2v
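The tree above can be sketched as a lookup. The function name `choose_model`, the task labels, and the single `advanced` flag are illustrative assumptions; the flag stands for "precise control", "keyframe/audio control", or "local edits and replacement", depending on the branch.

```python
ROUTES = {
    ("text", False): "2.6 text-to-video",
    ("text", True): "2.7 text-to-video",
    ("image", False): "2.6 image-to-video",
    ("image", True): "2.7 image-to-video",
    ("video", False): "2.6 video-to-video",   # style transfer
    ("video", True): "2.7 video-edit",        # local edits and replacement
}


def choose_model(task: str, advanced: bool = False) -> str:
    """Map a task type to a Wan variant, following the decision tree."""
    if task == "multimodal":
        return "2.7 r2v"                       # single branch, no flag needed
    try:
        return ROUTES[(task, advanced)]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
```

For instance, `choose_model("video", advanced=True)` returns `"2.7 video-edit"`.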

V. Final Recommendations

Use case	Recommended model	Core reason
Fast text-to-video	2.6 text-to-video	Simple parameters, up to 15s
Precise text-to-video	2.7 text-to-video	Negative prompt + ratio control
Static image animation	2.6 image-to-video	Simple and direct workflow
First/last-frame animation	2.7 image-to-video	Keyframe + audio-driven control
Video style transfer	2.6 video-to-video	Multi-video input support
Local video editing	2.7 video-edit	Reference image + precise edits
Complex multimodal generation	2.7 r2v	All capabilities integrated

Quick summary:

Daily text generation: 2.6 text-to-video
Precision control: 2.7 text-to-video
Static image animation: 2.6/2.7 image-to-video
Video style transfer: 2.6 video-to-video
Video editing and replacement: 2.7 video-edit
Complex multimodal reference: 2.7 r2v

Ready to start? All seven parameter sets can be tested directly on the Wan routes. Start from the hub page: /home/wan.