Wan Video Model Family Comparison: How to Choose the Right Version Across Seven Core Workflows

Introduction: Alibaba Cloud's Wan Video Empire

As of 2026, the Wan family covers a full AI video lineup: text-to-video, image-driven animation, video editing, and multimodal reference fusion. In public performance discussions, Wan2.1 has been described as one of the strongest video foundation models in practical evaluations.

For most users, though, the same question keeps coming up: what actually differs between text-to-video, image-to-video, video-to-video, video-edit, and r2v, and which version should you try first?

This article maps all seven Wan variants based on complete API parameter structures and helps you make route-level decisions quickly.

Wan tool hub: /home/wan

I. Family Snapshot: One Table to Understand All Seven

No. | Model version | Core function | Input type | Duration | Unique advantage | Best scenario
1 | 2.6 text-to-video | Text to video | Prompt | 5/10/15s | Up to 15s, detail-friendly | Creative generation, ad drafts
2 | 2.6 image-to-video | Image to video | One image + prompt | 5/10/15s | Image-driven animation | Animate static visuals
3 | 2.6 video-to-video | Video to video | 1-3 videos + prompt | 5/10s | Multi-video blending | Style transfer, reenactment
4 | 2.7 text-to-video | Enhanced text to video | Prompt + negative prompt | 2-15s | Negative prompt + rewrite | Precise generation control
5 | 2.7 image-to-video | Enhanced image to video | First/last frame + audio | 2-15s | Keyframe and audio-driven | Storyboard-like animation
6 | 2.7 video-edit | Video editing | Video + ref image + prompt | 2-10s | Local editing + ref control | Outfit and background swap
7 | 2.7 r2v | Reference-to-video | Image + video + audio | 2-10s | Multimodal fusion | Complex character consistency

II. Seven Models Deep Dive

Model 1: 2.6 text-to-video (Standard)

This is the base Wan text-to-video model that converts prompts directly into short videos.

{
  "prompt": "Text prompt (max 5000 chars)",
  "duration": "5 / 10 / 15",
  "resolution": "720p / 1080p"
}

Highlights: up to 15 seconds; bilingual prompts; suitable for longer short-form concepts.

Use cases: ad drafts, concept videos, ASMR-style visuals, product demos.
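
As a minimal sketch of how the three parameters above could be checked before a request is sent, the helper below builds a payload as a plain dict. The function name build_t2v_26_payload is hypothetical, and sending duration as an integer (rather than a string) is an assumption; the limits themselves are taken from the structure above.

```python
# Limits copied from the 2.6 text-to-video structure above.
MAX_PROMPT_CHARS = 5000
ALLOWED_DURATIONS = (5, 10, 15)          # seconds
ALLOWED_RESOLUTIONS = ("720p", "1080p")


def build_t2v_26_payload(prompt: str, duration: int = 5,
                         resolution: str = "720p") -> dict:
    """Validate inputs and return a request body as a plain dict.

    Hypothetical helper: the integer type for duration is an assumption.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    if duration not in ALLOWED_DURATIONS:
        raise ValueError(f"duration must be one of {ALLOWED_DURATIONS}")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
    return {"prompt": prompt, "duration": duration, "resolution": resolution}
```

Calling build_t2v_26_payload("A paper boat drifting down a rainy street", duration=10) returns a dict ready to serialize as JSON, while an unsupported duration such as 7 raises ValueError before any request is made.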

Model 2: 2.6 image-to-video

Turns one static image plus a prompt into animated output.

{
  "prompt": "Video prompt",
  "image_urls": ["Image URL (max 1)"],
  "duration": "5 / 10 / 15",
  "resolution": "720p / 1080p"
}

Limits: one image only; min 256x256; max 10MB; jpeg/png/webp.

Use cases: animate old photos, illustration motion, dynamic product display.
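
The image limits above lend themselves to a local pre-check. The sketch below assumes you have already measured the image yourself; ImageInfo and check_i2v_26_images are hypothetical names, "10MB" is read as 10 MiB, and accepting "jpg" as an alias for "jpeg" is also an assumption.

```python
from dataclasses import dataclass

MAX_IMAGES = 1
MIN_SIDE_PX = 256
MAX_BYTES = 10 * 1024 * 1024                      # assumption: 10MB read as 10 MiB
ALLOWED_FORMATS = {"jpeg", "jpg", "png", "webp"}  # "jpg" alias is an assumption


@dataclass
class ImageInfo:
    """Locally measured facts about one candidate image (hypothetical type)."""
    width: int
    height: int
    size_bytes: int
    fmt: str


def check_i2v_26_images(images: list[ImageInfo]) -> list[str]:
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    if len(images) != MAX_IMAGES:
        problems.append(f"expected exactly {MAX_IMAGES} image, got {len(images)}")
    for i, img in enumerate(images):
        if img.width < MIN_SIDE_PX or img.height < MIN_SIDE_PX:
            problems.append(f"image {i}: smaller than {MIN_SIDE_PX}x{MIN_SIDE_PX}")
        if img.size_bytes > MAX_BYTES:
            problems.append(f"image {i}: larger than 10MB")
        if img.fmt.lower() not in ALLOWED_FORMATS:
            problems.append(f"image {i}: unsupported format {img.fmt!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a UI show every issue with an upload at once.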

Model 3: 2.6 video-to-video

Generates a new video from source video inputs plus a prompt.

{
  "prompt": "Video prompt",
  "video_urls": ["Video URL (max 3)"],
  "duration": "5 / 10",
  "resolution": "720p / 1080p"
}

Highlights: up to 3 source videos; duration supports 5s or 10s only; mp4/mov/mkv; max 10MB each.

Use cases: style transfer, motion reenactment, multi-video fusion.
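
One way to catch the video-to-video limits early is sketched below. The helper names are hypothetical, and inferring the container from the URL's file extension is an assumption (signed or extension-less URLs would need a different check); the 10MB size limit cannot be verified from a URL alone and is left out.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

MAX_SOURCE_VIDEOS = 3
V2V_DURATIONS = (5, 10)                  # seconds; 15 is not offered on this route
VIDEO_EXTENSIONS = {"mp4", "mov", "mkv"}


def video_extension(url: str) -> str:
    """Lower-case file extension of the URL path, ignoring any query string."""
    return PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")


def build_v2v_26_payload(prompt: str, video_urls: list[str],
                         duration: int = 5, resolution: str = "720p") -> dict:
    """Validate the source list and return a request body (hypothetical helper)."""
    if not 1 <= len(video_urls) <= MAX_SOURCE_VIDEOS:
        raise ValueError(f"expected 1-{MAX_SOURCE_VIDEOS} source videos")
    for url in video_urls:
        if video_extension(url) not in VIDEO_EXTENSIONS:
            raise ValueError(f"unsupported video format: {url}")
    if duration not in V2V_DURATIONS:
        raise ValueError(f"duration must be one of {V2V_DURATIONS}")
    if resolution not in ("720p", "1080p"):
        raise ValueError("resolution must be 720p or 1080p")
    return {"prompt": prompt, "video_urls": list(video_urls),
            "duration": duration, "resolution": resolution}
```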

Model 4: 2.7 text-to-video (Enhanced)

Upgraded version of 2.6 text-to-video with stronger control tools.

{
  "prompt": "Positive prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "audio_url": "Optional custom audio URL",
  "resolution": "720p / 1080p",
  "ratio": "16:9 / 9:16 / 1:1 / 4:3 / 3:4",
  "duration": 5,
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}

Core gains: negative prompt, aspect ratio control (ratio), prompt rewrite (prompt_extend), watermark control, reproducible seed, and an optional custom audio track (audio_url).

Use cases: professional production requiring fine-grained generation direction.
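
To illustrate how the extra 2.7 controls fit together, here is a hedged sketch of a payload builder. The name build_t2v_27_payload is hypothetical; omitting unset optional fields from the body, and treating seed as an integer, are assumptions. The 2-15 second range comes from the family table in Section I.

```python
from typing import Optional

RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")


def build_t2v_27_payload(prompt: str, *,
                         negative_prompt: Optional[str] = None,
                         ratio: str = "16:9", resolution: str = "720p",
                         duration: int = 5, seed: Optional[int] = None,
                         prompt_extend: bool = True, watermark: bool = False,
                         audio_url: Optional[str] = None) -> dict:
    """Return a 2.7 text-to-video request body (hypothetical helper)."""
    if not prompt or len(prompt) > 5000:
        raise ValueError("prompt must be 1-5000 characters")
    if negative_prompt is not None and len(negative_prompt) > 500:
        raise ValueError("negative_prompt must be at most 500 characters")
    if ratio not in RATIOS:
        raise ValueError(f"ratio must be one of {RATIOS}")
    if resolution not in ("720p", "1080p"):
        raise ValueError("resolution must be 720p or 1080p")
    if not 2 <= duration <= 15:          # range listed in the family table
        raise ValueError("duration must be between 2 and 15 seconds")
    payload = {"prompt": prompt, "resolution": resolution, "ratio": ratio,
               "duration": duration, "prompt_extend": prompt_extend,
               "watermark": watermark}
    # Assumption: optional fields are simply left out when not set.
    optional = {"negative_prompt": negative_prompt,
                "audio_url": audio_url, "seed": seed}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

Pinning seed while changing only negative_prompt is a simple way to compare two runs, since every other field in the two payloads stays identical.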

Model 5: 2.7 image-to-video (Enhanced)

Upgraded image animation model with keyframe and audio-driven options.

{
  "prompt": "Positive prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "first_frame_url": "First frame image URL",
  "last_frame_url": "Last frame image URL",
  "first_clip_url": "First clip URL (continuation)",
  "driving_audio_url": "Driving audio URL",
  "resolution": "720p / 1080p",
  "duration": 5,
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}

Core gains: first+last frame control (first_frame_url, last_frame_url), audio-driven motion (driving_audio_url), and a continuation workflow (first_clip_url).

Use cases: storyboard animation, audio-driven characters, clip continuation.
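
The three workflows map onto different optional fields, which the sketch below makes explicit. Both function names are hypothetical, the builder covers only the prompt and the four media fields, and the source does not say which field combinations the API accepts, so nothing here enforces one; the code only reports which workflows a given payload exercises.

```python
from typing import Optional


def build_i2v_27_payload(prompt: str, *,
                         first_frame_url: Optional[str] = None,
                         last_frame_url: Optional[str] = None,
                         first_clip_url: Optional[str] = None,
                         driving_audio_url: Optional[str] = None) -> dict:
    """Return a request body containing only the fields that were set."""
    fields = {"first_frame_url": first_frame_url,
              "last_frame_url": last_frame_url,
              "first_clip_url": first_clip_url,
              "driving_audio_url": driving_audio_url}
    payload = {"prompt": prompt}
    payload.update({k: v for k, v in fields.items() if v is not None})
    return payload


def describe_i2v_27_workflows(payload: dict) -> list[str]:
    """List which of the three 2.7 image-to-video workflows a payload uses."""
    workflows = []
    if payload.get("first_frame_url") and payload.get("last_frame_url"):
        workflows.append("first+last frame control")
    if payload.get("driving_audio_url"):
        workflows.append("audio-driven motion")
    if payload.get("first_clip_url"):
        workflows.append("continuation")
    return workflows
```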

Model 6: 2.7 video-edit

Dedicated editing route for existing videos with optional visual references.

{
  "prompt": "Text prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "video_url": "Source video URL",
  "reference_image": "Reference image URL (optional)",
  "resolution": "720p / 1080p",
  "aspect_ratio": "16:9 / 9:16 / 1:1 / 4:3 / 3:4",
  "duration": 0,
  "audio_setting": "auto / origin",
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}

Key notes: video_url is required; duration 0 keeps the source length, while a value in [2, 10] trims the output to the first N seconds of the source.

Use cases: outfit replacement, background swap, local scene edits.
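
The duration rule for video-edit is easy to get wrong, so a small sketch may help. The function names are hypothetical, and the rule is read here as "0 keeps the source length, 2 to 10 keeps the first N seconds"; leaving reference_image out of the body when it is unset is an assumption.

```python
from typing import Optional

AUDIO_SETTINGS = ("auto", "origin")


def describe_edit_duration(duration: int) -> str:
    """Explain what a video-edit duration value means, or raise ValueError."""
    if duration == 0:
        return "keep source length"
    if 2 <= duration <= 10:
        return f"first {duration} seconds of the source"
    raise ValueError("duration must be 0 or an integer from 2 to 10")


def build_video_edit_payload(prompt: str, video_url: str, *,
                             reference_image: Optional[str] = None,
                             duration: int = 0,
                             audio_setting: str = "auto") -> dict:
    """Return a 2.7 video-edit request body (hypothetical helper)."""
    if not video_url:
        raise ValueError("video_url is required")
    describe_edit_duration(duration)     # raises on an invalid value
    if audio_setting not in AUDIO_SETTINGS:
        raise ValueError(f"audio_setting must be one of {AUDIO_SETTINGS}")
    payload = {"prompt": prompt, "video_url": video_url,
               "duration": duration, "audio_setting": audio_setting}
    if reference_image is not None:      # optional visual reference
        payload["reference_image"] = reference_image
    return payload
```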

Model 7: 2.7 r2v

Most advanced Wan variant for multimodal reference fusion.

{
  "prompt": "Text prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "reference_image": ["Image URLs (max 5)"],
  "reference_video": ["Video URLs (max 5)"],
  "first_frame": "First frame image URL",
  "reference_voice": "Voice URL",
  "resolution": "720p / 1080p",
  "aspect_ratio": "16:9 / 9:16 / 1:1 / 4:3 / 3:4",
  "duration": 5,
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}

Limits: the combined number of image and video references must not exceed 5; the reference voice must be 1-10s long and at most 15MB; when a first frame is supplied, it overrides the aspect ratio setting.

Use cases: complex character control and multimodal consistency generation.
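
Because the r2v limits interact (the reference cap applies to images and videos together), a pre-flight check can save failed requests. The sketch below is illustrative only: check_r2v_inputs is a hypothetical name, "15MB" is read as 15 MiB, and the voice length and size are assumed to be measured locally by the caller.

```python
from typing import Optional

MAX_TOTAL_REFS = 5                       # images and videos counted together
MAX_VOICE_BYTES = 15 * 1024 * 1024       # assumption: 15MB read as 15 MiB


def check_r2v_inputs(reference_image: list[str], reference_video: list[str], *,
                     voice_seconds: Optional[float] = None,
                     voice_bytes: Optional[int] = None,
                     first_frame: Optional[str] = None,
                     aspect_ratio: Optional[str] = None) -> list[str]:
    """Return error and warning notes; an empty list means nothing to flag."""
    notes = []
    total = len(reference_image) + len(reference_video)
    if total > MAX_TOTAL_REFS:
        notes.append(f"error: {total} references, limit is {MAX_TOTAL_REFS} combined")
    if voice_seconds is not None and not 1.0 <= voice_seconds <= 10.0:
        notes.append("error: reference voice must be 1-10s long")
    if voice_bytes is not None and voice_bytes > MAX_VOICE_BYTES:
        notes.append("error: reference voice exceeds 15MB")
    if first_frame and aspect_ratio:
        # Not an error: the first frame takes precedence over the ratio.
        notes.append("warning: first_frame overrides aspect_ratio")
    return notes
```

For example, three image references plus three video references produce a single error note, even though each list on its own is under the per-list maximum of 5.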

III. Seven-Model Comparison Summary

3.1 Core feature matrix

Model | Text to video | Image to video | Video editing | First/last frame | Audio driving | Reference fusion
2.6 text-to-video
2.6 image-to-video
2.6 video-to-video
2.7 text-to-video
2.7 image-to-video
2.7 video-edit
2.7 r2v

3.2 Parameter complexity matrix

Model | Parameter count | Learning curve | Best for
2.6 text-to-video | 3 | Easy | Beginners
2.6 image-to-video | 4 | Easy | Beginners
2.6 video-to-video | 4 | Easy | Beginners
2.7 text-to-video | 9 | Medium | Intermediate users
2.7 image-to-video | 11 | Medium | Intermediate users
2.7 video-edit | 11 | Medium | Intermediate users
2.7 r2v | 12 | Complex | Advanced users

IV. Model Selection Decision Tree

What is your task?
|
|-- Generate video from text
|   |-- Fast and simple needs -> 2.6 text-to-video
|   `-- Precise control needed -> 2.7 text-to-video
|
|-- Animate still images
|   |-- Basic animation -> 2.6 image-to-video
|   `-- Keyframe/audio control needed -> 2.7 image-to-video
|
|-- Edit existing videos
|   |-- Style transfer -> 2.6 video-to-video
|   `-- Local edits and replacement -> 2.7 video-edit
|
`-- Complex multimodal generation
    `-- Image + video + audio fusion -> 2.7 r2v
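
The tree above can also be expressed as a lookup table, which is handy when routing requests in code. This is a sketch only: the task and need labels ("text", "precise", and so on) and the function name choose_wan_model are hypothetical; the model names on the right are the ones from the tree.

```python
# (task, need) -> model, mirroring the branches of the decision tree above.
ROUTES = {
    ("text", "simple"): "2.6 text-to-video",
    ("text", "precise"): "2.7 text-to-video",
    ("image", "basic"): "2.6 image-to-video",
    ("image", "keyframe_or_audio"): "2.7 image-to-video",
    ("video", "style_transfer"): "2.6 video-to-video",
    ("video", "local_edit"): "2.7 video-edit",
    ("multimodal", "fusion"): "2.7 r2v",
}


def choose_wan_model(task: str, need: str) -> str:
    """Return the model suggested by the tree for a (task, need) pair."""
    try:
        return ROUTES[(task, need)]
    except KeyError:
        valid = ", ".join(f"{t}/{n}" for t, n in ROUTES)
        raise ValueError(f"unknown route {task}/{need}; expected one of: {valid}")
```

Keeping the routes in a dict means a new branch is a one-line change, and choose_wan_model("video", "local_edit") resolves to "2.7 video-edit" without any nested conditionals.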

V. Final Recommendations

Use case | Recommended model | Core reason
Fast text-to-video | 2.6 text-to-video | Simple parameters, up to 15s
Precise text-to-video | 2.7 text-to-video | Negative prompt + ratio control
Static image animation | 2.6 image-to-video | Simple and direct workflow
First/last-frame animation | 2.7 image-to-video | Keyframe + audio-driven control
Video style transfer | 2.6 video-to-video | Multi-video input support
Local video editing | 2.7 video-edit | Reference image + precise edits
Complex multimodal generation | 2.7 r2v | All capabilities integrated

Quick summary:

  • Daily text generation: 2.6 text-to-video
  • Precision control: 2.7 text-to-video
  • Static image animation: 2.6/2.7 image-to-video
  • Video editing and replacement: 2.7 video-edit
  • Complex multimodal reference: 2.7 r2v

Ready to start? All seven parameter sets can be tested directly in Wan routes. Explore from the hub page: /home/wan.