Introduction: Alibaba Cloud's Wan Video Empire
In 2026, the Wan family is growing into a complete AI video lineup, covering text-to-video, image-driven animation, video editing, and multimodal reference fusion. In public performance discussions, Wan2.1 is frequently described as one of the strongest video foundation models in practical evaluations.
For most users, though, the same questions keep coming up: what are the real differences between text-to-video, image-to-video, video-to-video, video-edit, and r2v, and which version should you try first?
This article maps all seven Wan variants by their API parameter structures so you can choose a route quickly.
Wan tool hub: /home/wan
I. Family Snapshot: One Table to Understand All Seven
| No. | Model version | Core function | Input type | Duration | Unique advantage | Best scenario |
|---|---|---|---|---|---|---|
| 1 | 2.6 text-to-video | Text to video | Prompt | 5/10/15s | Up to 15s, detail-friendly | Creative generation, ad drafts |
| 2 | 2.6 image-to-video | Image to video | One image + prompt | 5/10/15s | Image-driven animation | Animate static visuals |
| 3 | 2.6 video-to-video | Video to video | 1-3 videos + prompt | 5/10s | Multi-video blending | Style transfer, reenactment |
| 4 | 2.7 text-to-video | Enhanced text to video | Prompt + negative prompt | 2-15s | Negative prompt + rewrite | Precise generation control |
| 5 | 2.7 image-to-video | Enhanced image to video | First/last frame + audio | 2-15s | Keyframe and audio-driven | Storyboard-like animation |
| 6 | 2.7 video-edit | Video editing | Video + ref image + prompt | 2-10s | Local editing + ref control | Outfit and background swap |
| 7 | 2.7 r2v | Reference-to-video | Image + video + audio | 2-10s | Multimodal fusion | Complex character consistency |
II. Seven Models Deep Dive
Model 1: 2.6 text-to-video (Standard)
This is the base Wan text-to-video model that converts prompts directly into short videos.
{
  "prompt": "Text prompt (max 5000 chars)",
  "duration": "5 / 10 / 15",
  "resolution": "720p / 1080p"
}
Highlights: clips of up to 15 seconds; bilingual prompts; enough length for more developed short-form concepts.
Use cases: ad drafts, concept videos, ASMR-style visuals, product demos.
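The three parameters above can be checked locally before a request is sent. The following is a minimal sketch, assuming only the fields and allowed values listed for this model; the helper name `build_t2v_26_payload` is hypothetical, and no endpoint or client library is assumed.

```python
ALLOWED_DURATIONS = {5, 10, 15}          # seconds, from the parameter list above
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
MAX_PROMPT_CHARS = 5000


def build_t2v_26_payload(prompt: str, duration: int = 5,
                         resolution: str = "720p") -> dict:
    """Build a 2.6 text-to-video request body (hypothetical helper)."""
    if not prompt or len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt must be 1 to 5000 characters")
    if duration not in ALLOWED_DURATIONS:
        raise ValueError(f"duration must be one of {sorted(ALLOWED_DURATIONS)}")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")
    # Assumption: duration is sent as a number; the service may expect a string.
    return {"prompt": prompt, "duration": duration, "resolution": resolution}
```

Calling `build_t2v_26_payload("A red kite over a beach", duration=10)` returns a plain dict that can be serialized to JSON; an unsupported duration such as 7 raises `ValueError` before anything is sent.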
Model 2: 2.6 image-to-video
Turns one static image plus a prompt into animated output.
{
  "prompt": "Video prompt",
  "image_urls": ["Image URL (max 1)"],
  "duration": "5 / 10 / 15",
  "resolution": "720p / 1080p"
}
Limits: one image only; min 256x256; max 10MB; jpeg/png/webp.
Use cases: animate old photos, illustration motion, dynamic product display.
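The image limits listed above are easy to pre-check on the client side. This is a rough sketch, assuming the caller already knows the image's pixel size and byte size; the function name `check_i2v_image` is hypothetical, the extension check is only a heuristic (a URL may carry no extension), and reading "10MB" as 10 MiB is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB
MIN_SIDE_PX = 256


def check_i2v_image(url: str, width: int, height: int, size_bytes: int) -> list:
    """Return a list of limit violations; an empty list means the image passes."""
    problems = []
    # Look only at the URL path so query strings do not confuse the check.
    extension = PurePosixPath(urlparse(url).path).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if width < MIN_SIDE_PX or height < MIN_SIDE_PX:
        problems.append(f"image is {width}x{height}, minimum is 256x256")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("file is larger than 10MB")
    return problems
```

Returning a list instead of raising on the first failure lets a caller show every problem with an upload at once.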
Model 3: 2.6 video-to-video
Generates a new video from source video inputs plus a prompt.
{
  "prompt": "Video prompt",
  "video_urls": ["Video URL (max 3)"],
  "duration": "5 / 10",
  "resolution": "720p / 1080p"
}
Highlights: up to 3 source videos; duration supports 5s or 10s only; mp4/mov/mkv; max 10MB each.
Use cases: style transfer, motion reenactment, multi-video fusion.
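For video-to-video, the constraints that differ from the other 2.6 routes are the source count and the shorter duration list. A small sketch under those stated limits follows; `build_v2v_26_payload` is a hypothetical helper, and the file-size limit is not checked here because it needs the actual files.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv"}


def build_v2v_26_payload(prompt: str, video_urls: list, duration: int = 5,
                         resolution: str = "720p") -> dict:
    """Build a 2.6 video-to-video request body (hypothetical helper)."""
    if not 1 <= len(video_urls) <= 3:
        raise ValueError("provide between 1 and 3 source videos")
    if duration not in (5, 10):
        raise ValueError("video-to-video supports 5s or 10s only")
    if resolution not in ("720p", "1080p"):
        raise ValueError("resolution must be 720p or 1080p")
    for url in video_urls:
        extension = PurePosixPath(urlparse(url).path).suffix.lower()
        if extension not in VIDEO_EXTENSIONS:
            raise ValueError(f"unsupported video format in {url}")
    return {"prompt": prompt, "video_urls": list(video_urls),
            "duration": duration, "resolution": resolution}
```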
Model 4: 2.7 text-to-video (Enhanced)
Upgraded version of 2.6 text-to-video with stronger control tools.
{
  "prompt": "Positive prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "audio_url": "Optional custom audio URL",
  "resolution": "720p / 1080p",
  "ratio": "16:9 / 9:16 / 1:1 / 4:3 / 3:4",
  "duration": 5,
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}
Core gains: negative prompt, ratio control, prompt rewrite (prompt_extend), watermark control, reproducible seed.
Use cases: professional production requiring fine-grained generation direction.
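Because most of the 2.7 text-to-video fields are optional, one convenient pattern is to model the request as a dataclass and omit unset fields when serializing. The sketch below assumes the nine fields listed above and the 2-15s duration range from the snapshot table; the class name `T2V27Request` and the integer type for `seed` are assumptions.

```python
from dataclasses import dataclass, asdict
from typing import Optional

RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}


@dataclass
class T2V27Request:
    """Hypothetical container for the 2.7 text-to-video parameters."""
    prompt: str
    negative_prompt: Optional[str] = None
    audio_url: Optional[str] = None
    resolution: str = "720p"
    ratio: str = "16:9"
    duration: int = 5
    prompt_extend: bool = True
    watermark: bool = False
    seed: Optional[int] = None  # assumption: seed is an integer

    def to_payload(self) -> dict:
        if not self.prompt or len(self.prompt) > 5000:
            raise ValueError("prompt must be 1 to 5000 characters")
        if self.negative_prompt and len(self.negative_prompt) > 500:
            raise ValueError("negative_prompt is limited to 500 characters")
        if self.ratio not in RATIOS:
            raise ValueError(f"ratio must be one of {sorted(RATIOS)}")
        if not 2 <= self.duration <= 15:
            raise ValueError("duration must be between 2 and 15 seconds")
        # Leave out optional fields that were never set.
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Fixing `seed` while changing only `negative_prompt` is a simple way to compare two runs that differ in a single control.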
Model 5: 2.7 image-to-video (Enhanced)
Upgraded image animation model with keyframe and audio-driven options.
{
  "prompt": "Positive prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "first_frame_url": "First frame image URL",
  "last_frame_url": "Last frame image URL",
  "first_clip_url": "First clip URL (continuation)",
  "driving_audio_url": "Driving audio URL",
  "resolution": "720p / 1080p",
  "duration": 5,
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}
Core gains: first+last frame control, audio-driven motion, continuation workflow.
Use cases: storyboard animation, audio-driven characters, clip continuation.
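Which workflow a 2.7 image-to-video request triggers depends on which URL fields are filled in. The sketch below simply reports the combination present in a payload; the function name `i2v_features` is hypothetical, and whether the service allows every combination (for example a first frame together with a continuation clip) is not stated in the parameter list, so nothing is rejected here.

```python
def i2v_features(payload: dict) -> list:
    """List the 2.7 image-to-video controls that a payload activates."""
    features = []
    has_first = bool(payload.get("first_frame_url"))
    has_last = bool(payload.get("last_frame_url"))
    if has_first and has_last:
        features.append("first+last frame control")
    elif has_first:
        features.append("first-frame animation")
    elif has_last:
        features.append("last-frame only")
    if payload.get("first_clip_url"):
        features.append("clip continuation")
    if payload.get("driving_audio_url"):
        features.append("audio-driven motion")
    return features
```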
Model 6: 2.7 video-edit
Dedicated editing route for existing videos with optional visual references.
{
  "prompt": "Text prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "video_url": "Source video URL",
  "reference_image": "Reference image URL (optional)",
  "resolution": "720p / 1080p",
  "aspect_ratio": "16:9 / 9:16 / 1:1 / 4:3 / 3:4",
  "duration": 0,
  "audio_setting": "auto / origin",
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}
Key notes: video_url is required; a duration of 0 keeps the source length; a value from 2 to 10 cuts the source down to its first N seconds.
Use cases: outfit replacement, background swap, local scene edits.
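The duration rule for video-edit is the one parameter here whose meaning depends on its value, so it helps to write it out. This sketch reads the note above as "0 keeps the full source, 2 to 10 keeps only the first N seconds", which is an interpretation; the function name and the behavior for sources shorter than N are assumptions.

```python
def video_edit_output_seconds(source_seconds: float, duration: int = 0) -> float:
    """Estimate the edited clip length under the duration rule (hypothetical helper)."""
    if duration == 0:
        # 0 means: keep the source length unchanged.
        return source_seconds
    if not 2 <= duration <= 10:
        raise ValueError("duration must be 0 or an integer from 2 to 10")
    # Assumption: a source shorter than `duration` is not padded.
    return min(float(duration), source_seconds)
```

For an 8-second source, `duration=0` yields 8 seconds and `duration=5` yields 5 seconds.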
Model 7: 2.7 r2v
Most advanced Wan variant for multimodal reference fusion.
{
  "prompt": "Text prompt (max 5000 chars)",
  "negative_prompt": "Negative prompt (max 500 chars)",
  "reference_image": ["Image URLs (max 5)"],
  "reference_video": ["Video URLs (max 5)"],
  "first_frame": "First frame image URL",
  "reference_voice": "Voice URL",
  "resolution": "720p / 1080p",
  "aspect_ratio": "16:9 / 9:16 / 1:1 / 4:3 / 3:4",
  "duration": 5,
  "prompt_extend": true,
  "watermark": false,
  "seed": "Optional"
}
Limits: image and video references combined must not exceed 5; reference voice must be 1-10s long and at most 15MB; a first frame, when provided, overrides aspect_ratio.
Use cases: complex character control and multimodal consistency generation.
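The r2v limits interact: the cap of 5 applies to image and video references together, and a first frame makes `aspect_ratio` redundant. The sketch below encodes that reading; both function names are hypothetical, "15MB" is read as 15 MiB, and dropping `aspect_ratio` from the payload is a client-side choice, since the source does not say whether the service ignores or rejects it.

```python
MAX_TOTAL_REFERENCES = 5
MAX_VOICE_BYTES = 15 * 1024 * 1024  # assumption: "15MB" read as 15 MiB


def check_r2v_references(image_urls: list, video_urls: list,
                         voice_seconds=None, voice_bytes=None) -> list:
    """Return a list of r2v limit violations; empty means the inputs pass."""
    problems = []
    total = len(image_urls) + len(video_urls)
    if total > MAX_TOTAL_REFERENCES:
        problems.append(f"{total} references given, combined limit is 5")
    if voice_seconds is not None and not 1 <= voice_seconds <= 10:
        problems.append("reference voice must be 1 to 10 seconds long")
    if voice_bytes is not None and voice_bytes > MAX_VOICE_BYTES:
        problems.append("reference voice is larger than 15MB")
    return problems


def drop_overridden_ratio(payload: dict) -> dict:
    """Return a copy without aspect_ratio when first_frame is set."""
    if payload.get("first_frame"):
        return {k: v for k, v in payload.items() if k != "aspect_ratio"}
    return dict(payload)
```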
III. Seven-Model Comparison Summary
3.1 Core feature matrix
| Model | Text to video | Image to video | Video editing | First/last frame | Audio driving | Reference fusion |
|---|---|---|---|---|---|---|
| 2.6 text-to-video | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 2.6 image-to-video | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| 2.6 video-to-video | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| 2.7 text-to-video | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| 2.7 image-to-video | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ |
| 2.7 video-edit | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ |
| 2.7 r2v | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
3.2 Parameter complexity matrix
| Model | Parameter count | Learning curve | Best for |
|---|---|---|---|
| 2.6 text-to-video | 3 | Easy | Beginners |
| 2.6 image-to-video | 4 | Easy | Beginners |
| 2.6 video-to-video | 4 | Easy | Beginners |
| 2.7 text-to-video | 9 | Medium | Intermediate users |
| 2.7 image-to-video | 11 | Medium | Intermediate users |
| 2.7 video-edit | 11 | Medium | Intermediate users |
| 2.7 r2v | 12 | Complex | Advanced users |
IV. Model Selection Decision Tree
What is your task?
|
|-- Generate video from text
|   |-- Fast and simple needs -> 2.6 text-to-video
|   `-- Precise control needed -> 2.7 text-to-video
|
|-- Animate still images
|   |-- Basic animation -> 2.6 image-to-video
|   `-- Keyframe/audio control needed -> 2.7 image-to-video
|
|-- Edit existing videos
|   |-- Style transfer -> 2.6 video-to-video
|   `-- Local edits and replacement -> 2.7 video-edit
|
`-- Complex multimodal generation
    `-- Image + video + audio fusion -> 2.7 r2v
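The decision tree can also be written as a small lookup, which is handy when routing requests in code. This is an illustrative sketch only; the function name `pick_wan_model` and the task labels are hypothetical, and the returned strings are the model names used in this article, not API identifiers.

```python
def pick_wan_model(task: str, advanced: bool = False) -> str:
    """Map a task to a Wan variant following the decision tree.

    `advanced` stands for the second branch of each task: precise control
    (text), keyframe/audio control (image), or local edits (video).
    """
    routes = {
        "text": ("2.6 text-to-video", "2.7 text-to-video"),
        "image": ("2.6 image-to-video", "2.7 image-to-video"),
        "video": ("2.6 video-to-video", "2.7 video-edit"),
    }
    if task == "multimodal":
        return "2.7 r2v"
    if task not in routes:
        raise ValueError(f"unknown task: {task}")
    simple, enhanced = routes[task]
    return enhanced if advanced else simple
```

For example, `pick_wan_model("image", advanced=True)` selects 2.7 image-to-video, while `pick_wan_model("video")` selects 2.6 video-to-video for style transfer.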
V. Final Recommendations
| Use case | Recommended model | Core reason |
|---|---|---|
| Fast text-to-video | 2.6 text-to-video | Simple parameters, up to 15s |
| Precise text-to-video | 2.7 text-to-video | Negative prompt + ratio control |
| Static image animation | 2.6 image-to-video | Simple and direct workflow |
| First/last-frame animation | 2.7 image-to-video | Keyframe + audio-driven control |
| Video style transfer | 2.6 video-to-video | Multi-video input support |
| Local video editing | 2.7 video-edit | Reference image + precise edits |
| Complex multimodal generation | 2.7 r2v | All capabilities integrated |
Quick summary:
- Daily text generation: 2.6 text-to-video
- Precision control: 2.7 text-to-video
- Static image animation: 2.6/2.7 image-to-video
- Video editing and replacement: 2.7 video-edit
- Complex multimodal reference: 2.7 r2v
Ready to start? All seven parameter sets can be tested directly on the Wan routes. Start from the hub page: /home/wan.
