Demo Videos of Talking Avatar Generation Models

Sonic

Input: Image without Emotion + Audio with Emotion

Angry Audio

Output | 10sec | DynamicScale=0.5

Output | 10sec | DynamicScale=0.7

Output | 10sec | DynamicScale=1

Crying Audio

Output | 10sec | DynamicScale=0.7

Output | 10sec | DynamicScale=1

Output | 10sec | DynamicScale=1.5

Laugh Audio

Output | 10sec | DynamicScale=0.5

Output | 10sec | DynamicScale=0.7

Output | 10sec | DynamicScale=1

Crying | Angry | Laugh

Output | Crying | 6sec | DynamicScale=0.5

Output | Angry | 6sec | DynamicScale=0.5

Output | Laugh | 6sec | DynamicScale=0.5

Rap Song 1

Output | 10sec | DynamicScale=1

Output | 10sec | DynamicScale=1.5

Output | 10sec | DynamicScale=2

Rap Song 1 | Output Duration: 6sec vs. 10sec

Output | 6sec | DynamicScale=1.5

Output | 10sec | DynamicScale=1.5

Rap Song 2

Output | 10sec | DynamicScale=1

Output | 10sec | DynamicScale=1.5

Output | 10sec | DynamicScale=2

Input: Image with Emotion + Audio with Emotion

Laugh Audio 2

Output | 6sec | DynamicScale=1.1

Output | 6sec | DynamicScale=1.3

Output | 6sec | DynamicScale=1.5

Output | 6sec | DynamicScale=1.8

Output | 6sec | DynamicScale=2

Output | 13sec | DynamicScale=2

Laugh Audio 1

Output | 6sec | DynamicScale=0.7

Output | 6sec | DynamicScale=1

Angry Audio | Output Duration: 6sec vs. 12sec

Output | 6sec | DynamicScale=1

Output | 12sec | DynamicScale=1

Rap 2 | Crying

Output | Rap 2 | 10sec | DynamicScale=1

Output | Crying | 6sec | DynamicScale=1