Demo Videos of Talking Avatar Generation Models
Sonic
Input: Image without Emotion + Audio with Emotion
Angry Audio
Input Image
Output | 10sec | DynamicScale=0.5
Input Image
Output | 10sec | DynamicScale=0.7
Input Image
Output | 10sec | DynamicScale=1
Crying Audio
Input Image
Output | 10sec | DynamicScale=0.7
Input Image
Output | 10sec | DynamicScale=1
Input Image
Output | 10sec | DynamicScale=1.5
Laugh Audio
Input Image
Output | 10sec | DynamicScale=0.5
Input Image
Output | 10sec | DynamicScale=0.7
Input Image
Output | 10sec | DynamicScale=1
Crying | Angry | Laugh
Input Image
Output | Crying | 6sec | DynamicScale=0.5
Input Image
Output | Angry | 6sec | DynamicScale=0.5
Input Image
Output | Laugh | 6sec | DynamicScale=0.5
Rap Song 1
Input Image
Output | 10sec | DynamicScale=1
Input Image
Output | 10sec | DynamicScale=1.5
Input Image
Output | 10sec | DynamicScale=2
Rap Song 1 | Output Duration: 6sec vs. 10sec
Input Image
Output | 6sec | DynamicScale=1.5
Input Image
Output | 10sec | DynamicScale=1.5
Rap Song 2
Input Image
Output | 10sec | DynamicScale=1
Input Image
Output | 10sec | DynamicScale=1.5
Input Image
Output | 10sec | DynamicScale=2
Input: Image with Emotion + Audio with Emotion
Laugh Audio 2
Input Image
Output | 6sec | DynamicScale=1.1
Input Image
Output | 6sec | DynamicScale=1.3
Input Image
Output | 6sec | DynamicScale=1.5
Input Image
Output | 6sec | DynamicScale=1.8
Input Image
Output | 6sec | DynamicScale=2
Input Image
Output | 13sec | DynamicScale=2
Laugh Audio 1
Input Image
Output | 6sec | DynamicScale=0.7
Input Image
Output | 6sec | DynamicScale=1
Angry Audio | Output Duration: 6sec vs. 12sec
Input Image
Output | 6sec | DynamicScale=1
Input Image
Output | 12sec | DynamicScale=1
Rap Song 2 | Crying
Input Image
Output | 10sec | DynamicScale=1 | Rap Song 2
Input Image
Output | 6sec | DynamicScale=1 | Crying