MAGIC-TTS Demo

Fine-Grained Controllable Speech Synthesis

Compare controlled and spontaneous generation across four Chinese scenes and two English scenes, all synthesized with the same default voice.

Scenes: 6
Modes: controlled / spontaneous
Control unit: token duration / local pause

Controlled vs Spontaneous

Scene Comparisons

Scene 01

Navigation Turn

前方路口,左转。 (Intersection ahead, turn left.)

Goal: the action phrase must be reliably audible; natural timing alone is not enough to guarantee that.

v1 Baseline

All content tokens are fixed at 170 ms.

v2 Pause Only

Keep content tokens at 170 ms and set the comma pause before the action phrase to 260 ms.

v3 Pause + Content

Keep the 260 ms pause and stretch “左” and “转” (turn left) to 300 ms each.
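The three variants differ only in which durations are pinned. As a rough illustration, the sketch below assembles per-character duration plans for them. The helper names (`build_plan`, `pinned_total_ms`), their parameters, and the `(token, duration_ms)` representation are hypothetical and are not MAGIC-TTS's actual interface; a duration of `None` stands for "left unspecified", on the assumption that the model would fill it in.

```python
# Hypothetical helpers for illustration only; not part of MAGIC-TTS.
COMMAS = {",", ","}                # half-width and full-width comma
OTHER_PUNCT = set("。.!?!?;;、")   # punctuation whose pause is never pinned here


def build_plan(text, content_ms=170, comma_pause_ms=None, stretch=None):
    """Return a list of (token, duration_ms) pairs, one per character.

    Content characters get content_ms unless overridden in `stretch`.
    Commas get comma_pause_ms; other punctuation is left unspecified (None).
    """
    stretch = stretch or {}
    plan = []
    for ch in text:
        if ch.isspace():
            continue  # whitespace carries no duration of its own
        if ch in COMMAS:
            plan.append((ch, comma_pause_ms))
        elif ch in OTHER_PUNCT:
            plan.append((ch, None))
        else:
            plan.append((ch, stretch.get(ch, content_ms)))
    return plan


def pinned_total_ms(plan):
    """Sum of all explicitly pinned durations in a plan."""
    return sum(ms for _, ms in plan if ms is not None)


text = "前方路口,左转。"
v1 = build_plan(text)                                   # Baseline
v2 = build_plan(text, comma_pause_ms=260)               # Pause Only
v3 = build_plan(text, comma_pause_ms=260,
                stretch={"左": 300, "转": 300})          # Pause + Content
```

Under these assumptions the pinned durations sum to 1020 ms for v1, 1280 ms for v2, and 1540 ms for v3. Keying `stretch` by character is a simplification: a sentence that repeats a character would need position-based overrides instead.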

Scene 02

Kids Reading

请跟我读,苹果。 (Read after me: apple.)

Goal: in a guided-reading setting, the syllables of the target word should be clearly drawn out.

v1 Baseline

All content tokens are fixed at 170 ms.

v2 Pause Only

Set the comma pause before the target word to 260 ms.

v3 Pause + Content

Keep the 260 ms pause and stretch “苹” and “果” (apple) to 300 ms each.

Scene 03

Accessibility Code

验证码是三七九,二一八。 (The verification code is 379, 218.)

Goal: when a digit string is read out, the grouping and the duration of key digits must be controlled explicitly rather than left to fluency alone.

v1 Baseline

All content tokens are fixed at 170 ms.

v2 Pause Only

Set the comma pause at the boundary between the two 3-digit groups to 260 ms.

v3 Pause + Content

Keep the 260 ms pause and stretch all six digits to 300 ms each.

Scene 04

Station Arrival

前方到站,五山站。 (Next stop: Wushan Station.)

Goal: the lead-in phrase and the station name should be separated, and the station name itself should be emphasized.

v1 Baseline

All content tokens are fixed at 170 ms.

v2 Pause Only

Set the comma pause before the station name to 260 ms.

v3 Pause + Content

Keep the 260 ms pause and stretch “五”, “山”, and “站” (Wushan Station) to 300 ms each.

English Showcase

English Showcase Demos

Scene 05

Contrastive Meaning

I said red notebook, not read notebook.

Goal: make the corrective contrast between the near-homophones clearer through millisecond-level control of pauses and content duration.

Spontaneous

No explicit duration is provided; the model handles the red/read contrast on its own.

Controlled

Controlled text: I said red{420} notebook,[160] not read{620} notebook.

Here a number in braces sets the duration of the preceding word and a number in brackets inserts a pause, both in milliseconds: red is set to 420 ms, read to 620 ms, and a 160 ms pause is inserted before the correction.
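To make the annotation format concrete, here is a minimal Python sketch of a parser for it. The grammar is inferred only from the example above (`word{N}` pins a word's duration, `[N]` inserts a pause, both in milliseconds); the function name `parse_control_text` and the event tuples are hypothetical, and the demo's real parser may differ.

```python
import re

# Either a bracketed pause "[N]" or a word with an optional "{N}" duration.
# Inferred from the demo's example text; not an official grammar.
TOKEN_RE = re.compile(r"\[(\d+)\]|([^\s\[\]{}]+)(?:\{(\d+)\})?")


def parse_control_text(text):
    """Split a controlled text into (kind, token, duration_ms) events.

    kind is "pause" or "word"; duration_ms is None when no explicit
    duration was given, meaning timing is left to the model.
    """
    events = []
    for match in TOKEN_RE.finditer(text):
        pause_ms, word, word_ms = match.groups()
        if pause_ms is not None:
            events.append(("pause", None, int(pause_ms)))
        else:
            events.append(("word", word, int(word_ms) if word_ms else None))
    return events


events = parse_control_text(
    "I said red{420} notebook,[160] not read{620} notebook."
)
```

Punctuation stays attached to the preceding word (for example `notebook,`), and stray braces or brackets that do not match the pattern are skipped silently, so a stricter parser would want to validate its input first.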

Scene 06

Dramatic Line

She opened the door and whispered, run.

Goal: convey suspense and dramatic tension in English narration through millisecond-level control of pauses and content duration.

Spontaneous

No explicit duration is provided; the model narrates the whole line on its own.

Controlled

Controlled text: She opened the door and whispered,[180] run{560}.

This inserts a 180 ms pause before run and stretches run to 560 ms.
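One quick sanity check for a controlled text is that removing the annotations should recover the plain sentence used in the spontaneous condition. The sketch below does this with a regular expression; `strip_controls` is a hypothetical helper, and it assumes the annotations are always `{N}` or `[N]` with integer milliseconds, as in the two English examples.

```python
import re

# Matches a pause marker "[N]" or a duration marker "{N}".
CONTROL_RE = re.compile(r"\[\d+\]|\{\d+\}")


def strip_controls(text):
    """Remove duration and pause annotations, leaving the plain text."""
    return CONTROL_RE.sub("", text)


plain_scene_05 = strip_controls(
    "I said red{420} notebook,[160] not read{620} notebook."
)
plain_scene_06 = strip_controls(
    "She opened the door and whispered,[180] run{560}."
)
```

Both results match the plain sentences shown for Scene 05 and Scene 06, which confirms that the two conditions in each scene share the same underlying text.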

Spontaneous Duration Modeling

Spontaneous Duration Modeling Demos

Navigation Turn

No explicit target-side duration is provided; the model predicts timing on its own.

Kids Reading

No explicit target-side duration is provided; the model predicts timing on its own.

Accessibility Code

No explicit target-side duration is provided; the model predicts timing on its own.

Station Arrival

No explicit target-side duration is provided; the model predicts timing on its own.