This page demonstrates the performance of the Qwen2.5-Omni model, fine-tuned on the MNV-17 dataset, on ASR tasks that include nonverbal vocalization (NV) recognition.
🔍 Key Findings
Unseen Speaker Generalization
Crucial Note: All demo samples are from speakers who were completely unseen during training.
Because none of these speakers appeared in the training data, the results suggest that the model learned speaker-independent NV patterns rather than merely fitting the habits of specific speakers, that is, it generalizes across speakers.
The following are the predictions of the Qwen2.5-Omni model fine-tuned on the MNV-17 dataset for the demo samples:
Sample 1
Model Prediction: 这个理论的悖论之处在于 [cough] 请大家原谅,在于它的前提本身——哎呀这粉笔灰真是 [sneeze]——它的前提本身就包含了结论,这听起来是不是很 [chuckle] 荒谬。
Sample 2
Model Prediction: 我给自己泡了杯热茶 [hum],又从冰箱里拿出那块昨天没舍得吃的提拉米苏 [smack],然后窝在沙发里刷着搞笑视频 [laugh],这大概就是最简单的幸福吧。
Sample 3
Model Prediction: 他开着那辆新买的跑车 [whistle] 从我身边呼啸而过,故意在我面前停下 [cough] 做出一个自以为很帅的姿势,我当场 [clap] 就给他鼓了鼓掌,要多敷衍有多敷衍。
Sample 4
Model Prediction: 晚饭时场面一度非常热闹,我家的猫[hiss]不让新来的客人靠近,小侄女表演完节目后大家[applaud]热情鼓掌,而对胡椒粉过敏的爷爷则[sneeze]打起了喷嚏。
Sample 5
Model Prediction: 我一走进那个积满灰尘的阁楼 [sneeze] 就忍不住打了个大喷嚏,面对着堆积如山的杂物 [sigh] 真不知从何下手,但当我从旧箱子里翻出那张绝版黑胶唱片时,我简直想为自己的好运 [clap] 欢呼!
Sample 6
Model Prediction: 老教授讲到他年轻时的糗事,全班 [laugh] 都笑得前仰后合,可他一提到这周末的作业 [moan] 我的头就开始疼了,而且这教室的灰尘也太多了 [sneeze],真是让人受不了。
Sample 7
Model Prediction: 看着电影里主角牺牲的画面,影院里一片 [sniffle] 的啜泣声,我却在内心为他的壮举 [applaud] 喝彩,然后长叹一口气 [hum],思考着人性的复杂。
Sample 8
Model Prediction: 回想起那些独自奋斗的夜晚 [sigh],真不知道是怎么熬过来的,但当我昨天终于攻克那个难题时 [clap],我知道所有的付出都值得了,最终项目成功发布那一刻 [applaud],我感受到了前所未有的成就感。
Sample 9
Model Prediction: 他最终 [exhale] 放下了手中的旧照片,心想我们 [sigh] 终究还是走到了这一步,一念及此,又想起她最后那些话 [hiss],真是刻薄又伤人。
🌍 Out-of-domain & Cross-language Evidence
The following clips from audio_samples/out_of_domain and audio_samples/cross_language illustrate how the model fine-tuned on MNV-17 continues to produce consistent NV labels beyond its training distribution.
Out-of-domain clips
00039.wav
Model Prediction: 探出脑袋来,眼睛直勾勾地盯着大乌龟,韩非道 [ahem] 那个……嗨。
00178.wav
Model Prediction: 就是……然后后来就是完结了大概有一周的时间了吧,然后就开始回想,最初写《三分星野》的那些日子,[cough].
Cross-language clips
00185.wav
Model Prediction: [pant] Oh, so ghosts aren't scared of junkies, but loud noises give you a fright?
00604.wav
Model Prediction: Gale Arland has reported repeated encounters since 1960. Divorced, no children, lives alone… [cough]
0000_00059.wav
Model Prediction: [chuckle] Well, sometimes I go after people after they've faced the Shakespearean downfall and give them an opportunity to redeem themselves.
0000_00111.wav
Model Prediction: But maybe not. You never know. You know, I mean, look, you know [hum],
0000_00118.wav
Model Prediction: unfortunately sometimes fall for those sort of guys, hopefully I'm past that point. [laugh]
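The predictions above mark each nonverbal vocalization inline as a bracketed tag such as [cough] or [laugh]. For readers who want to tally or post-process these tags, here is a minimal Python sketch. It is not part of the model or of the MNV-17 tooling; the helper names `extract_nv_tags` and `count_nv_tags` are hypothetical, and the pattern assumes tags are lowercase ASCII words in square brackets, as in the samples on this page.

```python
import re
from collections import Counter

# Assumption: NV tags look like [cough], [laugh], ... (lowercase ASCII letters
# inside square brackets), matching the predictions shown on this page.
NV_TAG = re.compile(r"\[([a-z]+)\]")


def extract_nv_tags(prediction):
    """Return the NV labels in a prediction string, in order of appearance."""
    return NV_TAG.findall(prediction)


def count_nv_tags(predictions):
    """Count how often each NV label occurs across several predictions."""
    counts = Counter()
    for prediction in predictions:
        counts.update(extract_nv_tags(prediction))
    return counts


if __name__ == "__main__":
    # Two predictions copied from the samples above.
    samples = [
        "我当场 [clap] 就给他鼓了鼓掌,要多敷衍有多敷衍。",
        "unfortunately sometimes fall for those sort of guys, "
        "hopefully I'm past that point. [laugh]",
    ]
    for text in samples:
        print(extract_nv_tags(text))
    print(count_nv_tags(samples))
```

The same pattern works whether or not the tag is surrounded by spaces, so it handles both the `猫[hiss]不让` and the `全班 [laugh] 都` styles that appear in the samples.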