
SP Modules Review Contents (2)

杨丝儿
Published 2022-11-15 09:11:46
Included in the column: 杨丝儿的小站

Module 5 TTS front-end

We want to generate speech that is

  • Intelligible: you can clearly perceive what words are being said
  • Natural: sounds like human speech
  • Appropriate: conveys the right meaning in a specific context

A TTS system splits this task into two stages

  • Front-end: analyzes the text and derives a linguistic specification containing the information needed to generate speech
  • Back-end: generates the waveform from the linguistic specification

The linguistic specification guides what we generate. It describes the utterance at several levels

  • Phones
  • Syllables
  • Words
  • Phrases
  • Utterances
  • Discourses

Pronunciation dictionaries: use pre-existing pronunciation dictionaries to map words to phonetic transcriptions

  • CMUDict: one big text file of words and their pronunciations
    • CMUDict is dialect-specific: it encodes General American pronunciations, so it does not directly cover other accents such as Scottish English
  • Unilex is an ‘accent-independent’ lexicon based on the Unisyn database
    • Classifies phones by keywords, e.g. ‘FOOT’ vs ‘STRUT’ are keywords
      • ‘Put’ → FOOT class
      • ‘Putt’ → STRUT class
    • Use this to describe phonemic variation in English dialects/accents
    • A single lexicon encodes different accents: run the lexicon through accent-specific rules to produce accent-specific lexica

Phoneset choice

  • Unilex is more generalizable than CMUDict
  • Unilex is more compact: 1 base lexicon + rules
    • But we need to define rules to convert from one accent to another. This leads us to revisit the concept of the phoneme
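
The "1 base lexicon + rules" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the entries, the rule tables, and the names `BASE_LEXICON`, `ACCENT_RULES` and `build_accent_lexicon` are hypothetical and do not reproduce the real Unilex/Unisyn format. The base lexicon stores a keyword where the vowel would be, and each accent supplies its own keyword-to-vowel mapping.

```python
# Hypothetical base lexicon: a keyword (FOOT, STRUT) stands in for the vowel.
# Entries and symbols are illustrative only, not taken from Unilex.
BASE_LEXICON = {
    "put": ["p", "FOOT", "t"],
    "putt": ["p", "STRUT", "t"],
}

# Hypothetical accent rules: keyword -> vowel symbol used in that accent.
ACCENT_RULES = {
    "with_foot_strut_split": {"FOOT": "ʊ", "STRUT": "ʌ"},
    "without_foot_strut_split": {"FOOT": "ʊ", "STRUT": "ʊ"},
}


def build_accent_lexicon(base, rules):
    """Replace each keyword with the accent's vowel; keep other symbols as-is."""
    return {
        word: [rules.get(symbol, symbol) for symbol in symbols]
        for word, symbols in base.items()
    }


split = build_accent_lexicon(BASE_LEXICON, ACCENT_RULES["with_foot_strut_split"])
merged = build_accent_lexicon(BASE_LEXICON, ACCENT_RULES["without_foot_strut_split"])
print(split["put"], split["putt"])    # two different vowels
print(merged["put"], merged["putt"])  # the same vowel in both words
```

In an accent that distinguishes FOOT from STRUT, ‘put’ and ‘putt’ come out different. In an accent that merges the two classes, they come out identical, without any change to the base lexicon.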

We use a decision tree when we want to learn rules from data.
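
As a rough illustration of learning rules from data, here is a minimal sketch of a greedy, entropy-based tree learner in pure Python. Everything in it is an assumption made for the example: the toy task (is the letter "c" pronounced /k/ or /s/, given its neighbouring letters), the tiny hand-made training set, and the function names `entropy`, `best_question`, `grow` and `predict`. It is not the specific tree-building tool used in any particular TTS system.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_question(rows, labels):
    """Pick the yes/no question (feature == value) with the lowest weighted entropy."""
    best, best_score = None, entropy(labels)
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}):
            yes = [l for r, l in zip(rows, labels) if r[feature] == value]
            no = [l for r, l in zip(rows, labels) if r[feature] != value]
            if not yes or not no:
                continue
            score = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
            if score < best_score:
                best, best_score = (feature, value), score
    return best


def grow(rows, labels):
    """Recursively build a tree; a leaf is a label, a node is a 4-tuple."""
    question = best_question(rows, labels)
    if question is None:  # pure node, or no question helps: majority label
        return Counter(labels).most_common(1)[0][0]
    feature, value = question
    yes = [(r, l) for r, l in zip(rows, labels) if r[feature] == value]
    no = [(r, l) for r, l in zip(rows, labels) if r[feature] != value]
    return (feature, value,
            grow([r for r, _ in yes], [l for _, l in yes]),
            grow([r for r, _ in no], [l for _, l in no]))


def predict(tree, row):
    """Follow yes/no answers down the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, value, yes_branch, no_branch = tree
        tree = yes_branch if row[feature] == value else no_branch
    return tree


# Toy data: context of the letter "c" ("#" marks a word boundary) and its phone.
rows = [
    {"prev": "#", "next": "a"}, {"prev": "#", "next": "e"},
    {"prev": "#", "next": "i"}, {"prev": "#", "next": "o"},
    {"prev": "#", "next": "u"}, {"prev": "#", "next": "y"},
    {"prev": "i", "next": "e"}, {"prev": "a", "next": "o"},
    {"prev": "e", "next": "i"}, {"prev": "o", "next": "a"},
]
labels = ["k", "s", "s", "k", "k", "s", "s", "k", "s", "k"]

tree = grow(rows, labels)
print(predict(tree, {"prev": "#", "next": "e"}))
```

Each internal node asks one yes/no question about the context, so the learned tree can be read off as a set of human-readable rules.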

Module 6 Waveform generation

Diphone database requirements

  • Clean, clear recordings of a single speaker
  • Recordings of every possible diphone in the language
  • Phone segmentation (timings) to calculate where diphones start and end
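
To show how the phone timings are used, here is a small sketch that derives diphone boundaries from a phone segmentation. It assumes each diphone runs from the midpoint of one phone to the midpoint of the next, which is a common simplification. The function names `diphone_units` and `missing_diphones`, the phone labels, and the timings are all made up for the example.

```python
def diphone_units(segments):
    """Turn a phone segmentation into diphone units.

    segments: ordered list of (phone, start_sec, end_sec).
    Returns a list of (name, start_sec, end_sec), cutting at phone midpoints
    (an assumption of this sketch).
    """
    units = []
    for (left, l_start, l_end), (right, r_start, r_end) in zip(segments, segments[1:]):
        units.append((f"{left}-{right}", (l_start + l_end) / 2, (r_start + r_end) / 2))
    return units


def missing_diphones(phoneset, units):
    """Phone pairs with no recording yet, treating every ordered pair as possible."""
    wanted = {f"{a}-{b}" for a in phoneset for b in phoneset}
    return wanted - {name for name, _, _ in units}


# Made-up segmentation of one recording.
segments = [
    ("sil", 0.00, 0.10),
    ("p", 0.10, 0.18),
    ("uh", 0.18, 0.30),
    ("t", 0.30, 0.38),
    ("sil", 0.38, 0.50),
]

units = diphone_units(segments)
for name, start, end in units:
    print(f"{name}: {start:.2f}s to {end:.2f}s")
print(len(missing_diphones({"sil", "p", "uh", "t"}, units)), "pairs still unrecorded")
```

Treating all ordered phone pairs as required gives an upper bound. In practice some pairs may never occur in the language and can be dropped from the list.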

The most common use of lexical stress marking is for determining which syllable in a word a pitch accent will be placed on if that word is made prosodically prominent.
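
A minimal sketch of that lookup, assuming ARPAbet-style pronunciations as found in CMUDict, where every vowel carries a stress digit (1 = primary, 2 = secondary, 0 = unstressed) and each vowel counts as one syllable nucleus. The function name `primary_stress_syllable` is hypothetical.

```python
def primary_stress_syllable(phones):
    """Return the 0-based index of the syllable with primary stress, or None.

    Assumes each vowel ends in a stress digit and forms one syllable nucleus.
    """
    syllable = 0
    for phone in phones:
        if phone[-1].isdigit():      # only vowels carry a stress digit
            if phone[-1] == "1":     # primary stress
                return syllable
            syllable += 1
    return None


# "about": the pitch accent would go on the second syllable (index 1).
print(primary_stress_syllable(["AH0", "B", "AW1", "T"]))
```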

The Tone and Break Indices (ToBI) model of prosody aims to capture prosodic prominence (pitch accents), boundary tones, and the extent of prosodic breaks (break indices). It doesn’t try to capture pragmatic or affective content of speech such as speech acts or emotions.

Spectral smoothing, as the name suggests, reduces spectral discontinuity by making the change in the spectrum smoother across a join.
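
One crude way to picture this is linear interpolation of spectral-envelope parameters over a few frames on either side of the join. The sketch below assumes each frame is a plain list of such parameters. The function name `smooth_join`, the window width, and the numbers are all made up for illustration, and real systems may smooth in a different parameter domain.

```python
def smooth_join(frames, join, width):
    """Linearly interpolate frames around a join.

    frames: list of parameter vectors (lists of floats), left unit then right unit.
    join:   index of the first frame of the right-hand unit.
    width:  number of frames on each side of the join to smooth over.
    Returns a new list; the frames at the edges of the window stay unchanged.
    """
    lo = max(join - width, 0)
    hi = min(join + width - 1, len(frames) - 1)
    start, end = frames[lo], frames[hi]
    smoothed = [list(f) for f in frames]
    span = hi - lo
    for i in range(lo + 1, hi):
        w = (i - lo) / span  # 0 near the left edge, 1 near the right edge
        smoothed[i] = [(1 - w) * a + w * b for a, b in zip(start, end)]
    return smoothed


def largest_jump(frames):
    """Largest frame-to-frame change in the first parameter."""
    return max(abs(b[0] - a[0]) for a, b in zip(frames, frames[1:]))


# One-parameter toy frames with an abrupt step at the join (index 4).
frames = [[0.0]] * 4 + [[6.0]] * 4
smoothed = smooth_join(frames, join=4, width=3)
print(largest_jump(frames), largest_jump(smoothed))
```

The step at the join is spread over several frames, so no single frame-to-frame change is as large as the original discontinuity.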

Originally published 2022-10-31 on the author's personal site/blog.
