Fine-tuning Text to Speech with Azure Speech Service: Core Features | Exam AI-102

Question

You are tasked with fine-tuning text to speech output using the core features of the Azure Speech service.

Based on the requirements, you want to control certain aspects of the speech output, such as adjusting pitch or adding pauses.

This will improve the quality of the synthesized content.

You also want to define your own lexicons and switch between different speaking styles.

Given these requirements, which core feature of the text to speech service would you use?

Answers

A. Asynchronous synthesis of long audio

B. Neural voices

C. Visemes

D. Speech Synthesis Markup Language (SSML)

Explanations

Correct Answer: D.

Option A is incorrect because long audio synthesis, as the name suggests, is used for speech files that are longer than 10 minutes.

The audio is synthesized asynchronously rather than returned in real time.

Option B is incorrect because neural voices are humanlike voices that use deep neural networks to overcome the limitations of traditional speech synthesis.

They are used to synthesize speech for audiobooks, chatbots, and voice assistants.

Option C is incorrect because visemes provide a visual description of speech based on the positions of the face, lips, jaw, and tongue.

They are used in two-dimensional and three-dimensional models to match mouth movement to speech.

Option D is correct.

Speech Synthesis Markup Language, or SSML, is meant to tune the text to speech output.

Using SSML, you can change the speaking rate and volume, adjust pitch, and add pauses.

You can also define your own lexicons and switch between different speaking styles.
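To make the pitch, rate, volume, and pause controls concrete, here is a minimal sketch that builds an SSML document in Python and checks that it is well-formed XML. The helper name `prosody_ssml`, the voice name `en-US-JennyNeural`, and the attribute values are illustrative assumptions, not part of the exam question.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def prosody_ssml(first, second, voice="en-US-JennyNeural",
                 pitch="+10%", rate="-10%", volume="loud", pause="500ms"):
    """Return SSML that speaks `first` with adjusted prosody, pauses, then speaks `second`.

    Hypothetical helper; the default voice and values are assumptions for illustration.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<voice name={quoteattr(voice)}>'
        # <prosody> adjusts pitch, speaking rate, and volume for the enclosed text
        f'<prosody pitch={quoteattr(pitch)} rate={quoteattr(rate)} volume={quoteattr(volume)}>'
        f'{escape(first)}'
        '</prosody>'
        # <break> inserts a pause of the given duration
        f'<break time={quoteattr(pause)}/>'
        f'{escape(second)}'
        '</voice></speak>'
    )

ssml = prosody_ssml("Welcome to the course.", "Let's begin with Q&A.")
ET.fromstring(ssml)  # raises xml.etree.ElementTree.ParseError if the markup is malformed
print(ssml)
```

The resulting string is what you would hand to the service, for example through the Speech SDK's `speak_ssml_async` method on a configured synthesizer.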

The core feature of the Azure text to speech service that meets the given requirements is Speech Synthesis Markup Language (SSML).

SSML is an XML-based markup language that is used to control various aspects of speech synthesis, such as pronunciation, intonation, and volume. It provides flexible, fine-grained control over the synthesized speech output, which makes it the right choice for fine-tuning that output.

Using SSML, one can adjust the pitch, rate, and volume of the synthesized speech, add pauses, and insert specific pronunciations of words or phrases. SSML also enables defining custom lexicons, which can be used to improve the pronunciation of words that may be specific to a particular domain or context.
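A custom lexicon can be sketched as follows. The lexicon is a separate XML file in the W3C Pronunciation Lexicon Specification (PLS) format, and the SSML refers to it by URL with a `lexicon` element inside `voice`. The helper names, the `BTW` entry, and the `https://example.com/lexicon.xml` URL are placeholders assumed for illustration; in practice the file must be hosted somewhere the Speech service can reach.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def build_lexicon(aliases, locale="en-US"):
    """Build a PLS lexicon that maps written forms to spoken aliases (hypothetical helper)."""
    lexemes = "".join(
        f"<lexeme><grapheme>{escape(written)}</grapheme>"
        f"<alias>{escape(spoken)}</alias></lexeme>"
        for written, spoken in aliases.items()
    )
    return (
        '<lexicon version="1.0" '
        'xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" '
        f'alphabet="ipa" xml:lang={quoteattr(locale)}>{lexemes}</lexicon>'
    )

def ssml_with_lexicon(text, lexicon_uri, voice="en-US-JennyNeural"):
    """Build SSML whose voice element references a hosted lexicon file (hypothetical helper)."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<voice name={quoteattr(voice)}>'
        f'<lexicon uri={quoteattr(lexicon_uri)}/>'
        f'{escape(text)}'
        '</voice></speak>'
    )

lexicon_xml = build_lexicon({"BTW": "By the way"})
# Placeholder URL: assumes lexicon_xml has been uploaded there beforehand.
ssml = ssml_with_lexicon("BTW, the meeting moved to noon.", "https://example.com/lexicon.xml")

ET.fromstring(lexicon_xml)  # both documents should parse as well-formed XML
ET.fromstring(ssml)
print(lexicon_xml)
print(ssml)
```

Keeping pronunciations in a lexicon file rather than inline means the same domain-specific entries can be reused across many SSML documents.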

Additionally, SSML lets you switch between speaking styles, such as cheerful or newscast, for voices that support them. Within the same document you can also switch voices, for example between a male and a female voice, or change the spoken language for multilingual voices. This is particularly useful when generating speech content for a multilingual or international audience.
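In Azure's SSML dialect, the speaker is selected with the `voice` element, while the speaking style is set with the `mstts:express-as` extension element. The sketch below builds a document that uses two voices, one of them with a style. The helper `styled_ssml`, the voice names, and the `cheerful` style are assumptions for illustration; which styles are available depends on the voice.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"  # namespace for Azure's mstts extension elements

def styled_ssml(segments, locale="en-US"):
    """Build SSML from (voice, style, text) tuples; style may be None (hypothetical helper)."""
    parts = []
    for voice, style, text in segments:
        body = escape(text)
        if style is not None:
            # Wrap the text in express-as only when a speaking style is requested
            body = f"<mstts:express-as style={quoteattr(style)}>{body}</mstts:express-as>"
        parts.append(f"<voice name={quoteattr(voice)}>{body}</voice>")
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}" '
        f'xml:lang={quoteattr(locale)}>' + "".join(parts) + "</speak>"
    )

ssml = styled_ssml([
    ("en-US-JennyNeural", "cheerful", "Great news, your order has shipped!"),
    ("en-US-GuyNeural", None, "It should arrive within three days."),
])
ET.fromstring(ssml)  # raises ParseError if the markup is malformed
print(ssml)
```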

The other options do not meet the requirements:

  • Asynchronous Synthesis of Long Audio: This feature synthesizes long-form content, such as audio longer than 10 minutes, asynchronously. It determines how the audio is produced and delivered, not how the speech sounds.

  • Neural Voices: Neural voices are text to speech voices that use deep neural networks to generate speech. They provide high-quality output, but choosing a voice does not by itself give the fine-grained control that SSML provides.

  • Visemes: Visemes are visual representations of phonemes and are used for lip-sync animation. They describe mouth positions for the synthesized speech but do not change how it sounds, so they do not meet the given requirements.

In summary, SSML is the only option that covers every requirement: adjusting pitch, adding pauses, defining custom lexicons, and switching speaking styles.