Speech Cloud Documentation

Data types

Note

The API of the service might change at any time during the beta phase. We strongly encourage using one of the available client libraries to integrate with Speech Cloud.
See the Developer Guide for more information.

Input

Contains attributes describing the user input.

Data

Description

The text or SSML data, in UTF-8 encoding, that should be synthesized. The format of the data is determined by the Type attribute. The maximum length differs between POST and GET requests.

Type

string

Range

Length of the data between 1 and 8192 UTF-8 characters (POST) or 1 and 1024 UTF-8 characters (GET)
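
The length limits above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any Speech Cloud client library: the function name check_data_length and the MAX_CHARS table are hypothetical, and it assumes that "UTF-8 characters" means Unicode code points rather than encoded bytes.

```python
# Limits copied from the Range entry above. The names here are hypothetical.
MAX_CHARS = {"POST": 8192, "GET": 1024}


def check_data_length(data: str, method: str = "POST") -> bool:
    """Return True if `data` fits the documented limit for the HTTP method."""
    limit = MAX_CHARS[method.upper()]
    # len() on a Python str counts Unicode code points, not encoded bytes.
    # Assumption: this is how the service counts "UTF-8 characters".
    return 1 <= len(data) <= limit
```

If the service counts encoded bytes instead, replace len(data) with len(data.encode("utf-8")).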

Type

Description

The format of the data provided in the Data attribute. Currently, two values are supported: text/plain, where the text is read as is, and application/ssml+xml, where the data is interpreted as SSML 1.0 or SSML 1.1 with a few restrictions. IVONA Speech Cloud imposes the following limitations on SSML input:

  • Voice switching is not allowed

  • "repeatCount" attribute of "audio" element is limited to 5 times

  • "repeatDur" attribute of "audio" element is limited to 15 seconds

  • "time" attribute of "break" element is limited to 10 seconds

Type

string

Range

One of the following:

  • "text/plain"

  • "application/ssml+xml"
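
The SSML limitations listed in the Description above can be checked before submitting a request. The sketch below uses only the Python standard library. The function names are hypothetical, it treats any "voice" element as voice switching (an assumption), and it only understands time values written with an "s" or "ms" suffix.

```python
import xml.etree.ElementTree as ET


def _seconds(value: str) -> float:
    """Convert an SSML time designation such as "500ms" or "2s" to seconds."""
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError(f"unsupported time value: {value!r}")


def find_ssml_violations(ssml: str) -> list:
    """Return human-readable messages for each documented limit that is exceeded."""
    problems = []
    for elem in ET.fromstring(ssml).iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop an XML namespace, if present
        if tag == "voice":
            # Assumption: any <voice> element counts as voice switching.
            problems.append("voice switching is not allowed")
        elif tag == "audio":
            if float(elem.get("repeatCount", "1")) > 5:
                problems.append("audio repeatCount exceeds 5")
            if "repeatDur" in elem.attrib and _seconds(elem.attrib["repeatDur"]) > 15:
                problems.append("audio repeatDur exceeds 15 seconds")
        elif tag == "break":
            if "time" in elem.attrib and _seconds(elem.attrib["time"]) > 10:
                problems.append("break time exceeds 10 seconds")
    return problems
```

An empty list means that none of the four documented limits was exceeded. It does not mean that the document is valid SSML.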

OutputFormat

Contains attributes describing the output audio format and encoding type. The following table lists all currently available combinations of codec, audio coding format, sample rate, and speech mark support.

Table 1. Available codec, audio coding format, sample rate, and speech mark combinations

Codec   Audio Coding Format   Sample Rate (Hz)   Quality         Speech Mark Support
MP3     MP3                   22050              CBR 48kbit      No
MP4     MP3                   22050              CBR 48kbit      Yes
OGG     Vorbis                22050              VBR quality:2   No

Type: an OutputFormat object.

Codec

Description

The identifier of an audio encoding format.

Type

string

Range

One of the following:

  • "MP3"

  • "MP4"

  • "OGG"

  • "SPEECHMARK"

SampleRate

Description

The sample rate of the compressed audio in Hz.

Type

number

Range

Currently, only a sample rate of 22050 Hz is supported.
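
The Codec and SampleRate ranges above can be enforced when the OutputFormat object is assembled. This is a sketch under stated assumptions: make_output_format is a hypothetical helper, and representing the object as a dictionary keyed by attribute name is an assumption about how a client might hold it, not a description of the wire format.

```python
# Allowed values copied from the Range entries above.
CODECS = {"MP3", "MP4", "OGG", "SPEECHMARK"}
SAMPLE_RATES = {22050}

# Speech mark support per codec as listed in Table 1.
# "SPEECHMARK" does not appear in that table, so it is left out here.
SPEECH_MARK_SUPPORT = {"MP3": False, "MP4": True, "OGG": False}


def make_output_format(codec: str, sample_rate: int = 22050) -> dict:
    """Build an OutputFormat-like dict, rejecting values outside the documented ranges."""
    if codec not in CODECS:
        raise ValueError(f"unsupported codec: {codec!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate!r}")
    return {"Codec": codec, "SampleRate": sample_rate}
```

Codec values are compared exactly, so "mp3" is rejected. Whether the service itself is case-sensitive here is not stated in this reference.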

SpeechMarks

Speech marks are metadata describing the speech. There are four different types of speech marks:

  • Sentence - Describes a sentence to be spoken.

  • SSML - Describes a <mark> element from the SSML input to be spoken.

  • Viseme - Describes a viseme corresponding to a phoneme to be spoken.

  • Word - Describes a word to be spoken.

For more information on how to request speech marks and on their format, refer to the IVONA Speech Cloud Developer Guide.

Sentence

Description

If true, sentence speech marks are added to the output container. Sentence speech marks describe a sentence to be spoken.

Type

boolean

Range

One of the following:

  • true

  • false

SSML

Description

If true, SSML speech marks are added to the output container. SSML speech marks describe a <mark> element from the SSML input to be spoken.

Type

boolean

Range

One of the following:

  • true

  • false

Viseme

Description

If true, viseme speech marks are added to the output container. Viseme speech marks describe a viseme corresponding to a phoneme to be spoken.

Type

boolean

Range

One of the following:

  • true

  • false

Word

Description

If true, word speech marks are added to the output container. Word speech marks describe a word to be spoken.

Type

boolean

Range

One of the following:

  • true

  • false
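
The four boolean attributes above can be produced from a list of the speech mark types a caller wants. This is an illustrative sketch: make_speech_marks is a hypothetical helper, and holding the SpeechMarks object as a dictionary is an assumption.

```python
# Attribute names copied from the SpeechMarks section above.
SPEECH_MARK_TYPES = ("Sentence", "SSML", "Viseme", "Word")


def make_speech_marks(*enabled: str) -> dict:
    """Return a dict with one boolean per speech mark type."""
    unknown = set(enabled) - set(SPEECH_MARK_TYPES)
    if unknown:
        raise ValueError(f"unknown speech mark types: {sorted(unknown)}")
    # Every type is present in the result; unrequested types are set to False.
    return {name: name in enabled for name in SPEECH_MARK_TYPES}
```

For example, make_speech_marks("Word", "Sentence") enables word and sentence marks and leaves the SSML and viseme marks disabled.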

Parameters

Additional attributes affecting the generated speech.

Rate

Description

The speed of speech, expressed as an SSML prosody rate label. It changes how fast the text is read without affecting other voice characteristics (such as voice pitch). The values "default" and "medium" are equivalent.

Type

string

Range

One of the following:

  • x-slow

  • slow

  • medium

  • fast

  • x-fast

  • default

Additional notes

The following table lists the approximate changes of the speed of speech. These values are subject to change in the future.

Table 2. Prosody "rate" label mapping to relative speed change

"Rate" label       Percentage of default speed
x-slow             67%
slow               82%
medium, default    100%
fast               122%
x-fast             150%
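
The percentages in Table 2 can be used to estimate how a rate label changes playback time. The sketch below assumes that duration scales inversely with speed and ignores pauses, which may not scale the same way. The function name is hypothetical and the figures are the approximate values from the table, so the result is a rough estimate only.

```python
# Approximate values copied from Table 2. They are subject to change.
RATE_PERCENT = {
    "x-slow": 67,
    "slow": 82,
    "medium": 100,
    "default": 100,
    "fast": 122,
    "x-fast": 150,
}


def estimated_duration(default_seconds: float, rate: str) -> float:
    """Estimate the duration at `rate`, given the duration at the default rate."""
    # Speech that is 150% as fast takes 100/150 of the time.
    return default_seconds * 100.0 / RATE_PERCENT[rate]
```

Under these assumptions, a clip that lasts 30 seconds at the default rate lasts about 20 seconds at "x-fast".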

Volume

Description

The volume level of speech, expressed as an SSML prosody volume label. It changes the overall loudness of the synthesis without affecting other voice characteristics. The values "default" and "medium" are equivalent.

Type

string

Range

One of the following:

  • silent

  • x-soft

  • soft

  • medium

  • loud

  • x-loud

  • default

Additional notes

The following table lists the approximate changes of the volume of speech. These values are subject to change in the future.

Table 3. Prosody "volume" label mapping to relative volume change

"Volume" label     Percentage of default volume
silent             0%
x-soft             63%
soft               79%
medium, default    100%
loud               126%
x-loud             160%
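
The percentages in Table 3 can be converted to decibels for use with audio tools. The sketch below assumes that the percentages are amplitude ratios, which this reference does not state. Under that assumption the labels come out as steps of roughly 2 dB. The function name is hypothetical.

```python
import math

# Approximate values copied from Table 3. They are subject to change.
VOLUME_PERCENT = {
    "silent": 0,
    "x-soft": 63,
    "soft": 79,
    "medium": 100,
    "default": 100,
    "loud": 126,
    "x-loud": 160,
}


def volume_db(label: str) -> float:
    """Return the gain in dB relative to the default volume."""
    percent = VOLUME_PERCENT[label]
    if percent == 0:
        return float("-inf")  # silence has no finite dB value
    # Assumption: the percentage is an amplitude ratio, hence 20 * log10.
    return 20.0 * math.log10(percent / 100.0)
```

If the percentages are power ratios instead, the factor would be 10 rather than 20.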

SentenceBreak

Description

The pause (in milliseconds) after each sentence, except the last sentence of a paragraph, where the pause is set separately by ParagraphBreak.

Type

number

Range

Integer in the range of 0-3000 (in milliseconds)

ParagraphBreak

Description

The pause (in milliseconds) after each paragraph.

Type

number

Range

Integer in the range of 0-5000 (in milliseconds)
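
The two break settings can be validated against their ranges, and their combined effect on total pause time can be estimated. The following sketch uses hypothetical function names. It reads the descriptions above as meaning that the last sentence of each paragraph is followed by ParagraphBreak only, and that every paragraph, including the final one, is followed by a paragraph pause.

```python
# Ranges copied from the SentenceBreak and ParagraphBreak entries above.
BREAK_LIMITS_MS = {"SentenceBreak": (0, 3000), "ParagraphBreak": (0, 5000)}


def check_break(name: str, value) -> bool:
    """Return True if `value` is an integer inside the documented range."""
    low, high = BREAK_LIMITS_MS[name]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return low <= value <= high


def total_pause_ms(sentences_per_paragraph, sentence_break: int, paragraph_break: int) -> int:
    """Sum the pauses for a text given the sentence count of each paragraph."""
    total = 0
    for count in sentences_per_paragraph:
        # All sentences but the last get the sentence pause;
        # the last one is followed by the paragraph pause instead.
        total += max(count - 1, 0) * sentence_break + paragraph_break
    return total
```

For a text with paragraphs of 3 and 2 sentences, a SentenceBreak of 500 and a ParagraphBreak of 1000, the estimate is (2 × 500 + 1000) + (1 × 500 + 1000) = 3500 milliseconds.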

Voice

A list of properties of the voice to be used for the speech synthesis. All attributes are optional, and those that are specified are used to find the best match. If no matching voice is found, the VoiceNotFoundException error is returned.

Type: a Voice object.

Name

Description

The case-sensitive name of the TTS voice. The same voice talent behind a voice name can speak in different languages, so the Name alone does not uniquely identify the actual TTS voice.

Type

string

Range

Any voice name returned by the ListVoices action.

Language

Description

The case-sensitive language code of the TTS voice.

Type

string

Range

Language code according to the BCP47 recommendation (http://tools.ietf.org/html/bcp47) with lowercase language code and uppercase country/region codes. The complete list of available voices with their language codes is returned by the ListVoices action.

Gender

Description

The case-sensitive gender of the TTS voice.

Type

string

Range

One of the following:

  • Female

  • Male
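
The effect of the optional, case-sensitive Voice attributes can be illustrated with a client-side filter over a voice list. This is a sketch only: the service's actual best-match logic is not described in this reference, so the code models it as an exact match on every specified attribute. The function name and the voice entries used with it are hypothetical, and LookupError stands in for VoiceNotFoundException.

```python
def match_voices(voices, **wanted):
    """Return the voices whose attributes equal every specified value.

    `voices` is a list of dicts with "Name", "Language" and "Gender" keys.
    Comparison is case-sensitive, as stated for each attribute above.
    """
    matches = [
        voice for voice in voices
        if all(voice.get(key) == value for key, value in wanted.items())
    ]
    if not matches:
        # Stand-in for the VoiceNotFoundException error returned by the service.
        raise LookupError(f"no voice matches {wanted}")
    return matches
```

Because one Name can cover several languages, filtering on Name alone may return more than one voice. Adding Language narrows the result.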

Lexicon

A Pronunciation Lexicon Specification (PLS) document used to set pronunciation and substitution rules for the synthesis. Every lexicon is tied to a specific language (see the W3C reference) and is identified by a unique name.

Name

Description

A user-specified name that uniquely identifies the lexicon.

Type

string

Range

A UTF-8 string with no more than 10 alphanumeric characters. It cannot contain whitespace or any other special characters.

Contents

Description

The contents of the lexicon in Pronunciation Lexicon Specification (PLS) format.

Type

string

Range

No more than 4096 bytes. Must be a valid PLS document.
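
The Name and Contents ranges above can be checked before a lexicon is uploaded. The sketch below uses a hypothetical function name and makes three assumptions: a name must contain at least one character, "alphanumeric" is what Python's str.isalnum() accepts, and the size limit applies to the UTF-8 encoded contents. It only verifies that the contents are well-formed XML, which is weaker than being a valid PLS document.

```python
import xml.etree.ElementTree as ET


def check_lexicon(name: str, contents: str) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    # Assumption: at least one character, and isalnum() defines "alphanumeric".
    if not (1 <= len(name) <= 10 and name.isalnum()):
        problems.append("Name must be 1 to 10 alphanumeric characters")
    encoded = contents.encode("utf-8")
    if len(encoded) > 4096:
        problems.append("Contents exceeds 4096 bytes")
    try:
        # Well-formedness only; this does not validate against the PLS schema.
        ET.fromstring(encoded)
    except ET.ParseError:
        problems.append("Contents is not well-formed XML")
    return problems
```

A full check would also validate the document against the PLS schema, which needs a schema-aware XML library.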

 
Copyright © 2015 IVONA Software. All rights reserved.