IVONA For Developers

Develop with IVONA Text-to-Speech.

2. System concepts

 

Account
The account of a registered IVONA.com user. An account with an active IVONA TTS SaaS service is required to use the API. Registration (the creation of new accounts) is not currently available through the API; new accounts can be created on the registration page of the IVONA website (https://secure.ivona.com/account/register.php). Each account is identified by a string: the user's email address. In addition, the SpeechCloud service uses an API key, which can be generated at https://secure.ivona.com/account/apikey.php and is used together with the email address to authorize requests.

Speech File
A sound file generated by the IVONA TTS SaaS text-to-speech process from UTF-8 encoded text supplied by the user. Besides the text, generation depends on additional supplied parameters: the voice that will read the text, the codec that determines the output format and sound quality, and sound parameters that modify the speech as desired (changing its speed or volume, adjusting other sound properties, or setting ID3 tags for MP3 files). All speech file data is stored in the database and can be accessed only by its owner. Each speech file is identified by a unique file identifier. Downloading a speech file decreases the number of characters available in the SaaS service of the active user account.

Text
The text uploaded by the user with the createSpeechFile() method. The text should be UTF-8 encoded, and its MIME type should be selected from the list of available content types. The text is stored in the IVONA.com website database and can be accessed and deleted only by its owner (the uploader).

Table 2. Available content types

| content type | description |
|---|---|
| text/plain | The text will be parsed by the pronunciation rules and then read as is. |
| text/html | The text will be converted from HTML to plain text (all tags will be removed or replaced by pauses, making the text suitable for reading). After the conversion is completed, the pronunciation rules will be applied. |
| text/ssml | The text will be interpreted as SSML 1.1 and validated against the SSML 1.1 basic schema (http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis). All SSML elements are interpreted except <audio> and <lexicon>, which are ignored. The pronunciation rules should work unless they make the SSML document invalid. |
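The two requirements on uploaded text (UTF-8 encoding and a content type from Table 2) can be checked on the client before calling createSpeechFile(). The following Python sketch is only an illustration: the helper name `prepare_text` is hypothetical and not part of the IVONA API, and the way the resulting bytes are passed to createSpeechFile() is left out because the request format is not described in this section.

```python
# Content types listed in Table 2.
ALLOWED_CONTENT_TYPES = {"text/plain", "text/html", "text/ssml"}


def prepare_text(text: str, content_type: str = "text/plain") -> bytes:
    """Validate the content type and return the text encoded as UTF-8.

    Hypothetical client-side helper; it is not an IVONA API call.
    """
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise ValueError(f"unsupported content type: {content_type!r}")
    return text.encode("utf-8")


if __name__ == "__main__":
    payload = prepare_text("Zażółć gęślą jaźń", "text/plain")
    print(len(payload), "bytes ready for upload")
```

Rejecting an unknown content type locally avoids a round trip to the service; the service itself remains the authority on what it accepts.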

Voice
A single text-to-speech synthesiser selected to process the text. Only one voice can be selected for a single speech file. The voice is identified by a voice identifier parameter. The following voices are currently available:

Table 3. Available voices list

| voice id | voice name | voice language | voice gender |
|---|---|---|---|
| us_eric | Eric | American English | male |
| us_jennifer | Jennifer | American English | female |
| us_joey | Joey | American English | male |
| us_kendra | Kendra | American English | female |
| us_kimberly | Kimberly | American English | female |
| us_chipmunk | Chipmunk | American English | none |
| us_salli | Salli | American English | female (teenager) |
| us_ivy | Ivy | American English | female (child) |
| au_nicole | Nicole | Australian English | female |
| es_us_penelope | Penelope | American Spanish | female |
| es_us_miguel | Miguel | American Spanish | male |
| gb_amy | Amy | British English | female |
| gb_brian | Brian | British English | male |
| gb_emma | Emma | British English | female |
| en_wls_geraint | Geraint | Welsh English | male |
| en_wls_gwyneth | Gwyneth | Welsh English | female |
| cy_geraint | Geraint | Welsh | male |
| cy_gwyneth | Gwyneth | Welsh | female |
| de_marlene | Marlene | German | female |
| de_hans | Hans | German | male |
| es_conchita | Conchita | Castilian Spanish | female |
| es_enrique | Enrique | Castilian Spanish | male |
| fr_mathieu | Mathieu | French | male |
| fr_celine | Celine | French | female |
| pl_ewa | Ewa | Polish | female |
| pl_jacek | Jacek | Polish | male |
| pl_jan | Jan | Polish | male |
| pl_maja | Maja | Polish | female |
| ro_carmen | Carmen | Romanian | female |

Pronunciation Rules
A table of rules (simple text substitutions and regular-expression substitutions) used to preprocess uploaded texts before they are processed (synthesised) by a voice. The main purpose of pronunciation rules is to improve the pronunciation of specific words that the selected voice reads differently from what was intended (especially abbreviations, foreign words, etc.), or to remove parts of a text (specific sections, symbols, etc.) that should not be heard in the spoken version. There are two types of pronunciation rules: internal pronunciation rules, which are part of IVONA TTS SaaS (covering the pronunciation of the most popular abbreviations, foreign names, and specific grammatical constructions) and are always applied to the uploaded text, and user pronunciation rules, which can be added by a user and are visible only to their owner and to the IVONA TTS SaaS engine. Every pronunciation rule is assigned to a specific language. When a speech file is generated with a voice intended for a specific language (for example, Brian for English), the user pronunciation rules created for that language are applied automatically BEFORE the internal pronunciation rules. The character price of a single download of the speech file is determined AFTER the text has been processed with the pronunciation rules. Pronunciation rules are grouped by the following languages in which voices are available:

Table 4. Available languages for pronunciation rules sets

| language id | language name | list of voices assigned |
|---|---|---|
| en | English | us_chipmunk, us_jennifer, us_eric, us_kendra, us_joey, us_kimberly, us_salli, us_ivy, gb_amy, gb_brian, gb_emma, au_nicole, en_wls_geraint, en_wls_gwyneth |
| pl | Polish | pl_ewa, pl_maja, pl_jacek, pl_jan |
| ro | Romanian | ro_carmen |
| de | German | de_hans, de_marlene |
| es | Spanish | es_conchita, es_enrique, es_us_miguel, es_us_penelope |
| fr | French | fr_celine, fr_mathieu |
| cy | Welsh | cy_geraint, cy_gwyneth |
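The ordering described above (the voice selects a language, user rules for that language run first, internal rules run afterwards) can be sketched in Python as follows. This is a local illustration under stated assumptions, not the IVONA engine: the rule representation as `(kind, pattern, replacement)` tuples, the function names, and the sample rules are all hypothetical, and the real internal rules are not published in this section. Only the voice-to-language mapping is taken from Table 4.

```python
import re

# Language id -> assigned voices, copied from Table 4.
LANGUAGE_VOICES = {
    "en": ["us_chipmunk", "us_jennifer", "us_eric", "us_kendra", "us_joey",
           "us_kimberly", "us_salli", "us_ivy", "gb_amy", "gb_brian",
           "gb_emma", "au_nicole", "en_wls_geraint", "en_wls_gwyneth"],
    "pl": ["pl_ewa", "pl_maja", "pl_jacek", "pl_jan"],
    "ro": ["ro_carmen"],
    "de": ["de_hans", "de_marlene"],
    "es": ["es_conchita", "es_enrique", "es_us_miguel", "es_us_penelope"],
    "fr": ["fr_celine", "fr_mathieu"],
    "cy": ["cy_geraint", "cy_gwyneth"],
}

# Inverted lookup: voice id -> language id of its pronunciation rule set.
VOICE_LANGUAGE = {
    voice: lang for lang, voices in LANGUAGE_VOICES.items() for voice in voices
}


def apply_rules(text, rules):
    """Apply rules in order; each rule is (kind, pattern, replacement).

    kind is "text" for a simple substitution or "regex" for a regular
    expression substitution (an assumed representation).
    """
    for kind, pattern, replacement in rules:
        if kind == "regex":
            text = re.sub(pattern, replacement, text)
        else:
            text = text.replace(pattern, replacement)
    return text


def preprocess(text, voice_id, user_rules, internal_rules):
    """Run user rules BEFORE internal rules for the voice's language."""
    lang = VOICE_LANGUAGE[voice_id]
    text = apply_rules(text, user_rules.get(lang, []))
    return apply_rules(text, internal_rules.get(lang, []))


if __name__ == "__main__":
    # Sample rules invented for this illustration.
    user_rules = {"en": [("text", "IVONA", "ee-vo-na"),
                         ("regex", r"\[\d+\]", "")]}
    internal_rules = {"en": [("text", "etc.", "et cetera")]}
    print(preprocess("IVONA reads abbreviations etc.[1]", "gb_brian",
                     user_rules, internal_rules))
```

Because the character price is determined after rule processing, the length of the string returned by a step like `preprocess`, rather than the length of the original upload, is what would matter for pricing.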

Codec
The name of the audio codec used to generate the speech file. The codec name is supplied among the parameters of the createSpeechFile() method. Several codecs are currently available through the API:

Table 5. The list of available codecs

| codec id | codec description |
|---|---|
| mp3/22050 | MP3, 64 kbit/s, 22.05 kHz |
| ogg/22050 | OGG, 45 kbit/s, 22.05 kHz |
| pcm16/22050* | Uncompressed wav file, 16 bit, 22.05 kHz |
| pcm16/8000* | Uncompressed wav file, 16 bit, 8 kHz |
| alaw/8000* | Wav companded with A-law algorithm (for telecom purposes) |
| ulaw/8000* | Wav companded with µ-law algorithm (for telecom purposes) |

(*) Non-streamable formats are available on demand; contact sales@ivona.com
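A client can keep the codec list from Table 5 as a small lookup table, for example to reject unknown codec ids before a request is made. The Python sketch below assumes that the asterisk in the table is only a footnote marker and not part of the codec id; the names `Codec`, `CODECS`, and `describe_codec` are hypothetical.

```python
from typing import NamedTuple


class Codec(NamedTuple):
    description: str
    on_demand: bool  # True for the non-streamable formats marked (*)


# Codec ids and descriptions copied from Table 5.
CODECS = {
    "mp3/22050": Codec("MP3, 64 kbit/s, 22.05 kHz", False),
    "ogg/22050": Codec("OGG, 45 kbit/s, 22.05 kHz", False),
    "pcm16/22050": Codec("Uncompressed wav file, 16 bit, 22.05 kHz", True),
    "pcm16/8000": Codec("Uncompressed wav file, 16 bit, 8 kHz", True),
    "alaw/8000": Codec("Wav companded with A-law algorithm", True),
    "ulaw/8000": Codec("Wav companded with µ-law algorithm", True),
}


def describe_codec(codec_id: str) -> Codec:
    """Return the table entry for a codec id, or raise ValueError."""
    try:
        return CODECS[codec_id]
    except KeyError:
        raise ValueError(f"unknown codec id: {codec_id!r}") from None


if __name__ == "__main__":
    for codec_id in ("mp3/22050", "alaw/8000"):
        codec = describe_codec(codec_id)
        note = " (available on demand)" if codec.on_demand else ""
        print(f"{codec_id}: {codec.description}{note}")
```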

Sound file parameters
Parameters affecting the format of the speech file. They can, for example, change the audio speed, volume, pitch, and other sound properties, or set specific values for the ID3v2 tags of a file. All parameters are optional and have default values set by IVONA TTS SaaS. The list of available parameters is growing, and new ones will be added in the future. The following parameters are currently available:

Table 6. Sound file parameters list

BASIC PARAMETERS

| parameter name | parameter description | parameter value range | default value | additional info |
|---|---|---|---|---|
| Prosody-Volume | the volume of the recording as a percentage of the original volume of the voice | 0-100 | 100 | this parameter changes only the default volume used in the sound encoding process; the volume can be changed further by the sound player or device where the file will be installed |
| Prosody-Rate | the speed of the recording as a percentage of the original speed of the voice | 50-200 | 100 | this parameter can be useful in solutions aimed at visually impaired people (accustomed to a higher speech speed) or in foreign language learning solutions (which a slower speed suits better) |
| Sentence-Break | the pause between sentences in milliseconds | 0-3000 | 400 | this parameter can be useful in solutions intended to dictate texts to their receivers |
| Paragraph-Break | the pause between paragraphs (separated by empty lines in the uploaded text) in milliseconds | 0-5000 | 650 | this parameter can be useful in solutions based on splitting speech into separate blocks |

ID3v2 TAGS SET FOR MP3 FILES

| parameter name | parameter description | interpreted by IVONA.com Flash Player? | default value (if not set by user) | value example |
|---|---|---|---|---|
| Id3v2-TIT2 | Frame TIT2 in ID3v2.4 | yes (will show the name of a file in modes 1 and 2 of the player) | - | my speech file |
| Id3v2-TPE1 | Frame TPE1 in ID3v2.4 | yes (will show the author of a file in modes 1 and 2 of the player) | www.ivona.com | John Smith |
| Id3v2-TPE3 | Frame TPE3 in ID3v2.4 | yes (will link the name of a file in modes 1 and 2 of the player) | - | http://hostname/somepage |
| Id3v2-TPE4 | Frame TPE4 in ID3v2.4 | yes (will show the image assigned to the file in modes 1 and 2 of the player) | - | http://hostname/imagepath |
| Id3v2-TDTG | Frame TDTG in ID3v2.4 | no | (the time of file encoding) | 2010-02-01T12:00:05 |
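Since every basic parameter in Table 6 is optional and has a default, a client can merge user-supplied values over the defaults and check them against the documented ranges before sending a request. The Python sketch below shows one way to do this. It assumes that the ranges are inclusive and that values are integers, neither of which is stated in the table, and the function name `resolve_parameters` is hypothetical.

```python
# name -> (minimum, maximum, default), copied from the basic parameters
# in Table 6. Treating the bounds as inclusive is an assumption.
BASIC_PARAMETERS = {
    "Prosody-Volume": (0, 100, 100),      # percent of original volume
    "Prosody-Rate": (50, 200, 100),       # percent of original speed
    "Sentence-Break": (0, 3000, 400),     # milliseconds
    "Paragraph-Break": (0, 5000, 650),    # milliseconds
}


def resolve_parameters(overrides=None):
    """Return all basic parameters, using defaults where none is given."""
    resolved = {
        name: default for name, (_, _, default) in BASIC_PARAMETERS.items()
    }
    for name, value in (overrides or {}).items():
        if name not in BASIC_PARAMETERS:
            raise KeyError(f"unknown parameter: {name!r}")
        low, high, _ = BASIC_PARAMETERS[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be in {low}-{high}, got {value}")
        resolved[name] = value
    return resolved


if __name__ == "__main__":
    # Slower speech with longer sentence pauses, e.g. for dictation.
    print(resolve_parameters({"Prosody-Rate": 80, "Sentence-Break": 1500}))
```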

Sound effects
Additional sound effects can be added on special request. Contact us at sales@ivona.com to arrange a separate agreement on creating a modified voice.

Characters price
The “price” of downloading a file, deducted from the user’s account. When a user activates an IVONA TTS SaaS service on their account, a specific number of characters is added to that account. The number of characters added depends on the type of agreement the user has signed with the IVONA.com sales department (for trial services this number is standardized; see http://www.ivona.com/saas.php for details). For each download of a speech file, the number of characters calculated by IVONA TTS SaaS is deducted from the user’s account. This price depends on the size of the uploaded text after it has been processed with the pronunciation rules. The user can always check the price of a specific text with the checkPrice() API method. Every subsequent download of a speech file deducts that file’s character price from the user’s account again.
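The accounting described above can be modelled with a few lines of Python. The sketch is a client-side illustration only: the class name `CharacterBalance` is hypothetical, the per-file price is passed in as a number (in practice it would come from checkPrice()), and refusing a download when the balance is too low is an assumption, since this section does not say how the service behaves in that case.

```python
class CharacterBalance:
    """Tracks the characters left on an account (illustrative model)."""

    def __init__(self, characters: int) -> None:
        self.characters = characters

    def download(self, file_price: int) -> int:
        """Deduct a file's character price for one download.

        Every download of the same file deducts the price again.
        Raising on an insufficient balance is an assumption.
        """
        if file_price > self.characters:
            raise ValueError("not enough characters left on the account")
        self.characters -= file_price
        return self.characters


if __name__ == "__main__":
    balance = CharacterBalance(10_000)
    file_price = 120  # e.g. a value reported by checkPrice()
    for _ in range(3):  # three downloads of the same speech file
        balance.download(file_price)
    print(balance.characters)  # 10000 - 3 * 120 = 9640
```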