IVONA For Developers

Develop with IVONA Text-to-Speech.

2. System concepts

 

Account
Account of a registered IVONA.com user. An account with an active IVONA TTS SaaS service is required to use the API. Registration (creating new accounts) is not currently available through the API; new accounts can be created on the registration page of the IVONA website: https://secure.ivona.com/account/register.php. Each account is identified by a string: the user's email address. In addition, the SpeechCloud service uses an API Key, which can be generated at https://secure.ivona.com/account/apikey.php and is used together with the email in the request authorization process.
Speech File
Sound file generated by the IVONA TTS SaaS text-to-speech process from UTF-8 encoded text supplied by the user. Besides the text, the speech file is generated according to additional supplied parameters: the voice that will read the text, the codec that determines the output format and sound quality, and additional sound parameters that modify the speech as desired (changing its speed or volume, modifying other sound properties, or setting ID3 tags in the case of MP3 files). All speech file data is stored in the database and can be accessed only by its owner. Each speech file is identified by a unique file identifier. Downloading a speech file decreases the number of characters available in the user's active SaaS service.
Text
The text uploaded by the user with the createSpeechFile() method. The text should be UTF-8 encoded, and its MIME type should be selected from the list of available content types. The text is stored in the IVONA.com website database and can be accessed and deleted only by its owner (the uploader).
Table 2. Available content types
content type | description
text/plain | The text will be parsed by the pronunciation rules and then read as is.
text/html | The text will be converted from HTML to plain text (all tags will be removed or replaced by pauses, making the text suitable for reading). After the conversion is completed, the pronunciation rules will be applied.
text/ssml | The text will be interpreted as SSML 1.1 and validated against the SSML 1.1 basic schema (http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis). All SSML elements except <audio> and <lexicon> will be interpreted; those two will be ignored. The pronunciation rules should still apply, except where they would make the SSML document invalid.
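
As a rough illustration of the text/ssml row above, the following Python sketch lists the elements in an SSML document that, according to the table, would be ignored (<audio> and <lexicon>). It uses only the standard library XML parser and checks well-formedness, not the SSML 1.1 schema. The function name ignored_ssml_elements and the sample document are hypothetical and not part of the IVONA API.

```python
import xml.etree.ElementTree as ET

# Elements that Table 2 says are ignored for text/ssml input.
IGNORED_ELEMENTS = {"audio", "lexicon"}


def ignored_ssml_elements(ssml_text: str) -> list[str]:
    """Return the local names of elements that would be ignored.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed XML.
    This is a well-formedness check only, not SSML 1.1 schema validation.
    """
    root = ET.fromstring(ssml_text.encode("utf-8"))
    found = []
    for element in root.iter():
        # ElementTree writes namespaced tags as "{namespace}local-name".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in IGNORED_ELEMENTS:
            found.append(local_name)
    return found


# Hypothetical sample document (the audio URL is a placeholder).
sample = (
    '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="en-US">'
    'Hello <break time="300ms"/> world.'
    '<audio src="http://hostname/somefile.wav">fallback text</audio>'
    "</speak>"
)
print(ignored_ssml_elements(sample))  # ['audio']
```

A check like this could be run before upload to warn the user that part of the document will not be spoken.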

Voice
A single text-to-speech synthesiser selected to process the text. Only one voice can be selected for a single speech file. The voice is identified by a voice identifier parameter. The following voices are currently available:
Table 3. Available voices list
voice id | voice language | voice gender | voice name
en_us_salli | American English | female (teenager) | Salli
en_us_ivy | American English | female (child) | Ivy
en_au_nicole | Australian English | female | Nicole
en_us_kimberly | American English | female | Kimberly
en_us_kendra | American English | female | Kendra
en_us_jennifer | American English | female | Jennifer
en_us_joey | American English | male | Joey
en_us_eric | American English | male | Eric
en_us_chipmunk | American English | none | Chipmunk
en_gb_emma | British English | female | Emma
en_gb_amy | British English | female | Amy
en_gb_brian | British English | male | Brian
en_au_russell | Australian English | male | Russell
ru_tatyana | Russian | female | Tatyana
es_us_penelope | American Spanish | female | Penélope
es_us_miguel | American Spanish | male | Miguel
pt_br_ricardo | Brazilian Portuguese | male | Ricardo
pt_br_vitoria | Brazilian Portuguese | female | Vitória
en_wls_geraint | Welsh English | male | Geraint
en_wls_gwyneth | Welsh English | female | Gwyneth
cy_geraint | Welsh | male | Geraint
cy_gwyneth | Welsh | female | Gwyneth
de_marlene | German | female | Marlene
de_hans | German | male | Hans
fr_celine | French | female | Céline
fr_mathieu | French | male | Mathieu
fr_ca_chantal | Canadian French | female | Chantal
it_giorgio | Italian | male | Giorgio
it_carla | Italian | female | Carla
es_conchita | Castilian Spanish | female | Conchita
es_enrique | Castilian Spanish | male | Enrique
nl_lotte | Dutch | female | Lotte
nl_ruben | Dutch | male | Ruben
da_naja | Danish | female | Naja
da_mads | Danish | male | Mads
is_dora | Icelandic | female | Dóra
is_karl | Icelandic | male | Karl
pl_agnieszka | Polish | female | Agnieszka
pl_maja | Polish | female | Maja
pl_ewa | Polish | female | Ewa
pl_jacek | Polish | male | Jacek
pl_jan | Polish | male | Jan
ro_carmen | Romanian | female | Carmen
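
A client application may want to pick a voice identifier from Table 3 by language and gender. The sketch below shows one way to do that in Python with a small excerpt of the table. The Voice type and the find_voice_ids helper are hypothetical names introduced for illustration; they are not part of the IVONA API.

```python
from typing import NamedTuple


class Voice(NamedTuple):
    voice_id: str
    language: str
    gender: str
    name: str


# A small excerpt of Table 3; extend it with the remaining rows as needed.
VOICES = [
    Voice("en_us_salli", "American English", "female (teenager)", "Salli"),
    Voice("en_us_joey", "American English", "male", "Joey"),
    Voice("en_gb_amy", "British English", "female", "Amy"),
    Voice("en_gb_brian", "British English", "male", "Brian"),
    Voice("pl_ewa", "Polish", "female", "Ewa"),
    Voice("pl_jan", "Polish", "male", "Jan"),
]


def find_voice_ids(language: str, gender: str) -> list[str]:
    """Return ids of voices for a language whose gender column starts with `gender`.

    Only the first word of the gender column is compared, so
    "female (teenager)" matches "female".
    """
    return [
        v.voice_id
        for v in VOICES
        if v.language == language and v.gender.split()[0] == gender
    ]


print(find_voice_ids("British English", "male"))     # ['en_gb_brian']
print(find_voice_ids("American English", "female"))  # ['en_us_salli']
```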
Pronunciation Rules
A table of rules (simple text substitutions and regular expression substitutions) for preprocessing uploaded texts before they are processed (synthesised) by a voice. Pronunciation rules are mainly used to improve the pronunciation of specific words that the selected voice reads differently from what is intended (especially abbreviations, foreign words, etc.), or to remove parts of the text (specific sections, symbols, etc.) that should not be heard in the spoken output. There are two types of pronunciation rules: internal pronunciation rules, which are part of IVONA TTS SaaS (covering the pronunciation of the most popular abbreviations, foreign names, and specific grammatical constructions) and are always applied to the uploaded text, and user pronunciation rules, which can be added by the user and are visible only to their owner and the IVONA TTS SaaS engine. Every pronunciation rule is assigned to a specific language. When a speech file is generated with a voice intended for a specific language (for example, Brian for English), the user pronunciation rules created for that language are applied automatically BEFORE the internal pronunciation rules. The character price of a single download of the speech file is determined AFTER the text has been processed with the pronunciation rules. Pronunciation rules are grouped by the following languages in which voices are available:
Table 4. Available languages for pronunciation rules sets
language id | language name | list of voices assigned
en | English | us_chipmunk, us_jennifer, us_eric, us_kendra, us_joey, us_kimberly, us_salli, us_ivy, gb_amy, gb_brian, gb_emma, au_nicole, en_wls_geraint, en_wls_gwyneth
pl | Polish | pl_ewa, pl_maja, pl_jacek, pl_jan
ro | Romanian | ro_carmen
de | German | de_hans, de_marlene
es | Spanish | es_conchita, es_enrique, es_us_miguel, es_us_penelope
fr | French | fr_celine, fr_mathieu
cy | Welsh | cy_geraint, cy_gwyneth
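
The order of rule application described above (user rules first, then internal rules) can be sketched in Python as follows. This is only an illustration under stated assumptions: the Rule type, the preprocess function, and all rules shown are made up for the example, and the actual rule format and internal rules used by IVONA TTS SaaS are not described here.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    pattern: str
    replacement: str
    is_regex: bool = False  # False: simple text substitution

    def apply(self, text: str) -> str:
        if self.is_regex:
            return re.sub(self.pattern, self.replacement, text)
        return text.replace(self.pattern, self.replacement)


def preprocess(text, user_rules, internal_rules):
    """Apply the user rules first, then the internal rules."""
    for rule in list(user_rules) + list(internal_rules):
        text = rule.apply(text)
    return text


# Hypothetical rule sets for the language id "en".
user_rules_en = [
    Rule("IVONA", "ee-vo-na"),             # simple text substitution
    Rule(r"\[\d+\]", "", is_regex=True),   # remove footnote markers like [3]
]
internal_rules_en = [
    Rule(r"\bDr\.", "Doctor", is_regex=True),  # expand an abbreviation
]

source = "Dr. Smith uses IVONA daily[3]."
processed = preprocess(source, user_rules_en, internal_rules_en)
print(processed)  # Doctor Smith uses ee-vo-na daily.
```

Because the character price is determined after this step, rules that expand or remove text change the length of the text that is priced. How the service counts characters exactly is not specified in this section.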

Codec
The name of the audio codec used to generate the speech file. The codec name is supplied as one of the parameters of the createSpeechFile() method. Several codecs are currently available through the API:
Table 5. The list of available codecs
codec id | codec description
mp3/22050 | MP3, 64 kbit/s, 22.05 kHz
ogg/22050 | OGG, 45 kbit/s, 22.05 kHz
pcm16/22050* | Uncompressed wav file, 16 bit, 22.05 kHz
pcm16/8000* | Uncompressed wav file, 16 bit, 8 kHz
alaw/8000* | Wav companded with A-law algorithm (for telecom purposes)
ulaw/8000* | Wav companded with µ-law algorithm (for telecom purposes)

(*) Non-streamable formats are available on demand – contact: sales@ivona.com
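
The codec ids in Table 5 appear to follow an encoding/sample-rate pattern. The Python sketch below splits an id on that assumption. Reading the number after the slash as a sample rate in Hz is inferred from the descriptions (22050 for 22.05 kHz, 8000 for 8 kHz), and the asterisk is treated as a footnote marker rather than part of the id. The Codec type and parse_codec_id function are hypothetical helpers, not part of the IVONA API.

```python
from typing import NamedTuple


class Codec(NamedTuple):
    encoding: str
    sample_rate_hz: int


# Codec ids from Table 5; those marked (*) there are available on demand.
AVAILABLE_CODECS = {
    "mp3/22050", "ogg/22050",
    "pcm16/22050", "pcm16/8000", "alaw/8000", "ulaw/8000",
}


def parse_codec_id(codec_id: str) -> Codec:
    """Split a codec id such as "mp3/22050" into encoding and sample rate."""
    if codec_id not in AVAILABLE_CODECS:
        raise ValueError(f"unknown codec id: {codec_id!r}")
    encoding, rate = codec_id.split("/")
    return Codec(encoding, int(rate))


print(parse_codec_id("mp3/22050"))  # Codec(encoding='mp3', sample_rate_hz=22050)
```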

Sound file parameters
Parameters affecting the format of the speech file. They can, for example, change the audio speed, volume, pitch, and other sound properties, or set specific values for the ID3v2 tags of a file. All parameters are optional and have default values set by IVONA TTS SaaS. The list of available parameters is growing, and new ones will be added in the future. The following parameters are currently available:
Table 6. Sound file parameters list

BASIC PARAMETERS

parameter name | parameter description | parameter value range | default value | additional info
Prosody-Volume | volume of the recording as a percentage of the voice's original volume | 0-100 | 100 | changes only the default volume used in the sound encoding process; it can be changed further by the sound player or device where the file is used
Prosody-Rate | speed of the recording as a percentage of the voice's original speed | 50-200 | 100 | useful in solutions for visually impaired people (accustomed to faster speech) or in foreign-language learning solutions (where slower speech works better)
Sentence-Break | pause between sentences, in milliseconds | 0-3000 | 400 | useful in solutions that dictate texts to their recipients
Paragraph-Break | pause between paragraphs (separated by empty lines in the uploaded text), in milliseconds | 0-5000 | 650 | useful in solutions that split speech into separate blocks

ID3v2 TAGS SET FOR MP3 FILES

parameter name | parameter description | interpreted by IVONA.com Flash Player? | default value (if not set by user) | value example
Id3v2-TIT2 | Frame TIT2 in ID3v2.4 | yes (will show the name of a file in modes 1 and 2 of the player) | - | my speech file
Id3v2-TPE1 | Frame TPE1 in ID3v2.4 | yes (will show the author of a file in modes 1 and 2 of the player) | www.ivona.com | John Smith
Id3v2-TPE3 | Frame TPE3 in ID3v2.4 | yes (will link the name of a file in modes 1 and 2 of the player) | - | http://hostname/somepage
Id3v2-TPE4 | Frame TPE4 in ID3v2.4 | yes (will show the image assigned to the file in modes 1 and 2 of the player) | - | http://hostname/imagepath
Id3v2-TDTG | Frame TDTG in ID3v2.4 | no | (the time of file encoding) | 2010-02-01T12:00:05
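
Since every basic parameter in Table 6 has a value range and a default, a client can validate its settings before sending a request. The Python sketch below fills in the defaults and rejects out-of-range values. It assumes the ranges are inclusive, which the table does not state explicitly. The resolve_parameters function is a hypothetical helper, and how the parameters are actually passed to createSpeechFile() is not shown in this section.

```python
# (min, max, default) for the basic parameters listed in Table 6.
BASIC_PARAMETERS = {
    "Prosody-Volume": (0, 100, 100),
    "Prosody-Rate": (50, 200, 100),
    "Sentence-Break": (0, 3000, 400),
    "Paragraph-Break": (0, 5000, 650),
}


def resolve_parameters(overrides: dict[str, int]) -> dict[str, int]:
    """Fill in defaults and reject values outside the documented ranges.

    Assumption: both ends of each range are allowed values.
    """
    resolved = {name: default for name, (_, _, default) in BASIC_PARAMETERS.items()}
    for name, value in overrides.items():
        if name not in BASIC_PARAMETERS:
            raise KeyError(f"unknown parameter: {name}")
        low, high, _ = BASIC_PARAMETERS[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")
        resolved[name] = value
    return resolved


# Slower speech with longer sentence pauses, e.g. for dictation.
print(resolve_parameters({"Prosody-Rate": 80, "Sentence-Break": 1000}))
```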

Sound effects
Additional sound effects can be added on special request. Contact us at sales@ivona.com to arrange a separate agreement on creating a modified voice.
Characters price
The “price” of downloading a file, deducted from the user’s account. When a user activates an IVONA TTS SaaS service on their account, a specific number of characters is added to the account. The number of characters added depends on the type of agreement the user has signed with the IVONA.com sales department (for trial services this number is standardized; see http://www.ivona.com/saas.php for details). For each download of a speech file, the number of characters calculated by IVONA TTS SaaS is deducted from the user’s account. This price depends on the size of the text uploaded by the user after it has been processed with the pronunciation rules. The user can always check the price of a specific text with the checkPrice() API method. Every subsequent download of a speech file deducts the character price of that file from the user’s account again.
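
The deduction per download can be illustrated with a short Python sketch. All numbers are made up, the CharacterBalance class is a hypothetical client-side model, and raising an error when the balance is too low is an assumption; the behaviour of the service in that case is not described here.

```python
class CharacterBalance:
    """Tracks the characters remaining in an account (illustrative only)."""

    def __init__(self, characters: int) -> None:
        self.characters = characters

    def download(self, file_price: int) -> int:
        """Deduct the price of one download and return the remaining balance."""
        if file_price > self.characters:
            # Assumption: a download that costs more than the balance is refused.
            raise ValueError("not enough characters left for this download")
        self.characters -= file_price
        return self.characters


balance = CharacterBalance(10_000)  # hypothetical number of characters
price = 1_250                       # hypothetical value reported by checkPrice()
balance.download(price)             # first download
balance.download(price)             # downloading the same file again
print(balance.characters)           # 7500
```

Because each repeated download is charged again, an application that plays the same speech file many times may prefer to store the downloaded file on its own side.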