The IVONA Text-to-Speech synthesizer is a versatile system that correctly transforms most written data into human-like, natural speech. The IVONA synthesizer operates on written, fully expanded words. However, input text documents contain not only full words, such as milk and sugar, but also various other language units, such as numbers (15), dates (3/4/2003), acronyms (USA), abbreviations (i.e.), symbols ($), etc. All individual language units must be first consistently expanded into full words before they get synthesized. This conversion takes place internally within the synthesizer and is called text normalization.
The American and British English IVONA Text-To-Speech voices correctly normalize and synthesize the majority of English texts. This document describes various text normalization processes that all written input data undergoes before being synthesized.
The text normalization processes can be extended by means of the IVONA regular expressions lexicon (described in a separate document) and by using PLS lexicons (W3C Recommendation) which are fully customizable by the end-user.
This section describes how unannotated input text is split into paragraphs, sentences and words.
Paragraphs are separated by empty lines.
Paragraphs may be explicitly marked with SSML elements p.
A sentence contains by default fewer than 1000 characters. Longer sentences are broken into multiple smaller sentences.
Sentences may be explicitly marked with SSML elements s.
A word contains by default fewer than 100 characters. Longer words are broken into multiple smaller words.
Words without any vowels will be spelled out.
IVONA will properly handle words with apostrophes, such as the standard contractions: 'll, 've, 'd, n't, the genitive 's, as well as common phrases such as rock’n'roll or c'mon.
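The segmentation rules above can be approximated in a few lines of Python. This is only a minimal sketch, not IVONA's implementation: the names `split_paragraphs`, `split_long_word`, `has_vowel` and `MAX_WORD_LEN` are hypothetical, and treating y as a vowel is an assumption the text does not confirm.

```python
import re
from typing import List

MAX_WORD_LEN = 100  # hypothetical constant; the text says "fewer than 100 characters"


def split_paragraphs(text: str) -> List[str]:
    """Split text into paragraphs at one or more empty (whitespace-only) lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


def split_long_word(word: str, limit: int = MAX_WORD_LEN - 1) -> List[str]:
    """Break a word into consecutive pieces of at most `limit` characters."""
    return [word[i:i + limit] for i in range(0, len(word), limit)]


def has_vowel(word: str) -> bool:
    """True if the word contains a, e, i, o, u or y (y is an assumption)."""
    return re.search(r"[aeiouy]", word, re.IGNORECASE) is not None
```

Under these assumptions, `has_vowel("pwq")` is false, so a word like pwq would be a candidate for spelling out.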
IVONA accepts all Unicode characters. IVONA handles most characters found in texts based on the Latin script.
Punctuation plays a key role in the way texts are interpreted by the TTS system. IVONA supports the majority of punctuation marks found in English texts. Ultimately, however, all punctuation marks that affect pauses or intonation are mapped to the following marks.
rising or falling
This section describes in general how IVONA normalizes input text, excluding text fragments marked with the SSML say-as element.
This section is not exhaustive. IVONA normalizes various text units but only the most common ones have been included in this description.
A cardinal number is either any single digit (0, 1, …, 9) or a sequence of digits not starting with 0.
Longer cardinal numbers may make use of comma as a thousands separator.
10,000 will be pronounced ten thousand.
256 will be pronounced two hundred (and) fifty six.
4358 will be pronounced four thousand three hundred (and) fifty eight.
1,000 will be pronounced one thousand.
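The definition of a cardinal number can be written as a regular expression. The sketch below is illustrative only; the names `CARDINAL` and `parse_cardinal` are hypothetical, and requiring groups of exactly three digits after each comma is an assumption based on the examples.

```python
import re
from typing import Optional

# A single digit, or digits not starting with 0,
# or a comma-grouped number such as 10,000.
CARDINAL = re.compile(r"(?:\d|[1-9]\d*|[1-9]\d{0,2}(?:,\d{3})+)")


def parse_cardinal(token: str) -> Optional[int]:
    """Return the integer value of a cardinal token, or None if it is not one."""
    if CARDINAL.fullmatch(token) is None:
        return None
    return int(token.replace(",", ""))
```

A token such as 0123 is rejected here, which matches the separate rule that digit sequences starting with 0 are read digit by digit.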
A signed integer consists of a sign character followed immediately by a cardinal number. Valid sign characters are the plus sign (+), the minus sign (−, U+2212) and the plus-minus sign (±). The popular hyphen-minus character (-), as well as other dash-like characters, are also supported as the sign character, but they are ambiguous and should best be avoided.
+5 will be pronounced plus five.
−3,000 will be pronounced minus three thousand.
A cardinal or signed integer followed immediately by the dot and a sequence of digits will be recognized as a real number.
4.5 will be pronounced four point five.
-3.1 will be pronounced minus three point one.
1,000.12 will be pronounced one thousand point one two.
A cardinal number with suffixes st, nd, rd or th is interpreted as an ordinal number. The suffixes st, nd and rd may only be applied to numbers for which the ordinal ends in these letters. The suffix th may be applied to any cardinal number.
21st will be pronounced twenty first.
42nd will be pronounced forty second.
6th will be pronounced sixth.
1,000,000th will be pronounced one millionth.
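The suffix rule can be checked mechanically. The following sketch uses hypothetical names (`ORDINAL`, `ordinal_suffix`, `is_valid_ordinal`) and simply encodes the rule as stated: th is always accepted, while st, nd and rd must match the spelled-out ordinal.

```python
import re

ORDINAL = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)(st|nd|rd|th)")


def ordinal_suffix(n: int) -> str:
    """Return the suffix of the English ordinal for n (1 -> st, 11 -> th, 22 -> nd)."""
    if n % 100 in (11, 12, 13):
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")


def is_valid_ordinal(token: str) -> bool:
    """True if the token is a cardinal with an ordinal suffix allowed by the rule above."""
    match = ORDINAL.fullmatch(token)
    if match is None:
        return False
    number = int(match.group(1).replace(",", ""))
    suffix = match.group(2)
    return suffix == "th" or suffix == ordinal_suffix(number)
```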
A cardinal number followed by s or 's follows the same pattern as regular English plurals, as in the examples below.
60s will be pronounced sixties.
100s will be pronounced one hundreds.
bc 52's will be pronounced b c fifty two’s.
IVONA supports various Roman numerals.
All uppercase Roman numerals with an appropriate lowercase ordinal suffix are pronounced as ordinal numbers.
LIst will be pronounced fifty first.
MMXIth will be pronounced two thousand eleventh.
Uppercase Roman numerals in names of monarchs will be read as ordinal numbers preceded with the word the.
Queen Elizabeth II will be pronounced queen elizabeth the second.
Henry III of England will be pronounced henry the third of england.
Small uppercase and lowercase Roman numerals in other contexts will be pronounced as cardinal numbers.
Chapter XIX will be pronounced chapter nineteen.
World War II will be pronounced world war two.
xxiii will be pronounced twenty three.
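To see which values the Roman numerals in these examples stand for, a standard subtractive-notation conversion is enough. This is a generic sketch rather than IVONA's own logic; `roman_to_int` is a hypothetical name, and the function does not reject malformed numerals such as IIII.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert an uppercase or lowercase Roman numeral to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral.upper()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value placed before a larger one is subtracted (IX = 9).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total
```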
A fraction consists of the following elements in order:
An optional sign character.
An optional whole number (cardinal) followed by the space character.
The numerator (a cardinal number).
The slash (/ U+002F) or the solidus character (⁄ U+2044).
The denominator (a cardinal number)
Fractions with the slash character are recognized only for the most common denominators. Fractions with the solidus character are always correctly recognized.
3/4 will be pronounced three fourths (American) or three quarters (British).
2 1/2 will be pronounced two and a half.
−7 2/3 will be pronounced minus seven and two thirds.
15⁄5678 (solidus only) will be pronounced fifteen five thousand six hundred seventy eighths.
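The fraction layout listed above maps naturally onto a regular expression with named groups. This is a sketch under assumptions: `FRACTION` and `parse_fraction` are hypothetical names, the hyphen-minus is accepted as a sign, and the restriction to common denominators for the plain slash is not modelled.

```python
import re
from typing import Optional, Tuple

FRACTION = re.compile(
    r"(?P<sign>[+\u2212\u00b1-])?"          # optional sign: +, minus sign, plus-minus, hyphen
    r"(?:(?P<whole>\d+) )?"                 # optional whole number followed by a space
    r"(?P<num>\d+)[/\u2044](?P<den>\d+)"    # numerator, slash or solidus, denominator
)


def parse_fraction(text: str) -> Optional[Tuple[str, Optional[int], int, int]]:
    """Return (sign, whole, numerator, denominator), or None if text is not a fraction."""
    match = FRACTION.fullmatch(text)
    if match is None:
        return None
    whole = match.group("whole")
    return (
        match.group("sign") or "",
        int(whole) if whole is not None else None,
        int(match.group("num")),
        int(match.group("den")),
    )
```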
Sequences of more than one digit starting with 0 are always read as a sequence of digits.
Digits in fixed formats, such as telephone numbers or social security numbers, are handled similarly.
0123 will be pronounced oh one two three.
924-51-0387 will be pronounced nine two four five one zero three eight seven.
236-555-1234 will be pronounced two three six five five five one two three four.
IVONA handles a wide variety of commonly as well as rarely used units, including metric and imperial systems. Some unit symbols are always recognized, others need a preceding number.
fl oz will be pronounced fluid ounce.
14'5" will be pronounced fourteen feet five inches.
1h2m30s will be pronounced one hour two minutes thirty seconds.
5 tsp will be pronounced five teaspoons.
1 tbsp will be pronounced one tablespoon.
2.6 GHz will be pronounced two point six gigahertz.
25 MPH will be pronounced twenty five miles per hour.
8 nmi will be pronounced eight nautical miles.
-0.01% will be pronounced minus zero point zero one percent.
90° will be pronounced ninety degrees.
50¢ will be pronounced fifty cents.
40 km/h will be pronounced forty kilometers per hour.
IVONA supports a certain number of currencies in multiple formats. Valid currency symbols include commonly used symbols such as £, $, €, ¥, ₩, $AU, SG$, as well as many of the ISO 4217 currency codes (uppercase only).
The number may be followed by the words million, billion, trillion, or their various abbreviations. In this case the currency will be pronounced at the end.
The value may have a thousands separator which may be either a comma or a space.
$10 will be pronounced ten dollars.
USD5.27 will be pronounced five u s dollars and twenty seven cents.
£5.27 will be pronounced five pounds and twenty seven pence.
GBP 1,000 will be pronounced one thousand pounds.
¥1 million will be pronounced one million yen.
¥5.27 will be pronounced five yen and twenty seven sen.
CHF6M will be pronounced six million swiss francs.
€ 20 000 will be pronounced twenty thousand euros.
C$ 2.3 mn will be pronounced two point three million canadian dollars.
IVONA supports time specified in both the 12-hour and the 24-hour clock.
1:59 will be pronounced one fifty nine.
2:00 will be pronounced two o’clock.
01:59am will be pronounced one fifty nine _a m.
2 AM will be pronounced two _a m.
13:00 will be pronounced thirteen hundred hours.
10:25:30 will be pronounced ten twenty five and thirty seconds.
07:53:10 A.M. will be pronounced seven fifty three and ten seconds _a m.
IVONA also handles duration specified in multiple formats.
5'30" (only for seconds greater than 11) will be pronounced five minutes and thirty seconds.
5m30s will be pronounced five minutes and thirty seconds.
3h10m will be pronounced three hours and ten minutes.
IVONA supports geographic coordinates in the following combinations: degrees, degrees and minutes, and degrees, minutes, and seconds. The recognized symbol for degrees is °. An optional cardinal direction, N, E, S, or W, may be included with an optional space in between the geographic coordinate and the cardinal direction.
74.3°W will be pronounced seventy four point three degrees west.
13°45' N will be pronounced thirteen degrees and forty five minutes north.
29°40'33" will be pronounced twenty nine degrees forty minutes and thirty three seconds.
Only the highest precision component present in the coordinate may be expressed with a decimal component.
74.04° will be pronounced seventy four point zero four degrees.
13°26.5' will be pronounced thirteen degrees and twenty six point five minutes.
98°59'56.01" will be pronounced ninety eight degrees fifty nine minutes and fifty six point zero one seconds.
If the minutes and/or seconds components are under 10 (0, 1, …, 9), they may be preceded by a leading zero, even if they are followed by a decimal component.
89°00' will be pronounced eighty nine degrees and zero minutes.
37°00'05.6" will be pronounced thirty seven degrees zero minutes and five point six seconds.
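The coordinate rules can be summarised as a pattern plus one extra check: only the last component present may carry a decimal part. The sketch below is an approximation with hypothetical names (`COORD`, `parse_coordinate`); it assumes the ASCII apostrophe and double quote are the minute and second marks, as in the examples.

```python
import re
from typing import List, Optional, Tuple

NUMBER = r"\d+(?:\.\d+)?"
COORD = re.compile(
    r"(?P<deg>" + NUMBER + r")°"
    r"(?:(?P<min>" + NUMBER + r")')?"
    r'(?:(?P<sec>' + NUMBER + r')")?'
    r"(?: ?(?P<dir>[NESW]))?"
)


def parse_coordinate(text: str) -> Optional[Tuple[List[float], Optional[str]]]:
    """Return ([degrees, minutes, seconds] as present, direction), or None if invalid."""
    match = COORD.fullmatch(text)
    if match is None:
        return None
    deg, minutes, seconds = match.group("deg", "min", "sec")
    if seconds is not None and minutes is None:
        return None  # seconds without minutes is not one of the listed combinations
    present = [p for p in (deg, minutes, seconds) if p is not None]
    # Only the highest-precision (last) component may have a decimal part.
    if any("." in p for p in present[:-1]):
        return None
    return [float(p) for p in present], match.group("dir")
```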
One-digit numbers for the day and for the month may have an optional leading zero.
Supported formats for month expressions: numbers (4, 04), name (April), abbreviation (Apr).
The year can be expressed with either 2 or 4 digits.
Standard US format (M/D/Y, M-D-Y, M.D.Y), default for American English voices:
12/31/1999 will be pronounced december thirty first nineteen ninety nine.
10-25-99 will be pronounced october twenty fifth nineteen ninety nine.
Dec/31/1999 will be pronounced december thirty first nineteen ninety nine.
April-25-1999 will be pronounced april twenty fifth nineteen ninety nine.
European format (D/M/Y, D-M-Y, D.M.Y), default for British English voices:
12/may/1995 will be pronounced may twelfth nineteen ninety five.
12-Apr-2007 will be pronounced april twelfth two thousand seven.
20.3.2011 will be pronounced march twentieth twenty eleven.
ISO 8601 standard (Y-M-D, Y/M/D, Y.M.D), only 4-digit year:
2007/01/01 will be pronounced january first two thousand seven.
2007-Jan-01 will be pronounced january first two thousand seven.
2007-January-01 will be pronounced january first two thousand seven.
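Because the same numeric string can denote different dates depending on the voice, it may help to see the field order made explicit. This is a simplified sketch: `parse_numeric_date` is a hypothetical helper that handles only all-numeric dates with a 4-digit year, and it leaves out month names and 2-digit years.

```python
import re
from datetime import date


def parse_numeric_date(text: str, order: str = "mdy") -> date:
    """Parse a numeric date split by '/', '-' or '.' using the given field order.

    `order` is a permutation of the letters m, d and y, e.g. "mdy" (US),
    "dmy" (European) or "ymd" (ISO 8601 style).
    """
    fields = re.split(r"[/.-]", text)
    if len(fields) != 3 or sorted(order) != ["d", "m", "y"]:
        raise ValueError("expected three fields and an order such as 'mdy'")
    parts = {key: int(value) for key, value in zip(order, fields)}
    return date(parts["y"], parts["m"], parts["d"])
```

With this helper, 3/4/2003 comes out as March 4 under "mdy" and as April 3 under "dmy", which is the ambiguity the two voice defaults resolve differently.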
Other common formats:
June 2 will be pronounced june second.
Aug. 5, 1921 will be pronounced august fifth nineteen twenty one.
arrive on 3/4 will be pronounced arrive on march fourth.
A number will be read as a year if it is followed by BC or if it is preceded or followed by AD:
1063 A.D. will be pronounced ten sixty three _a d.
IVONA interprets ranges of numbers, measurements, time and date.
ages 3–5 will be pronounced ages three to five.
40Hz–20kHz will be pronounced forty hertz to twenty kilohertz.
June 15-20 will be pronounced june fifteenth to twentieth.
1939-1945 will be pronounced nineteen thirty nine to nineteen forty five.
Most abbreviations will be expanded to full words. There will be no sentence break on the dot sign (full stop) following a supported abbreviation. In order to force a sentence break please use two dot signs: one to mark the abbreviation and one to mark the sentence ending.
i.e. Mr T vs bros Inc. will be interpreted as that is mister t versus brothers incorporated.
Initialisms with a period (dot) following each letter (e.g. U.S., F.B.I.) will be pronounced by spelling out each letter.
Most common initialisms without dots (e.g. US, FBI) will also be recognized as such and properly pronounced.
All vowelless words are recognized as initialisms.
N.Y.P.D. will be pronounced n y p d.
In the US will be pronounced in the u s.
an IT report will be pronounced an i t report.
BBC will be pronounced b b c.
pwq will be pronounced p w q.
In most cases IVONA properly recognizes and normalizes street addresses in the United States and Canada.
159 W. Popplar Av., Ste. 5, St. George, CA 12345 will be pronounced one fifty nine west popplar avenue, suite five, saint george california one two three four five.
IVONA recognizes most American telephone number formats and reads them as series of digits.
(978) 555-2345 will be pronounced as nine seven eight five five five two three four five.
1-800-555-1234 ex. 10 will be pronounced as one eight hundred five five five one two three four extension one zero.
Non-words not described elsewhere will be treated as identifiers. This group includes mixes of letters and digits, such as r121, as well as URLs, e-mail addresses, or fancy proper names unknown to the synthesizer.
Numbers within identifiers such as r121, x01, b987654 will be read in groups of two if they consist of up to 4 digits, and will be read as a series of digits otherwise.
Punctuation characters within identifiers will be pronounced.
er125lp will be pronounced er one twenty five l p.
http://www.ivona.com will be pronounced h t t p colon slash slash w w w dot ivona dot com.
B!0 will be pronounced b exclamation mark zero.
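The digit-grouping rule for identifiers can be sketched as follows. The function name `group_identifier_digits` is hypothetical, and grouping from the right (so that 125 becomes 1 and 25, as in the er125lp example) is an assumption inferred from that single example.

```python
from typing import List


def group_identifier_digits(digits: str) -> List[str]:
    """Split a digit run from an identifier into the chunks that are read out.

    Runs of up to 4 digits are read in groups of two, aligned to the right;
    longer runs are read digit by digit.
    """
    if len(digits) > 4:
        return list(digits)
    head = len(digits) % 2  # a leading single digit when the length is odd
    groups = [digits[:head]] if head else []
    groups += [digits[i:i + 2] for i in range(head, len(digits), 2)]
    return groups
```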
IVONA recognizes U.S. radio call codes (three or four capital letters starting with K or W with an optional -FM suffix) in some contexts, and spells them. Moreover, it recognizes and normalizes FM or AM frequencies.
KIIS 102.7 will be pronounced as k.i.i.s. one oh two point seven.
102.7 FM will be pronounced as one oh two point seven f.m..
KRUX 1360 AM will be pronounced as k.r.u.x. thirteen sixty a.m..
The SSML element say-as gives users the possibility to annotate fragments of text in order to force particular interpretation.
Marking a fragment with say-as disables most default normalization rules, which would have otherwise been applied. Therefore, it is advised to mark text with say-as sparingly, only when the default normalization rules fail and render different speech than expected by the user.
The W3C has published a Working Group Note, SSML 1.0 say-as attribute values, which IVONA mostly follows.
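When say-as markup is generated programmatically, the text and attribute values should be XML-escaped. The helper below is a minimal sketch using Python's standard `xml.sax.saxutils`; the function name `say_as` is hypothetical and is not part of any IVONA API.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, **attrs: str) -> str:
    """Wrap text in an SSML say-as element with escaped content and attributes."""
    parts = ["interpret-as=" + quoteattr(interpret_as)]
    for name, value in attrs.items():
        parts.append(name + "=" + quoteattr(value))
    return "<say-as " + " ".join(parts) + ">" + escape(text) + "</say-as>"
```

For example, `say_as("01/02/03", "date", format="ymd")` yields `<say-as interpret-as="date" format="ymd">01/02/03</say-as>`.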
IVONA will interpret a value as a date, when used within say-as with interpret-as="date". This works just as defined in the W3C note. The format attribute may be set to any of the following: mdy, dmy, ymd, md, dm, ym, my, d, m, y.
<say-as interpret-as="date" format="ymd">01/02/03</say-as> will be pronounced february third two thousand one (American) or the third of february two thousand and one (British).
<say-as interpret-as="date">1234</say-as> will be pronounced twelve thirty four.
Tokens like 1:20:30, 1:20, or 1 can be recognized as duration in hours minutes and seconds by surrounding with say-as having interpret-as="duration". The format attribute should be set to any of the following: hms, hm, ms, h, m, s.
The same say-as tag may be used to recognize tokens like 7'10" as duration in minutes and seconds, and tokens in ISO8601 format like PT2H30M15S, P2D3H, or P2W. In this case the format attribute is ignored.
<say-as interpret-as="duration" format="hm">1:23</say-as> will be pronounced one hour and twenty three minutes.
<say-as interpret-as="duration" format="ms">1:23</say-as> will be pronounced one minute and twenty three seconds.
<say-as interpret-as="duration" format="ms">1:00</say-as> will be pronounced one minute.
<say-as interpret-as="duration" format="hms">1:10:23</say-as> will be pronounced one hour ten minutes and twenty three seconds.
<say-as interpret-as="duration" format="h">10</say-as> will be pronounced ten hours.
<say-as interpret-as="duration">1'23"</say-as> will be pronounced one minute and twenty three seconds.
<say-as interpret-as="duration">P1Y2M3DT4H5M6S</say-as> will be pronounced one year two months three days four hours five minutes and six seconds.
<say-as interpret-as="duration">P1DT12H</say-as> will be pronounced one day and twelve hours.
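The ISO 8601 duration tokens mentioned above can be broken into their components with a single regular expression. This is an illustrative sketch with hypothetical names (`ISO_DURATION`, `parse_iso_duration`). The T separator is made optional here, which is an assumption, so that a token such as P2D3H from the text is also accepted; strict ISO 8601 would require PT before the time part.

```python
import re
from typing import Dict, Optional

ISO_DURATION = re.compile(
    r"P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?"
    r"(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T?(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)


def parse_iso_duration(text: str) -> Optional[Dict[str, int]]:
    """Return the components present in an ISO 8601 style duration, or None."""
    match = ISO_DURATION.fullmatch(text)
    if match is None:
        return None
    parts = {unit: int(v) for unit, v in match.groupdict().items() if v is not None}
    return parts or None  # reject bare "P" or "PT" with no components
```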
Telephone numbers may be marked with the say-as element having interpret-as="telephone". In a telephone number IVONA will read most digits and letters individually, as well as properly read the extension number and the characters * and #.
<say-as interpret-as="telephone">1-800-555-234 ex. 23</say-as> will be pronounced one eight hundred five five five two three four extension two three.
<say-as interpret-as="telephone">*53#</say-as> will be pronounced star five three pound (American) or star five three hash (British).
IVONA will read individual characters for text within the say-as element having interpret-as="characters". The format attribute is ignored. The detail attribute may be used to force pauses, as described in the W3C Note.
<say-as interpret-as="characters">speed</say-as> will be pronounced s p e e d.
<say-as interpret-as="characters" detail="3 1 2">1a3BZ7</say-as> will be pronounced one _a three, b, z seven.
IVONA will attempt to read values within say-as having interpret-as="cardinal" as cardinal numbers. The format and detail attributes are ignored. Roman numerals are supported.
<say-as interpret-as="cardinal">1999</say-as> will be pronounced one thousand nine hundred (and) ninety nine.
<say-as interpret-as="cardinal">CLI</say-as> will be pronounced one hundred (and) fifty one.
IVONA will attempt to read values within say-as having interpret-as="ordinal" as ordinal numbers. The format and detail attributes are ignored. Roman numerals are supported.
<say-as interpret-as="ordinal">21</say-as> will be pronounced twenty first.
<say-as interpret-as="ordinal">VI</say-as> will be pronounced sixth.
IVONA will interpret values within say-as having interpret-as="fraction" as common fractions. The syntax for fractions is any of the following:
["+" | "−" | "±"] cardinal "/" cardinal.
["+" | "±"] cardinal "+" cardinal "/" cardinal.
"−" cardinal "−" cardinal "/" cardinal.
where cardinal is a number as defined in Cardinal numbers above.
<say-as interpret-as="fraction">2/9</say-as> will be pronounced two ninths.
<say-as interpret-as="fraction">3+1/2</say-as> will be pronounced three and one half.
<say-as interpret-as="fraction">−2−3/8</say-as> will be pronounced minus two and three eighths.
Measurements may be marked with say-as having interpret-as="unit" (or interpret-as="measure"). The valid syntax is the following:
symbol [ "2" | "3" | "4" | "²" | "³" ] [ "/" unit ]
number "-" unit
A unit symbol may be almost any of the standard metric, imperial or other unit symbols, e.g. N (newtons), kJ (kilojoules), mi (miles), sqft (square feet), MiB (mebibytes), ly (light years), tbsp (tablespoons), °F (degrees Fahrenheit), psi (pounds per square inch), etc. The unit name does not contain periods (dots). In general the unit symbols are case sensitive, so B is bytes and b is bits, but unambiguous symbols are matched case-insensitively, so that either the proper Hz or improper hz, HZ and hZ will all be treated as the frequency unit hertz.
The SI prefixes as well as binary prefixes may be prepended to unit symbols, if appropriate.
If there is only a unit given without a preceding number, then the singular form will be used.
In unambiguous cases, the letter s may be appended to a symbol to force plural even though the number would need a singular qualifier, e.g. 1mph is one mile per hour, but 1mphs will be one miles per hour.
A unit symbol may be suffixed with a power like 2 or ³, so that m² is square meters and s³ is seconds cubed.
The adjective measurement forces singular unit form, so that whereas 2in is two inches, 2-in is two inch.
<say-as interpret-as="unit">2nmi</say-as> will be pronounced two nautical miles.
<say-as interpret-as="unit">1+1/2tsp</say-as> will be pronounced one and one half teaspoons.
<say-as interpret-as="unit">5m/s2</say-as> will be pronounced five meters per second squared.
<say-as interpret-as="unit">2,100rpm</say-as> will be pronounced two thousand one hundred revolutions per minute.
<say-as interpret-as="unit">2.7µF</say-as> will be pronounced two point seven microfarads.
<say-as interpret-as="unit">km</say-as> will be pronounced kilometer.
<say-as interpret-as="unit">kms</say-as> will be pronounced kilometers.
Street addresses or parts of an address may be marked with say-as having interpret-as="address". This will force special pronunciation of numbers and expansion of abbreviations.
The two-letter US state abbreviation will be expanded only when followed by a ZIP code. However, one may force expansion elsewhere by specifying the attribute format="us-state".
<say-as interpret-as="address">320 W Mt Willson Ct</say-as> will be pronounced three twenty west mount willson court.
<say-as interpret-as="address">rm. 103</say-as> will be pronounced room one oh three.
<say-as interpret-as="address">Ft Worth, TX 12345</say-as> will be pronounced fort worth texas one two three four five.
<say-as interpret-as="address" format="us-state">CO</say-as> will be pronounced colorado.
Radio call codes and frequencies may be marked with say-as having interpret-as="radiostation". This will result in spelling out of U.S. call codes and FM or AM frequencies.
<say-as interpret-as="radiostation">WNYC</say-as> will be pronounced w.n.y.c..
<say-as interpret-as="radiostation">107.3</say-as> will be pronounced one oh seven point three.
<say-as interpret-as="radiostation">1070</say-as> will be pronounced ten seventy.
<say-as interpret-as="radiostation">AM</say-as> will be pronounced a.m..
The role attribute of w and token elements in an SSML document may be used to choose particular pronunciation of homographs. The possible values of this attribute are the following:
ivona:DT — Interpret the word as a determiner.
ivona:IN — Interpret the word as a preposition.
ivona:JJ — Interpret the word as an adjective.
ivona:NN — Interpret the word as a noun.
ivona:VB — Interpret the word as a verb.
ivona:VBD — Interpret the word as a verb in past tense.
ivona:DEFAULT — Use the default sense of the word.
ivona:SENSE_1 — Use the non-default sense of the word, which has a different pronunciation.
In most cases, however, IVONA properly chooses the pronunciation of an ambiguous word, and it doesn’t need to be explicitly marked.
<w role="ivona:VB">read</w> will be pronounced /ˈɹiːd/, as in I will read a book.
<w role="ivona:VBD">read</w> will be pronounced /ˈɹɛd/, as in I have read a book.
<w role="ivona:VB">object</w> will be pronounced /əbˈd͡ʒɛkt/, as in Do you object?.
<w role="ivona:NN">object</w> will be pronounced /ˈɑb.d͡ʒɛkt/ (American) or /ˈɒb.d͡ʒɪkt/ (British), as in This object is big.
<w role="ivona:SENSE_1">lead</w> will be pronounced /ˈlɛd/, as in Lead is very heavy.
As mentioned at the very beginning of this text, it is sometimes necessary to modify texts to be synthesized in order to make them compatible with the system constraints and achieve the expected output. IVONA provides a set of special characters that work only in certain contexts, changing the way texts are being synthesized in terms of pronunciation or intonation. The characters are language-specific and do not apply to other languages unless specified otherwise in the language-specific documentation.
_a will be pronounced ey. This is to disambiguate the letter a in contexts in which the synthesizer would recognize input a as the indefinite article a.
A question mark followed by a caret, also known as a circumflex (?^), can be used to force the intonation of a question to rise. Wh-questions (questions starting with an interrogative pronoun) by default have falling intonation. This can be changed by appending a caret to the question mark.
How are you?^ will result in a rising intonation.
A question mark followed by an underscore (?_) can be used to force the intonation of a question to fall. Yes/No questions by default have a rising intonation. This can be changed by appending the underscore character to the question mark.
Are you all right?_ will result in a falling intonation.