The IVONA Text-to-Speech synthesizer is a versatile system that correctly transforms most written data into human-like, natural speech. The IVONA synthesizer operates on written, fully expanded words. However, input text documents contain not only full words, such as Milch and Zucker, but also various other language units, such as numbers (15), dates (3/4/2003), acronyms (ADAC, AG), abbreviations (z.B.), symbols ($), etc. All individual language units must be first consistently expanded into full words before they get synthesized. This conversion takes place internally within the synthesizer and is called text normalization.
The German IVONA Text-To-Speech voices correctly normalize and synthesize the majority of German texts. This document describes various text normalization processes that all written input data undergoes before being synthesized.
The text normalization processes can be extended by means of the IVONA regular expressions lexicon (described in a separate document) and by using PLS lexicons (W3C Recommendation) which are fully customizable by the end-user.
This section describes how unannotated input text is split into paragraphs, sentences and words.
Paragraphs are separated by empty lines.
Paragraphs may be explicitly marked with SSML elements p.
By default, a sentence contains fewer than 1000 characters. Longer sentences are broken into multiple shorter sentences.
Sentences may be explicitly marked with SSML elements s.
By default, a word contains fewer than 100 characters. Longer words are broken into multiple shorter words.
Words without any vowels will be spelled out.
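The segmentation rules above can be illustrated with a minimal sketch. It is not IVONA's implementation: the function names are hypothetical, and breaking an over-long word at fixed offsets is an assumption, since the text does not say where the synthesizer places the breaks.

```python
import re
from typing import List

MAX_WORD_LEN = 99  # "fewer than 100 characters" by default


def split_paragraphs(text: str) -> List[str]:
    """Split text into paragraphs on empty (or whitespace-only) lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]


def chunk_long_word(word: str, limit: int = MAX_WORD_LEN) -> List[str]:
    """Break a word into pieces of at most `limit` characters (assumed strategy)."""
    return [word[i:i + limit] for i in range(0, len(word), limit)]
```

The same chunking idea would apply to sentences with a limit of 999 characters.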
IVONA accepts all Unicode characters. IVONA handles most characters found in texts based on the Latin script.
Punctuation plays a key role in the way texts are interpreted by the TTS system. IVONA supports the majority of punctuation marks found in German texts. However, in the end all punctuation marks that have an effect on pauses or intonation are mapped to the following marks.
rising or falling
This section describes in general how IVONA normalizes input text, excluding text fragments marked with the SSML say-as tag.
This section is not exhaustive. IVONA normalizes various text units but only the most common ones have been included in this description.
A cardinal number is either any single digit (0, 1, …, 9) or a sequence of digits not starting with 0.
Longer cardinal numbers may make use of a dot as a thousands separator.
10.000 will be pronounced zehntausend.
256 will be pronounced zweihundertsechsundfünfzig.
4358 will be pronounced viertausenddreihundertachtundfünfzig.
1.000 will be pronounced eintausend.
A signed integer consists of a sign character followed immediately by a cardinal number. Valid sign characters are the plus sign (+) and the minus sign (−, U+2212). The popular hyphen-minus character (-), as well as other dash-like characters, are also supported as the sign character, but they are ambiguous and should best be avoided.
+5 will be pronounced plus fünf.
−3.000 will be pronounced minus dreitausend.
A cardinal or signed integer followed immediately by a comma and a sequence of digits will be recognized as a real number.
4,5 will be pronounced vier komma fünf.
-3,1 will be pronounced minus drei komma eins.
1.000,12 will be pronounced eintausend komma zwölf.
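The definitions of cardinal numbers, signed integers, and real numbers given above can be expressed as regular expressions. The sketch below is only an illustration of those definitions; the pattern names and the `classify` helper are hypothetical and are not part of IVONA.

```python
import re
from typing import Optional

# Cardinal: dot-separated thousands groups, or digits not starting with 0,
# or any single digit.
CARDINAL = r"(?:[1-9]\d{0,2}(?:\.\d{3})+|[1-9]\d*|\d)"
# Plus sign, minus sign (U+2212), or the ambiguous hyphen-minus.
SIGN = "[+\u2212-]"

PATTERNS = [
    ("real number", re.compile(SIGN + "?" + CARDINAL + r",\d+")),
    ("signed integer", re.compile(SIGN + CARDINAL)),
    ("cardinal", re.compile(CARDINAL)),
]


def classify(token: str) -> Optional[str]:
    """Return the name of the first category whose pattern matches the whole token."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return name
    return None
```

With these patterns, `classify("1.000,12")` yields `"real number"`, while a token with a leading zero such as `0123` matches none of the three categories.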
A cardinal followed by a dot is interpreted as an ordinal number. The exception is a cardinal followed by a dot at the end of a sentence: there the dot only marks the end of the sentence and does not affect the interpretation of the numeral. To have a number at the end of a sentence read as an ordinal, add a second dot.
21. will be pronounced einundzwanzigste.
42. will be pronounced zweiundvierzigste.
6. will be pronounced sechste.
1.000.000. will be pronounced millionste.
Ich bin der 5.. will be pronounced ich bin der fünfte.
Suffixes such as ter, ten and other inflections may be applied to any cardinal number.
60es will be pronounced sechziges.
200en will be pronounced zweihunderten.
7er will be pronounced siebener.
1000em will be pronounced eintausendem.
5ter. will be pronounced fünfter.
25ten will be pronounced fünfundzwanzigsten.
Numbers starting with 0 are always read as a sequence of digits.
0123 will be pronounced null eins zwei drei.
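The leading-zero rule amounts to a digit-by-digit reading. A minimal sketch follows; the helper names are hypothetical.

```python
DIGIT_NAMES = ["null", "eins", "zwei", "drei", "vier",
               "fünf", "sechs", "sieben", "acht", "neun"]


def has_leading_zero(token: str) -> bool:
    """True for multi-digit tokens that start with 0."""
    return len(token) > 1 and token.isdecimal() and token[0] == "0"


def read_as_digits(token: str) -> str:
    """Read each digit separately, as described for numbers starting with 0."""
    return " ".join(DIGIT_NAMES[int(ch)] for ch in token)
```

For example, `read_as_digits("0123")` returns `"null eins zwei drei"`.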
IVONA supports a certain number of currencies in multiple formats. Valid currency symbols include commonly used symbols such as £, $, €, ¥.
The number may be followed by the words Million or Milliarde. In this case the currency will be pronounced at the end.
$10 will be pronounced zehn dollar.
5,27$ will be pronounced fünf dollar und siebenundzwanzig cent.
£5.27 will be pronounced fünf pfund und siebenundzwanzig pence.
1000,2¥ will be pronounced eintausend yen und zwanzig sen.
¥1 Million will be pronounced eine million yen.
5,27€ will be pronounced fünf euro und siebenundzwanzig cent.
€5,27 will be pronounced fünf euro und siebenundzwanzig cent.
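The examples above show the currency symbol on either side of the amount and a one-digit fraction being read as tens of the minor unit (1000,2¥ gives zwanzig sen). The following sketch reproduces only that parsing step. It is an assumption-laden illustration: `parse_amount` is a hypothetical helper, and thousands separators and the Million/Milliarde suffixes are deliberately not handled.

```python
import re
from typing import Optional, Tuple

SYMBOLS = "£$€¥"
AMOUNT = r"(\d+)(?:[.,](\d{1,2}))?"
SYMBOL_CLASS = "[" + re.escape(SYMBOLS) + "]"
PREFIX = re.compile("(" + SYMBOL_CLASS + ")" + AMOUNT)  # e.g. $10, £5.27
SUFFIX = re.compile(AMOUNT + "(" + SYMBOL_CLASS + ")")  # e.g. 5,27€


def parse_amount(token: str) -> Optional[Tuple[str, int, int]]:
    """Return (symbol, major units, minor units) or None if the token does not match."""
    match = PREFIX.fullmatch(token)
    if match:
        symbol, major, minor = match.group(1), match.group(2), match.group(3)
    else:
        match = SUFFIX.fullmatch(token)
        if match is None:
            return None
        major, minor, symbol = match.group(1), match.group(2), match.group(3)
    # A single fractional digit counts tens: ",2" means 20 minor units.
    minor_value = int(minor.ljust(2, "0")) if minor else 0
    return symbol, int(major), minor_value
```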
IVONA supports many units in different contexts.
3°C will be pronounced drei grad celsius.
des 7. l will be pronounced des siebten liters.
wegen 1 s will be pronounced wegen einer sekunde.
gemessen in t will be pronounced gemessen in tonnen.
Some units can have SI prefixes, such as m for milli or G for giga. Bytes and bits can have IEC prefixes, also known as binary prefixes, such as Ki for kibi.
2 kg will be pronounced zwei kilogramm.
10 ml will be pronounced zehn milliliter.
1 KiB will be pronounced ein kibibyte.
Units raised to 2nd or 3rd power will be expanded accordingly.
1mm^2 will be pronounced ein quadratmillimeter.
10s2 will be pronounced zehn quadratsekunde.
3m³ will be pronounced drei kubikmeter.
It’s also possible to use compound units, i.e. where one is divided by another.
gemessen in km/h will be pronounced gemessen in kilometern pro stunde.
100Ew./km2 will be pronounced ein hundert einwohner pro quadratkilometer.
IVONA supports time specified in both the 12-hour and the 24-hour clock.
1:59 will be pronounced ein uhr neunundfünfzig.
2:00 will be pronounced zwei uhr.
01:59am will be pronounced ein uhr neunundfünfzig a m.
2 AM will be pronounced zwei a m.
13:00 will be pronounced dreizehn uhr.
One-digit numbers for the day and for the month may have an optional leading zero.
Supported formats for month expressions: numbers (4, 04), name (April), abbreviation (Apr).
The year can be expressed with either 2 or 4 digits.
European format (D/M/Y, D-M-Y, D.M.Y), default for all German voices:
12/Mai/1995 will be pronounced zwölfte mai neunzehn hundert fünfundneunzig.
12-Apr-2007 will be pronounced zwölfte april zweitausendsieben.
20.3.2011 will be pronounced zwanzigste märz zweitausendelf.
Standard US format (M/D/Y, M-D-Y, M.D.Y):
12/31/1999 will be pronounced einunddreißigste dezember neunzehn hundert neunundneunzig.
10-25-1999 will be pronounced fünfundzwanzigste oktober neunzehn hundert neunundneunzig.
Dez/31/1999 will be pronounced einunddreißigste dezember neunzehn hundert neunundneunzig.
April-25-1999 will be pronounced fünfundzwanzigste april neunzehn hundert neunundneunzig.
ISO 8601 standard (Y-M-D, Y/M/D, Y.M.D), handles only 4-digit years:
2007/01/01 will be pronounced erste januar zweitausendsieben.
2007-Jan-01 will be pronounced erste januar zweitausendsieben.
2007-Januar-01 will be pronounced erste januar zweitausendsieben.
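The three date families differ only in field order, so a caller preparing text can check how a token would be read under a given order. The sketch below is a hypothetical helper, not IVONA's parser: it accepts numeric months, full German month names, and abbreviations of at least three letters, and it leaves two-digit years unexpanded because the text does not state how they are resolved.

```python
import re
from typing import Optional, Tuple

MONTHS = ["januar", "februar", "märz", "april", "mai", "juni", "juli",
          "august", "september", "oktober", "november", "dezember"]


def month_number(text: str) -> Optional[int]:
    """Map 4, 04, April or Apr to 4; return None if unrecognized."""
    if text.isdecimal():
        value = int(text)
        return value if 1 <= value <= 12 else None
    key = text.lower().rstrip(".")
    for index, name in enumerate(MONTHS, start=1):
        if len(key) >= 3 and name.startswith(key):
            return index
    return None


def parse_date(token: str, order: str = "dmy") -> Optional[Tuple[int, int, int]]:
    """Return (day, month, year) for the given field order: dmy, mdy or ymd."""
    parts = re.split(r"[/.\-]", token)
    if len(parts) != 3 or not all(parts):
        return None
    fields = dict(zip(order, parts))
    if not (fields["d"].isdecimal() and fields["y"].isdecimal()):
        return None
    day, month = int(fields["d"]), month_number(fields["m"])
    if month is None or not 1 <= day <= 31:
        return None
    return day, month, int(fields["y"])
```

Under the default European order, `parse_date("12/31/1999")` returns `None` because 31 is not a valid month, while `parse_date("12/31/1999", "mdy")` returns `(31, 12, 1999)`.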
Most abbreviations will be expanded to full words. There will be no sentence break on the dot sign (full stop) following a supported abbreviation. In order to force a sentence break please use two dot signs: one to mark the abbreviation and one to mark the sentence ending.
z.B. Hr. Fischer von Friedrichstr.. will be interpreted as zum beispiel herr fischer von friedrichstraße.
Initialisms with a period (dot) following each letter (e.g. U.S.A.) will be pronounced by spelling out each letter.
Most common initialisms without dots (e.g. USA, ARD) will also be recognized as such and pronounced properly.
All vowelless words are recognized as initialisms.
F.A.Z. will be pronounced f a z.
in den USA will be pronounced in den u s a.
ARD will be pronounced a r d.
pwq will be pronounced p w q.
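The vowelless-word rule can be approximated as follows. This is a sketch under stated assumptions: the vowel set (including y and the umlauts) is a guess, since the text does not list which letters IVONA treats as vowels, and both function names are hypothetical.

```python
VOWELS = set("aeiouyäöü")  # assumed vowel inventory


def is_vowelless(word: str) -> bool:
    """True if the word consists only of letters and none of them is a vowel."""
    letters = word.lower()
    return letters.isalpha() and not any(ch in VOWELS for ch in letters)


def spell_out(word: str) -> str:
    """Separate the letters so that each one is read individually."""
    return " ".join(word.lower())
```

Note that ARD and USA contain vowels, so a check like this would not catch them; per the text they are recognized as common initialisms instead.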
IVONA currently supports various standard (DIN 5008 and E.123) and non-standard telephone number formats. It groups the digits into 2- or 3-digit numbers and adds a pause after each group. It also recognizes extension numbers in certain formats and adds the word Durchwahl before such sequences.
0180-1234050 will be pronounced as null eins, achtzig, einhundertdreiundzwanzig, vierzig, fünfzig.
0201 12-46542 will be pronounced as null zwei, null eins, zwölf, vierhundertfünfundsechzig, zweiundvierzig.
+49 (0) 6251 / 1 75 29 - 0 will be pronounced as plus neunundvierzig, null, zweiundsechzig, einundfünfzig, eins, fünfundsiebzig, neunundzwanzig Durchwahl null.
+49 30 588459-258 will be pronounced as plus neunundvierzig, dreißig, achtundfünfzig, vierundachtzig, neunundfünfzig Durchwahl zweihundertachtundfünfzig.
0043 5226 2789-20 will be pronounced as null null vier drei, zweiundfünfzig, sechsundzwanzig, siebenundzwanzig, neunundachtzig Durchwahl zwanzig.
Non-words not described elsewhere will be treated as identifiers. This group includes mixes of letters and digits, such as r121, as well as URLs, e-mail addresses, and unusual proper names unknown to the synthesizer.
Punctuation characters within identifiers will be pronounced.
er125lp will be pronounced er einhundertfünfundzwanzig l p.
http://www.ivona.com will be pronounced h t t p doppelpunkt schrägstrich schrägstrich w w w punkt ivona punkt com.
B!0 will be pronounced b ausrufezeichen null.
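The identifier examples suggest that letters, digits, and punctuation are read as separate pieces. A rough sketch of such a split is shown below; the regular expression and the function name are assumptions made for illustration, not the synthesizer's actual tokenizer.

```python
import re
from typing import List

# Runs of letters, runs of digits, or a single punctuation character.
IDENTIFIER_PART = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def split_identifier(token: str) -> List[str]:
    """Split an identifier into letter runs, digit runs and punctuation marks."""
    return IDENTIFIER_PART.findall(token)
```

Each piece would then be read according to its own type: digit runs as numbers, punctuation by name, and letter runs either as words or spelled out.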
The SSML element say-as gives users the possibility to annotate fragments of text in order to force particular interpretation.
Marking a fragment with say-as disables most default normalization rules that would otherwise have been applied. It is therefore advisable to use say-as sparingly, only when the default normalization rules fail and produce speech different from what the user expects.
The W3C has issued a Working Group Note, SSML 1.0 say-as attribute values, which IVONA mostly follows.
IVONA will interpret a value as a date, when used within say-as with interpret-as="date". This works just as defined in the W3C note. The format attribute may be set to any of the following: mdy, dmy, ymd, md, dm, ym, my, y, d, m.
<say-as interpret-as="date" format="mdy">05/02/03</say-as> will be pronounced zweite mai zweitausenddrei.
<say-as interpret-as="date" format="dmy">05/02/03</say-as> will be pronounced fünfte februar zweitausenddrei.
<say-as interpret-as="date" format="ymd">05/02/03</say-as> will be pronounced dritte februar zweitausendfünf.
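When say-as markup is generated programmatically, building the element with an XML library avoids escaping mistakes. The sketch below uses Python's standard `xml.etree.ElementTree`; the `say_as` helper itself is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Return a serialized SSML say-as element wrapping `text`."""
    element = ET.Element("say-as", {"interpret-as": interpret_as})
    if fmt is not None:
        element.set("format", fmt)
    element.text = text  # special characters such as & are escaped on output
    return ET.tostring(element, encoding="unicode")
```

Calling `say_as("05/02/03", "date", "dmy")` produces a say-as element equivalent to the second date example above.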
Tokens such as 1:20:30, 1:20, or 1 can be read as a duration in hours, minutes, and seconds by wrapping them in say-as with interpret-as="duration". The format attribute should be set to one of the following: hms, hm, ms, h, m, s.
The same say-as tag may be used to recognize tokens like 7'10" as duration in minutes and seconds, and tokens in ISO8601 format like PT2H30M15S, P2D3H, or P2W. In this case the format attribute is ignored.
<say-as interpret-as="duration" format="hm">1:23</say-as> will be pronounced eine stunde und dreiundzwanzig minuten.
<say-as interpret-as="duration" format="ms">1:23</say-as> will be pronounced eine minute und dreiundzwanzig sekunden.
<say-as interpret-as="duration" format="ms">1:00</say-as> will be pronounced eine minute.
<say-as interpret-as="duration" format="hms">1:10:23</say-as> will be pronounced eine stunde zehn minuten und dreiundzwanzig sekunden.
<say-as interpret-as="duration" format="h">10</say-as> will be pronounced zehn stunden.
<say-as interpret-as="duration">1'23"</say-as> will be pronounced eine minute und dreiundzwanzig sekunden.
<say-as interpret-as="duration">P1Y2M3DT4H5M6S</say-as> will be pronounced ein jahr zwei monate drei tage vier stunden fünf minuten und sechs sekunden.
<say-as interpret-as="duration">P1DT12H</say-as> will be pronounced ein tag und zwölf stunden.
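To see which components an ISO 8601 duration token carries before sending it to the synthesizer, it can be decomposed with a regular expression. This sketch follows the strict ISO form, in which time components come after T; it does not accept looser tokens such as P2D3H, and `parse_iso_duration` is a hypothetical helper rather than an IVONA function.

```python
import re
from typing import Dict, Optional

ISO_DURATION = re.compile(
    r"P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?"
    r"(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)


def parse_iso_duration(token: str) -> Optional[Dict[str, int]]:
    """Return the components present in the token, or None if it is not a duration."""
    match = ISO_DURATION.fullmatch(token)
    if match is None:
        return None
    parts = {name: int(value)
             for name, value in match.groupdict().items()
             if value is not None}
    return parts or None  # a bare "P" or "PT" carries no components
```

For instance, `parse_iso_duration("P1DT12H")` returns `{"days": 1, "hours": 12}`.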
IVONA will attempt to read values within say-as having interpret-as="unit" as units. The context is also taken into account.
<say-as interpret-as="unit">2 kg</say-as> will be pronounced zwei kilogramm.
<say-as interpret-as="unit">3m³</say-as> will be pronounced drei kubikmeter.
gemessen in <say-as interpret-as="unit">km/h</say-as> will be pronounced gemessen in kilometern pro stunde.
<say-as interpret-as="unit">100Ew./km2</say-as> will be pronounced ein hundert einwohner pro quadratkilometer.
IVONA will read individual characters for text within the say-as element having interpret-as="characters". The format attribute is ignored. The detail attribute may be used to force pauses, as described in the W3C Note.
<say-as interpret-as="characters">achtzig</say-as> will be pronounced a c h t z i g.
<say-as interpret-as="characters">1a3BZ7</say-as> will be pronounced eins a drei b z sieben.
IVONA will attempt to read values within say-as having interpret-as="cardinal" as cardinal numbers. The format and detail attributes are ignored. Roman numerals are supported.
<say-as interpret-as="cardinal">13</say-as> will be pronounced dreizehn.
<say-as interpret-as="cardinal">C</say-as> will be pronounced einhundert.
IVONA will attempt to read values within say-as having interpret-as="ordinal" as ordinal numbers. The format and detail attributes are ignored. Roman numerals are supported.
<say-as interpret-as="ordinal">986</say-as> will be pronounced neunhundertsechsundachtzigste.
<say-as interpret-as="ordinal">C</say-as> will be pronounced hundertste.
IVONA will attempt to read values within say-as having interpret-as="digits" as digits.
<say-as interpret-as="digits">123</say-as> will be pronounced eins zwei drei.
<say-as interpret-as="digits">C</say-as> will be pronounced eins null null.
IVONA will interpret values within say-as having interpret-as="fraction" as common fractions. The syntax for fractions is as follows:
["+" | "−"] cardinal "/" cardinal.
where cardinal is a number as defined in Cardinal numbers above.
<say-as interpret-as="fraction">15/2</say-as> will be pronounced fünfzehn zweitel.
<say-as interpret-as="fraction">-1/2</say-as> will be pronounced minus ein zweitel.
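The fraction syntax can be checked with a short regular expression. The sketch below is illustrative only: it uses the simple form of a cardinal (no thousands separators), also accepts the hyphen-minus used in the second example, and `parse_fraction` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

SIGN = "[+\u2212-]"  # plus, minus sign (U+2212), hyphen-minus
CARDINAL = r"(?:[1-9]\d*|\d)"
FRACTION = re.compile("(" + SIGN + ")?(" + CARDINAL + ")/(" + CARDINAL + ")")


def parse_fraction(token: str) -> Optional[Tuple[int, int, int]]:
    """Return (sign, numerator, denominator) or None if the syntax does not match."""
    match = FRACTION.fullmatch(token)
    if match is None:
        return None
    sign = -1 if match.group(1) in ("\u2212", "-") else 1
    return sign, int(match.group(2)), int(match.group(3))
```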
As mentioned at the very beginning of this text, it is sometimes necessary to modify texts to be synthesized in order to make them compatible with the system constraints and achieve the expected output. IVONA provides a set of special characters that work only in certain contexts, changing the way texts are being synthesized in terms of pronunciation or intonation. The characters are language-specific and do not apply to other languages unless specified otherwise in the language-specific documentation.
The non-ASCII German letters ä, ö, ü and ß can be replaced by two-letter combinations (ae, oe, ue and ss respectively) commonly used when typing of German special characters is not possible or difficult. In most cases the pronunciation of both versions should be identical.
Bürger will be pronounced the same way as Buerger.
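For systems that cannot emit the non-ASCII letters, the replacement described above is a plain character mapping. A minimal sketch with a hypothetical helper name:

```python
REPLACEMENTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def to_ascii_german(text: str) -> str:
    """Replace German special letters with their two-letter combinations."""
    return text.translate(REPLACEMENTS)
```

For example, `to_ascii_german("Bürger")` returns `"Buerger"`.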
A question mark followed by a caret, also known as a circumflex (?^), can be used to force the intonation of a question to rise. Wh-questions (questions starting with an interrogative pronoun) have falling intonation by default. This can be changed by appending a caret to the question mark.
Wo bist du?^ will result in a rising intonation.
A question mark followed by an underscore (?_) can be used to force the intonation of a question to fall. Yes/No questions by default have a rising intonation. This can be changed by appending the underscore character to the question mark.
Alles klar?_ will result in a falling intonation.
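A preprocessing script can detect these two markers to report which intonation a sentence requests. The sketch below is a hypothetical helper for inspecting input text; it does not reflect how IVONA handles the markers internally.

```python
from typing import Optional, Tuple


def intonation_override(sentence: str) -> Tuple[str, Optional[str]]:
    """Return the sentence without the marker and the forced intonation, if any."""
    if sentence.endswith("?^"):
        return sentence[:-1], "rising"
    if sentence.endswith("?_"):
        return sentence[:-1], "falling"
    return sentence, None
```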