TTS Documentation

SSML Support in IVONA Text-To-Speech

1. SSML Specification

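IVONA 2 Text-To-Speech supports both SSML 1.0 and SSML 1.1 as defined by the following standards:

  • Speech Synthesis Markup Language (SSML) Version 1.0, W3C Recommendation 7 September 2004, http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/

  • Speech Synthesis Markup Language (SSML) Version 1.1, W3C Recommendation 7 September 2010, http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/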

A specific version may be chosen explicitly by setting the version attribute of the <speak> document element.

The default version is SSML 1.1.

For convenience, constructs specific to SSML 1.0 are handled in SSML 1.1 documents and vice versa. However, in such cases a warning is issued. There is only one case in which the same markup is interpreted differently by the two SSML versions: the unsigned percentage format of the prosody rate attribute.

1.1. Exceptions to SSML Standards

IVONA 2 Text-To-Speech is a Conforming Speech Synthesis Markup Language Processor, as defined by SSML 1.1. However, it is not a Conforming Extended Speech Synthesis Markup Language Processor, because of the following exceptions:

  • The supported <prosody> changes are volume and rate.

  • Separate tokens within a <w> or <token> are still being treated as separate tokens for lexical lookup purposes.

  • The supported audio file format is WAV (RIFF header) 16-bit mono PCM.

A detailed description of the SSML interpretation of all elements and attributes is given below.

2. Interpretation of Elements and Attributes

In some cases IVONA 2 TTS does not fully support the SSML standards. In other cases the SSML standards do not impose restrictions and leave interpretation decisions to SSML processors. Both kinds of cases are described in detail in the following subsections.

2.1. General Attributes

2.1.1. Language: xml:lang Attribute

IVONA 2 TTS issues a warning whenever the value of the xml:lang attribute does not match the language of the current voice.

2.1.2. Language Speaking Failure: onlangfailure Attribute

Values defined by SSML standard:

  • changevoice — Interpreted as defined by the SSML standards.

  • ignoretext — Text in the scope of this element is not synthesized. This may be overridden in descendant elements either by changing the language (xml:lang) to one supported by the voice, or by setting the onlangfailure attribute to any value other than ignoretext.

  • ignorelang — Interpreted as defined by the SSML standards.

  • processorchoice — Interpreted as defined by the SSML standards. The default value is ignorelang.
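As an illustration of the ignoretext behaviour described above, the following sketch assumes that the current voice is a US English voice with no support for French. The sentences are made up for this example, and the exact warnings issued are not shown.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    This sentence is read by the current voice.

    <!-- Assumption: the voice cannot speak French, so text in this
         scope is not synthesized. -->
    <p xml:lang="fr-FR" onlangfailure="ignoretext">
        Cette phrase est ignorée.

        <!-- Changing the language back to a supported one
             re-enables synthesis for this sentence. -->
        <s xml:lang="en-US">This nested sentence is read again.</s>

        <!-- Overriding onlangfailure with a value other than
             ignoretext also re-enables synthesis. -->
        <s onlangfailure="ignorelang">Cette phrase est lue.</s>
    </p>
</speak>
```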

2.2. speak Root Element

2.2.1. Version: version Attribute

A specific SSML version may be chosen explicitly by setting the version attribute of the <speak> document element to either 1.0 or 1.1. The default is 1.1. For convenience, constructs specific to SSML 1.0 are handled in SSML 1.1 documents, and vice versa. There are only a few cases in which the same SSML document fragment is interpreted differently by the two SSML versions.
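A minimal sketch of a document that requests SSML 1.0 interpretation explicitly. The text content is illustrative only. Removing the version attribute would select the default, SSML 1.1.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- version="1.0" selects SSML 1.0 rules for this document. -->
    This document is interpreted according to SSML 1.0.
</speak>
```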

2.2.2. Trimming: startmark and endmark Attributes

Interpreted as defined by the SSML standards.
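For reference, a short sketch of trimming as defined by SSML 1.1. The mark names "begin" and "end" are arbitrary names chosen for this example.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US" startmark="begin" endmark="end">
    This introduction comes before the start mark and is not rendered.
    <mark name="begin"/>
    Only this part, between the two marks, is rendered.
    <mark name="end"/>
    This ending comes after the end mark and is not rendered.
</speak>
```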

2.2.3. Base URI: xml:base Attribute

Interpreted as defined by the SSML standards.

2.3. lexicon Element

Interpreted as defined by the SSML standards.

Lexicons of type other than application/pls+xml are not supported.

2.4. lookup Element

Interpreted as defined by the SSML standards.

2.5. meta and metadata Elements

Ignored.

2.6. p and s Elements

Interpreted as defined by the SSML standards.

These elements insert either a strong (for <s>) or an extra strong (for <p>) prosodic break directly before and after the element’s content.

2.7. token and w Elements

Standard tokenization rules are used for the contents of token and w elements, so, contrary to the SSML standard, the content might be broken down into multiple tokens.

The role attribute is interpreted as defined by the PLS 1.0 specification.

The role attribute may also be used for explicit homograph disambiguation for American and British English voices. This attribute may include the following values:

  • ivona:VB — Interpret the word as a verb (present simple).

  • ivona:VBD — Interpret the word as a past participle.

  • ivona:NN — Interpret the word as a noun.

  • ivona:SENSE_1 — Use the non-default sense of the word, which has a different pronunciation.

Please see some examples in the Examples section.

2.8. say-as Element

Accepted values of the interpret-as attribute:

  • characters, spell-out — Spell out each letter. The detail attribute may be used for grouping of characters, as defined by the SSML 1.0 say-as attribute values W3C Working Group Note.

  • cardinal, number — Interpret the value as a cardinal number.

  • ordinal — Interpret the value as an ordinal number.

  • digits — Spell each digit separately.

  • fraction — Interpret the value as a fraction.

  • unit — Interpret a value as a measurement. (Only American and British English voices).

  • date — Interpret the value as a date. The format attribute must be set to one of the following: mdy, dmy, ymd, md, dm, ym, my, d, m, y. The VXML date format YYYYMMDD, with ? characters standing in for unspecified fields, is also supported. In that case, the format attribute is ignored.

  • time — Interpret a value such as 1’21" as duration in minutes and seconds.

  • duration — Interpret as duration (only English and German voices):

    • duration in ISO8601 format such as P1H30M2S;

    • a value such as 1’21" as minutes and seconds;

    • a value such as 1:21:31, 1:21, or 1 — the format attribute should be set to any of the following: hms, hm, ms, h, m, s.

  • telephone — Interpret a value as a telephone number. (Only American and British English voices).

  • address — Interpret a value as part of street address. (Only American and British English voices).

  • radiostation — Interpret a value as U.S. radio call sign or radio frequency. (Only English voices).

Please refer to text interpretation documentation for specific languages for more information on how the say-as element is handled in those languages.

IVONA 2 TTS user lexicon is not applied to text within say-as elements.
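The following sketch shows a few of the interpret-as values listed above in one document. It assumes a US English voice, and the sample values are made up for illustration. How each value is actually spoken depends on the language of the voice.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- Spell out each letter. -->
    <say-as interpret-as="characters">IVONA</say-as>

    <!-- The same digits, first as one cardinal number,
         then digit by digit. -->
    <say-as interpret-as="cardinal">2015</say-as>
    <say-as interpret-as="digits">2015</say-as>

    <!-- A date in month/day/year order; format selects the order. -->
    <say-as interpret-as="date" format="mdy">10/5/2011</say-as>

    <!-- A duration given as hours:minutes:seconds. -->
    <say-as interpret-as="duration" format="hms">1:21:31</say-as>
</speak>
```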

2.9. phoneme Element

The supported phonetic alphabets are:

  • "ipa" — The International Phonetic Alphabet (IPA).

  • "x-sampa" — The Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA).

  • "nt-sampa" or "navteq" — The NT-SAMPA phonetic transcription, as defined by NAVTEQ™. This transcription is language-dependent. The particular language to be used is determined by the currently active xml:lang attribute, or by specifying the alphabet explicitly as "nt-sampa-XXX" (or "navteq-XXX"), where the XXX part is a three-letter code of the phonetic language, as defined by NAVTEQ™.

2.10. sub Element

Interpreted as defined by the SSML standards.
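A minimal sketch of the sub element: the alias attribute holds the text that is spoken in place of the element content. The sentence is illustrative only.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- "W3C" is written, "World Wide Web Consortium" is spoken. -->
    The SSML standards are published by the
    <sub alias="World Wide Web Consortium">W3C</sub>.
</speak>
```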

2.11. lang Element

The interpretation of the xml:lang and onlangfailure attributes is given in the General Attributes section above.

2.12. voice Element

Interpreted as defined by the SSML standards.

Interpretation of voice attributes:

  • gender — Interpreted as defined by the SSML standards.

  • age — Interpreted as defined by the SSML standards. Additionally, the following values are accepted: child, teen, adult and senior.

  • variant — This attribute is always applied as the last one; if the value of the variant attribute exceeds the number of available voices, then it will be wrapped to match a voice.

  • name — Interpreted as defined by the SSML standards. The names are compared caseless and diacritic-less, so for example the name “penelope” will match the voice “Penélope”.

  • languages — The accent is ignored; if both xml:lang and languages are present, the latter is used, even if SSML version is 1.0.

  • required — Ignores the variant value.

  • ordering — Ignores the variant value.

  • onvoicefailure — Interpreted as defined by the SSML standards. The default value is priorityselect.
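A short sketch of voice selection using the attributes above. It assumes that a voice named "Penélope" and at least one female adult voice are installed. Which voice is actually chosen depends on the installation.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- Names are compared without case and diacritics, so this
         value matches the voice "Penélope" if it is available. -->
    <voice name="penelope">Hola.</voice>

    <!-- Selection by features. The age keyword "adult" is one of the
         additional values accepted by IVONA 2 TTS. The variant is
         applied last and wraps around if it exceeds the number of
         matching voices. -->
    <voice gender="female" age="adult" variant="2">
        This sentence is read by a female adult voice.
    </voice>
</speak>
```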

2.13. emphasis Element

For now, IVONA 2 TTS handles emphasis elements by adjusting volume and speaking rate.

Interpretation of distinct level attribute values:

  • strong — Strong increase of volume and decrease of speaking rate.

  • moderate — Moderate increase of volume and decrease of speaking rate.

  • none — No prosodic change.

  • reduced — Decrease of volume, increase of speaking rate.

  • Other values or missing attribute — same as moderate.
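The levels above can be tried out with a sketch like the following. The sentences are illustrative, and the comments restate the interpretation given in the list.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- Strong increase of volume, decrease of speaking rate. -->
    This is <emphasis level="strong">very</emphasis> important.

    <!-- No level attribute: handled the same as level="moderate". -->
    This is <emphasis>quite</emphasis> important.

    <!-- Decrease of volume, increase of speaking rate. -->
    <emphasis level="reduced">This is a side remark.</emphasis>
</speak>
```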

2.14. break Element

Interpreted as defined by the SSML standards.

Detailed interpretation of distinct strength attribute values:

  • none — Render adjacent words as if there was no punctuation in between them.

  • x-weak — Same as none.

  • weak — Same as medium.

  • medium — Treat adjacent words as if separated by a single comma.

  • strong — Make a sentence break.

  • x-strong — Make a paragraph break.

  • Other values or missing attribute — same as medium.

Maximum break time is 120 seconds.
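A brief sketch combining the strength and time attributes. The sentences and the chosen durations are illustrative only.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- medium: the words are treated as if separated by a comma. -->
    Take a deep breath <break strength="medium"/> and relax.

    <!-- x-strong: a paragraph break. -->
    That was the first topic. <break strength="x-strong"/>
    Here is the second one.

    <!-- An explicit pause; values above 120 seconds exceed
         the documented maximum. -->
    Please wait. <break time="2s"/> Thank you.
</speak>
```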

2.15. prosody Element

2.15.1. pitch Attribute

IVONA interprets some of the formats defined by the SSML standards:

  • "default" — Reset pitch to the default value for current voice.

  • "x-low", "low", "medium", "high", "x-high" — Set voice pitch to a predefined value.

  • "+n%", "-n%" (signed percentage) — Percentage relative change. A value of "+0%" means no baseline pitch change, "+5%" gives a little higher baseline pitch, "-5%" results in a little lower baseline pitch.

  • "n%" (unsigned percentage) — Same as the signed percentage format defined above. Causes deprecation warning in SSML 1.1.

Other pitch attribute value formats defined by SSML standards are currently not supported and issue a warning.
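The supported pitch formats can be sketched as follows. The sentences are illustrative, and the audible effect depends on the voice.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- A predefined named value. -->
    <prosody pitch="high">This is spoken with a high pitch.</prosody>

    <!-- Signed percentages: a slightly higher and a slightly
         lower baseline pitch. -->
    <prosody pitch="+5%">This is a little higher.</prosody>
    <prosody pitch="-5%">This is a little lower.</prosody>

    <!-- Reset to the default pitch of the current voice. -->
    <prosody pitch="default">This uses the default pitch.</prosody>
</speak>
```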

2.15.2. contour Attribute

Ignored.

2.15.3. range Attribute

Ignored.

2.15.4. rate Attribute

Interpreted as defined by the SSML standards.

Tip
Various numeric values defined by SSML standards are quite complicated. Sometimes the values refer to current value, sometimes to a default value for the voice. Relative values may be specified as multipliers, or as absolute changes. Moreover, some numeric values have different meaning in SSML 1.0 and SSML 1.1. Therefore, it is advised to refrain from using numeric values, and use only the named values instead.

Valid formats are:

  • "default" — Reset speaking rate to default for current voice.

  • "x-slow", "slow", "medium", "fast", "x-fast" — Set speaking rate to a predefined value for current voice.

  • "n", "+n" (non-negative number) — Multiplier of the default speaking rate for the voice. A value of "1" means the default speaking rate, "2" means twice the default speaking rate, "0.5" means half the default speaking rate.

    This format is deprecated in SSML 1.1 documents.

  • "+n%", "-n%" (signed percentage) — Percentage relative change. A value of "+0%" means no change in speaking rate, "+100%" means twice the current speaking rate, "-50%" means half the current speaking rate.

    This format is deprecated in SSML 1.1 documents.

  • "n%" (unsigned percentage) — Interpretation depends on SSML version:

    • SSML 1.0: Percentage relative change, same as signed percentage format defined above.

    • SSML 1.1: Multiplier of the default speaking rate for the voice. A value of "100%" means the default speaking rate, "200%" means twice the default, "50%" means half the default speaking rate.
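To make the version dependence of the unsigned percentage format concrete, the sketch below uses rate="150%" in an SSML 1.1 document. The arithmetic in the comments follows directly from the definitions above. The sentences are illustrative.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- Named values mean the same in both SSML versions. -->
    <prosody rate="slow">This is spoken slowly.</prosody>

    <!-- SSML 1.1: "150%" is a multiplier of the default rate,
         that is 1.5 times the default speaking rate.
         With version="1.0" the same value would be read as "+150%",
         a relative change: the current rate plus 150% of it,
         that is 2.5 times the current speaking rate. -->
    <prosody rate="150%">This is spoken faster.</prosody>
</speak>
```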

2.15.5. duration Attribute

Ignored.

2.15.6. volume Attribute

Interpreted as defined by the SSML standards.

Valid formats are:

  • "default" — Reset volume to default for current voice.

  • "silent", "x-soft", "soft", "medium", "loud", "x-loud" — Set volume to a predefined value for current voice.

  • "+ndB", "-ndB" — Change relative to current volume level. A value of "+0dB" means no change of volume, "+6dB" means approximately twice the current amplitude, "-6dB" means approximately half the current amplitude.

  • "n" (unsigned number) — Percentage multiplier of the default amplitude. A value of "100" means the default volume, "200" means twice the default volume, "50" means half the default volume.
    This format is deprecated in SSML 1.1 documents.

  • "+n", "-n" (signed number) — Relative change as percentage of the default volume for the voice. A value of "+0" means no change of volume, "+100" means increase the current amplitude by 100% of the default, "-50" means decrease the current amplitude by 50% of the default amplitude.
    This format is deprecated in SSML 1.1 documents.

  • "n%", "+n%", "-n%" (percentage) — Percentage relative change. A value of "0%" means no change in volume, "100%" means twice the current volume, "-50%" means half the current volume.
    This format is deprecated in SSML 1.1 documents.

Each voice has a maximum volume level. When a volume attribute setting would exceed this threshold, the maximum volume is used instead.
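A short sketch of the volume formats that are not deprecated in SSML 1.1. The sentences are illustrative only.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- A predefined named value. -->
    <prosody volume="soft">This is spoken softly.</prosody>

    <!-- +6dB: approximately twice the current amplitude. -->
    <prosody volume="+6dB">This is louder.</prosody>

    <!-- -6dB: approximately half the current amplitude. -->
    <prosody volume="-6dB">This is quieter.</prosody>

    <!-- Reset to the default volume of the current voice. -->
    <prosody volume="default">This uses the default volume.</prosody>
</speak>
```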

2.16. audio Element

Interpreted as defined by the SSML standards.

Attributes from the Extended profile (clipBegin, clipEnd, repeatCount, repeatDur, soundLevel and speed) are also supported.

The supported audio file format is WAV (RIFF header) 16-bit mono PCM. The audio will be resampled to match sample rate of the voice so there’s no need to employ additional resampling tools.
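A sketch of the audio element with some of the Extended profile attributes. The clip times, repeat count and sound level are arbitrary values chosen for this example, and the file is assumed to be a 16-bit mono PCM WAV file.

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
    <!-- Play the part of the file between 1 s and 4 s, twice,
         6 dB quieter than the original. -->
    <audio src="http://www.example.com/audio.wav"
           clipBegin="1s" clipEnd="4s"
           repeatCount="2" soundLevel="-6dB">
        <!-- Fallback content, as defined by the SSML standards,
             for the case that the audio cannot be played. -->
        The recording could not be played.
    </audio>
</speak>
```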

3. Handling of Invalid Input

In general, input documents should fully conform to one of the two supported standards. However, for ease of use, some deviations are allowed. Whenever IVONA 2 TTS decides to accept such invalid input, a warning is issued.

XML parse errors

Non-well-formed XML documents are rendered only up to the first fatal XML parsing error. A warning is issued at the place of the error and all following input is discarded.

Structure

The whole document should be placed within a single root document element. For that, any element is accepted.

The only case in which an SSML document is rejected and parsed as plain text is when ordinary text comes before the start of the first XML element.

Case of element names

SSML element names are matched case-insensitively. However, the case of paired starting and ending tags must match.

XML namespaces

Elements without a declared XML namespace are treated as belonging to the SSML XML namespace (http://www.w3.org/2001/10/synthesis).

Invalid elements

Text within invalid and foreign elements is read aloud. A warning is issued for the topmost invalid element.

Invalid attributes

Attributes with invalid names are silently ignored.

SSML schema errors

For each SSML schema error a warning is issued.

When a nesting error occurs (for example a <p> within an <s>), IVONA 2 TTS does its best to render both elements. Refer to the detailed element handling information.

Element names from draft versions of the SSML standard

Some element names which existed in draft SSML specifications are silently accepted, such as <paragraph> for <p> and <sentence> for <s>.
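The deviations described in this section can be illustrated with one deliberately non-conforming document. This is a sketch only: the element name <custom> is made up to stand for a foreign element, and the exact warnings issued are not shown.

```xml
<!-- No XML namespace is declared, so the elements are treated
     as belonging to the SSML namespace. -->
<Speak>
    <!-- Draft element names accepted in place of <p> and <s>. -->
    <paragraph>
        <sentence>This sentence uses draft element names.</sentence>
    </paragraph>

    <!-- Element names are matched case-insensitively, but the start
         and end tags of one element must use the same case. -->
    <P>
        <S>This sentence uses upper-case element names.</S>
    </P>

    <!-- Text within an invalid or foreign element is still read. -->
    <custom>This text is read aloud as well.</custom>
</Speak>
```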

4. PLS Lexicons

PLS Lexicon files are loaded separately for each synthesized SSML document. The only lexicons loaded from within a document are the ones declared by a <lexicon> element and referred to within at least one <lookup>.

IVONA 2 Text-To-Speech supports PLS 1.0 lexicons referenced from SSML documents, as defined by Pronunciation Lexicon Specification (PLS) Version 1.0, W3C Recommendation 14 October 2008, http://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/.

Apart from per-document PLS lexicons, IVONA allows PLS lexicons to be loaded globally for a voice. Such lexicons apply to all documents synthesized with the given voice. Please note that a PLS lexicon is applied only to those text fragments whose language (xml:lang attribute) matches that of the PLS lexicon.

4.1. Exceptions to PLS Standard

The PLS standard is fully supported. The supported phonetic alphabets are the same as for the phoneme element.
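For orientation, a minimal sketch of what a PLS 1.0 lexicon file, such as one referenced from a <lexicon> element, might contain. The entries are made up for this example. The xml:lang value must match the language of the text the lexicon should apply to.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
    <!-- Replace an abbreviation by its expanded form. -->
    <lexeme>
        <grapheme>W3C</grapheme>
        <alias>World Wide Web Consortium</alias>
    </lexeme>

    <!-- Give an explicit pronunciation in the lexicon's alphabet. -->
    <lexeme>
        <grapheme>tomato</grapheme>
        <phoneme>təˈmeɪ.toʊ</phoneme>
    </lexeme>
</lexicon>
```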

5. IVONA 2 TTS User Lexicon

The IVONA 2 TTS user lexicon consists of regex rules which are applied separately to each text node of the parsed SSML document. This has the following consequences:

  • It is not possible to modify SSML markup using dictionary rules.

  • Dictionary rules cannot match parts of text from different text nodes. For example, a rule

  "([[:digit:]]) mph" "\1 miles per hour"

will never match in SSML document

  <speak>12<!-- some comment --> mph</speak>

because it will only be applied to texts "12" and " mph" independently.
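The same reasoning applies to any markup that splits the text, not only to comments. The sketch below contrasts a sentence in which the rule quoted above can match with one in which it cannot. The sentences are made up for illustration.

```xml
<speak>
    <!-- One text node, "He drove at 12 mph.": the rule finds
         "2 mph" and the text becomes "He drove at 12 miles per hour." -->
    <s>He drove at 12 mph.</s>

    <!-- Three text nodes, "He drove at 12 ", "mph" and ".":
         no single node contains a digit followed by " mph",
         so the rule never matches. -->
    <s>He drove at 12 <emphasis>mph</emphasis>.</s>
</speak>
```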

The regular expression rules of the IVONA 2 TTS user lexicon are applied to text after the PLS rules have been applied.

6. Examples

    
      <speak xml:base="file://C:\audios\input.ssml">

            <!-- references the file "C:\audios\general.pls" -->
            <lexicon uri="general.pls" xml:id="general"/>

            <lookup ref="general">
                    <s>Mr. Jones was very happy back in <say-as interpret-as="date"
                    format="y">1997</say-as>.</s>
            </lookup>

            <audio src="file://C:\tunes\final.wav"/>
            <audio src="http://www.example.com/audio.wav"/>
      </speak>
    
  
    
      <speak>
          The American word for <phoneme alphabet="ipa" ph="təˈmɑː.təʊ"/>
          is <phoneme alphabet="x-sampa" ph='t@"meI.toU'/>.

          The parcel will arrive at 14:30 on
          <say-as interpret-as="date" format="dmy">5/10/2011</say-as>.

          U.S. population density is <say-as interpret-as="unit">33.7/km2</say-as>.

          Our address is <say-as interpret-as="address">283 N. 7th St., Suite 201A,
          Greenville, GA 65302</say-as>.

          <say-as interpret-as="unit">1pc</say-as> equals <say-as
          interpret-as="unit">3.26156ly</say-as>.
      </speak>
    
  
    
      <speak>
          <p>
              <s><prosody rate="slow">IVONA</prosody> means highest quality speech
              synthesis in various languages.</s>
              <s>It offers both male and female radio quality voices <break/> at a
              sampling rate of 22 kHz <break/> which makes the IVONA voices a
              perfect tool for professional use or individual needs.</s>
          </p>
      </speak>
    
  
    
      <speak>
          The word <say-as interpret-as="characters">read</say-as>
          may be interpreted as either
          the present simple form <w role="ivona:VB">read</w>,
          or the past participle form <w role="ivona:VBD">read</w>.

          The word <say-as interpret-as="characters">object</say-as>
          may be interpreted as either
          the verb <w role="ivona:VB">object</w>,
          or the noun <w role="ivona:NN">object</w>.

          <w role="ivona:SENSE_1">Lead</w> is very heavy.

          In most cases the pronunciation of an ambiguous word is chosen
          correctly and doesn't need to be explicitly marked.
      </speak>
        
      
 
Copyright © 2015 IVONA Software. All rights reserved.