Speech Cloud Documentation

Speech metadata (Speech marks)

Speech marks are metadata describing the speech. There are four different types of speech marks:

  • Sentence - Describes a sentence to be spoken.

  • SSML - Describes a <mark> element from the SSML input to be spoken.

  • Viseme - Describes a viseme corresponding to a phoneme to be spoken.

  • Word - Describes a word to be spoken.

Speech marks can be obtained through the CreateSpeech method in one of two available formats:

  • JSON-encoded subtitles in an MP4 container

  • A simple plain text format

The following sections describe those two options in detail.

Speech marks as subtitle

Speech marks are encoded as subtitles in a container. Currently the only supported container is fragmented MP4 (fMP4, based on ISO/IEC 14496-12:2012). The container has two streams: the first is the audio stream with the speech, and the second contains the speech marks encoded as MPEG-4 Timed Text (ISO/IEC 14496-17:2006). Each timed text frame contains a JSON-encoded array of speech mark objects. Do not use the presentation timestamp of the timed text frame: it does not reflect the point in time to which the speech marks refer, and it is subject to change. Use the time field of each speech mark object instead.

A speech mark object contains the following fields:

  • time: timestamp, in milliseconds, of the point in the output audio stream to which the speech mark corresponds.

    • type: integer

  • type: type of speech mark.

    • type: string

  • start: 0-based byte offset in the input text at which the marked text begins (except for viseme marks).

    • type: integer

  • end: 0-based byte offset in the input text at which the marked text ends (except for viseme marks).

    • type: integer

  • value:

    • type: string

    • varies depending on the type of speech mark.

      • SSML: SSML mark.

      • Viseme: viseme name; more information available here

      • else: substring of the input text. The substring begins at byte start and extends up to, but does not include, byte end, so its length in bytes is end minus start.

Consider the following speech mark object from the input text "Ah! Today will be a great day.":

{"time":483,"type":"sentence","start":4,"end":30,"value":"Today will be a great day."}

It states that the sentence "Today will be a great day." will be spoken at 483 milliseconds, and that the sentence starts at byte 4 and ends at byte 30 (0-based) of the input text.
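The example above can be checked with a short Python sketch. It assumes that the input text is UTF-8 encoded (the encoding is not stated here) and that end is exclusive, which is what the example values imply. The helper name mark_text and the variable frame_payload are hypothetical; extracting the timed text frame from the MP4 container is not shown.

```python
import json


def mark_text(input_text: str, mark: dict) -> str:
    """Return the part of the input text that a speech mark refers to.

    Assumption: offsets are byte offsets into the UTF-8 encoded input,
    and `end` is exclusive.
    """
    raw = input_text.encode("utf-8")
    return raw[mark["start"]:mark["end"]].decode("utf-8")


input_text = "Ah! Today will be a great day."

# Each timed text frame carries a JSON-encoded array of speech mark objects.
frame_payload = (
    '[{"time":483,"type":"sentence","start":4,"end":30,'
    '"value":"Today will be a great day."}]'
)

for mark in json.loads(frame_payload):
    seconds = mark["time"] / 1000.0  # time is given in milliseconds
    print(seconds, mark["type"], mark_text(input_text, mark))
```

Slicing the encoded bytes rather than the string keeps the offsets correct when the input text contains multi-byte characters.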

To request speech marks in this format, call the CreateSpeech method with OutputFormat.Codec set to MP4 and at least one speech mark type enabled in OutputFormat.SpeechMarks. For more information, please refer to the IVONA Speech Cloud API Reference.

An example use case for this option is to call the CreateSpeech method once, extract the speech marks from the container, and then use them together with the audio stream from the same container.

Speech marks as a request

Speech marks are returned as plain text in a space-separated values format. Each line in the output has the following format:

<time> <type> <start> <end> <value>\n

where each field is the same as in the JSON format (please see Speech marks as subtitle), except <time>, which is in seconds rather than milliseconds.

Note that viseme speech marks also return <start> and <end> values, but these values do not have any meaning in that case.

Consider the following speech mark output from the input text "Ah! Today will be a great day.":

0.483 sentence 4 30 Today will be a great day.

It states that the sentence "Today will be a great day." will be spoken at 0.483 seconds, and that the sentence starts at byte 4 and ends at byte 30 (0-based) of the input text.
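A line in this format could be parsed along the following lines. This is a minimal Python sketch, not part of the service's API: the names SpeechMark and parse_line are hypothetical. Because <value> may itself contain spaces, only the first four separators are used for splitting.

```python
from typing import NamedTuple


class SpeechMark(NamedTuple):
    time_s: float  # seconds in the plain text format
    kind: str      # the <type> field
    start: int
    end: int
    value: str


def parse_line(line: str) -> SpeechMark:
    """Parse one '<time> <type> <start> <end> <value>' line."""
    # Split on the first four spaces only, so spaces inside <value> survive.
    parts = line.rstrip("\r\n").split(" ", 4)
    if len(parts) < 4:
        raise ValueError(f"malformed speech mark line: {line!r}")
    value = parts[4] if len(parts) == 5 else ""
    return SpeechMark(float(parts[0]), parts[1], int(parts[2]), int(parts[3]), value)


mark = parse_line("0.483 sentence 4 30 Today will be a great day.\n")
print(mark.kind, mark.time_s, mark.value)
```

Multiplying time_s by 1000 and rounding gives the millisecond value used by the JSON format.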

To request speech marks in this format, call the CreateSpeech method with OutputFormat.Codec set to SPEECHMARK and at least one speech mark type enabled in OutputFormat.SpeechMarks. Note that no audio will be sent back. For more information, please refer to the IVONA Speech Cloud API Reference.

An example use case for this option is to call the CreateSpeech method twice, once for the audio stream and once for the speech marks. Because the two requests are independent, the client has to synchronize the speech marks with the audio itself before using them together. On the other hand, this format is easier to process on platforms that do not offer access to the individual streams in an MP4 container, such as web browsers.
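One possible way to synchronize the two responses on the client is to look up the most recent speech mark for the current playback position. The Python sketch below assumes the marks are already parsed into (time in seconds, type, start, end, value) tuples and sorted by time; the function name active_mark and the sample word marks are illustrative, not real service output.

```python
import bisect


def active_mark(marks, position_s):
    """Return the latest mark whose time is <= position_s, or None.

    Assumption: `marks` is sorted by its first element (time in seconds).
    """
    times = [m[0] for m in marks]
    i = bisect.bisect_right(times, position_s)
    return marks[i - 1] if i else None


# Illustrative values only.
marks = [
    (0.0, "word", 0, 2, "Ah"),
    (0.483, "word", 4, 9, "Today"),
]

# The position would come from the audio player; a constant is used here.
current = active_mark(marks, 0.5)
if current is not None:
    print(current[4])
```

A binary search keeps each lookup cheap even when it runs on every playback tick; for long texts, the list of times could be built once and reused.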

 
Copyright © 2015 IVONA Software. All rights reserved.