6. Text highlighting

Download source code

Introduction

Highlighting on screen the part of a text that is currently being read aloud is a popular feature. IVONA SpeechCloud supports this functionality as well; it is known as text highlighting or text marking. In this tutorial step we will show how to implement text highlighting using IVONA SpeechCloud.

Available speech marks

First, let's briefly go through the types of marks available from the IVONA SDK. There are four types of speech marks:

Sentence

This mark represents the start of the next sentence in the synthesized text. There are three important parameters for this mark:

  • Sample offset refers to the offset of the sentence in the synthesized audio file.
  • Start text offset represents the first character of the sentence.
  • End text offset represents the character after the last character of the sentence.

Word

This mark represents the start of the next word in the synthesized text. There are three important parameters for this mark (the sketch after this list shows how the text offsets can be used for highlighting):

  • Sample offset refers to the offset of the word in the synthesized audio file.
  • Start text offset represents the first character of the word.
  • End text offset represents the character after the last character of the word.
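
The text offsets are what make highlighting possible: whenever a word mark becomes active during playback, the corresponding fragment of the original text can be wrapped in a highlighting element. Below is a minimal PHP sketch of that idea; the highlightWord() helper is hypothetical and it assumes the offsets are character positions in the submitted text.

// Hypothetical helper: wrap the word between $start and $end (exclusive)
// in a <span> so it can be styled as the currently spoken word.
// Assumes the offsets are character positions in the UTF-8 text.
function highlightWord($text, $start, $end) {
	$len    = mb_strlen($text, 'UTF-8');
	$before = mb_substr($text, 0, $start, 'UTF-8');
	$word   = mb_substr($text, $start, $end - $start, 'UTF-8');
	$after  = mb_substr($text, $end, $len - $end, 'UTF-8');
	return $before . '<span class="current-word">' . $word . '</span>' . $after;
}
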
Viseme

This mark differs from the two previous ones as it represents a specific type of information. A viseme is a generic facial image that can be used to describe a particular sound. There are two important parameters for this mark:

  • Sample offset refers to the offset of the viseme in the synthesized audio file.
  • Name of the symbol represents the viseme spoken at that moment, according to the X-SAMPA standard.

The X-SAMPA representation differs between languages. For more information about X-SAMPA visemes please refer to Phonetic Alphabet Support.

SSML

It is also possible to insert custom SSML marks into the text and have those marks returned through the text marking mechanism. There are four important parameters for this mark:

  • Sample offset refers to the offset of the mark in the synthesized audio file.
  • Start text offset represents the first character of the SSML mark element in the text.
  • End text offset represents the character after the last character of the SSML mark element.
  • Name represents the name of the SSML mark inserted into the text.

Requesting synthesis with marks

In order to handle synthesis with marks, we have added a new function:

/**
 * Function synthesizing a text with marks
 * Returned variable is an array containing the URL of the sound file and the URL of the text file with marks
 *
 * @param string $text UTF-8 encoded string to synthesize
 * @param string $text_type Content type of the text (e.g. 'text/ssml')
 * @return array URL addresses for the sound file and the text file with marks synthesized from the $text parameter, or false in case of error
 */
public function synthesizeWithMarks($text, $text_type) {
	// the configuration (could be moved to constants section of the website)
	// wsdl URL
	$wsdl = 'http://api.ivona.com/saasapiwsdl.xml';
 
	// soap client initialization (it requires soap client php extension available)
	$Binding = new SoapClient($wsdl,array('exceptions' => 0));
 
	// getToken for the next operation
	$input = array('user' => USER);
	$token = $Binding->__soapCall('getToken', $input);
	if (is_soap_fault($token)) {
		error_log('API call: getToken error: '.print_r($token,1));
		return false;
	}
 
	// additional parameters
	$params = array();
	//$params[]=array('key'=>'Prosody-Rate', 'value'=>PROSODY_RATE); // example value for the new text speed
 
	// createSpeechFileWithMarks (store text in IVONA.com system, invoke synthesis and get the link for the speech file and for the text file with marks)
	$input = array('token' => $token,
			'md5' => md5(md5(API_KEY).$token),
			'text' => $text,		
			'contentType' => $text_type,
			'voiceId' => SELECTED_VOICE,
			'codecId' => 'mp3/22050',
			'params' => $params,
		      );
	$fileData = $Binding->__soapCall('createSpeechFileWithMarks',$input);
	if (is_soap_fault($fileData)) {
		error_log('API call: createSpeechFileWithMarks error: '.print_r($fileData,1));
		return false;
	}
 
	// return the URLs
	return array('soundUrl'=>$fileData['soundUrl'], 'marksUrl'=>$fileData['marksUrl']);
}

We just need to call this function and retrieve the files located at the two returned URLs.
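
For example, assuming the client class built up in the previous tutorial steps (with the USER, API_KEY and SELECTED_VOICE constants defined and the class instantiated as $client), the call and the download of both files could look like the sketch below; the 'text/plain' content type and the output file names are assumptions made for the example.

// Minimal usage sketch (assumes $client is an instance of the tutorial's client class).
$urls = $client->synthesizeWithMarks('Here is my text.', 'text/plain');
if ($urls !== false) {
	// Both files can be fetched as soon as the URLs are returned.
	$audio = file_get_contents($urls['soundUrl']);  // mp3 audio stream
	$marks = file_get_contents($urls['marksUrl']);  // plain-text speech marks
	file_put_contents('speech.mp3', $audio);
	file_put_contents('speech.marks.txt', $marks);
}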

Check result

Let's start the program with an additional request for speech marks:

./ivonaAPIClient_6.php --text_with_marks '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd" xml:lang="en-US"><p>Here is my text. And here the <mark name="my_mark"/>additional ssml mark.</p></speak>' "text/ssml"

Text returned after requesting the URL of the speech marks file:

0.000 sentence 250 266 
0.061 word 250 254 
0.061 viseme 0 0 k
0.156 viseme 0 0 i
0.248 viseme 0 0 r
0.327 word 255 257 
0.327 viseme 0 0 i
0.375 viseme 0 0 s
0.477 word 258 260 
0.477 viseme 0 0 p
0.559 viseme 0 0 a
0.660 word 261 265 
0.660 viseme 0 0 t
0.786 viseme 0 0 E
0.933 viseme 0 0 k
1.024 viseme 0 0 s
1.105 viseme 0 0 t
1.155 viseme 0 0 sil
1.220 sentence 267 323 
1.587 word 267 270 
1.587 viseme 0 0 @
1.652 viseme 0 0 t
1.713 viseme 0 0 t
1.763 word 271 275 
1.763 viseme 0 0 k
1.868 viseme 0 0 i
2.016 viseme 0 0 r
2.078 word 276 279 
2.078 viseme 0 0 T
2.153 viseme 0 0 i
2.254 ssml 280 302 my_mark
2.254 word 302 312 
2.254 viseme 0 0 @
2.293 viseme 0 0 t
2.385 viseme 0 0 i
2.435 viseme 0 0 S
2.552 viseme 0 0 t
2.621 viseme 0 0 t
2.739 word 313 317 
2.739 viseme 0 0 E
2.823 viseme 0 0 s
2.973 viseme 0 0 E
3.073 viseme 0 0 s
3.200 viseme 0 0 E
3.249 viseme 0 0 p
3.362 viseme 0 0 E
3.444 viseme 0 0 t
3.527 word 318 322 
3.527 viseme 0 0 p
3.603 viseme 0 0 a
3.722 viseme 0 0 r
3.805 viseme 0 0 k
3.876 viseme 0 0 sil
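
Each line of the marks file follows the same pattern: an offset in the audio (the values above appear to be seconds), the mark type, the start and end text offsets, and an optional last field holding the viseme symbol or the SSML mark name. A simple parser based on that observation could look like the following sketch; the parseMarks() function is not part of the tutorial code, just an illustration.

// Parse the marks file into an array of marks, based on the line format
// visible in the example output above: time type start end [name].
function parseMarks($marksText) {
	$marks = array();
	foreach (preg_split('/\r?\n/', trim($marksText)) as $line) {
		$fields = preg_split('/\s+/', trim($line));
		if (count($fields) < 4) {
			continue; // skip empty or malformed lines
		}
		$marks[] = array(
			'time'  => (float) $fields[0],  // offset in the audio
			'type'  => $fields[1],          // sentence, word, viseme or ssml
			'start' => (int) $fields[2],    // start text offset
			'end'   => (int) $fields[3],    // end text offset
			'name'  => isset($fields[4]) ? $fields[4] : null, // viseme symbol or SSML mark name
		);
	}
	return $marks;
}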

Conclusion

In the paragraphs above we showed how to request synthesis with additional speech marks data returned from IVONA SpeechCloud.

Links to the speech file and the text file with marks are returned almost immediately, so we can start streaming both files in parallel, playing the returned speech file and using the marks data to highlight the text during playback.
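
As a final sketch, combining the parsed marks with the current playback position gives the word that should be highlighted at any moment; the currentWordMark() helper below builds on the hypothetical parseMarks() function from the previous section and is likewise only an illustration.

// Return the word mark that is active at the given playback position (in the
// same time unit as the marks file), i.e. the last word mark whose offset
// is not greater than $position.
function currentWordMark($marks, $position) {
	$current = null;
	foreach ($marks as $mark) {
		if ($mark['type'] === 'word' && $mark['time'] <= $position) {
			$current = $mark;
		}
	}
	return $current;
}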

Complete example

Download source code