Text-to-Speech and SSML Support
Text-to-Speech (TTS)
The <Say> verb converts text into human-like speech in real time. All you need to do is provide the text in the Visual Designer’s Say element, and Restcomm will synthesize the speech and play back the audio. The default TTS provider is Amazon Polly. By default, a US English dialect is used with a male voice.
When using <Say> you have a choice between using male or female Google or Amazon Polly voices.
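As a minimal sketch of selecting a voice in RCML, the following assumes that the voice attribute accepts "man" as the counterpart of the "woman" value used in the examples on this page. The greeting text is purely illustrative.

```xml
<!-- Assumption: voice="man" is the male counterpart of voice="woman" -->
<Response>
  <Say voice="man" language="en">Hello, and thank you for calling.</Say>
</Response>
```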
Speech Synthesis Markup Language (SSML)
You can include Speech Synthesis Markup Language (SSML) in your Text-to-Speech request to customize the audio response, for example by specifying pauses and audio formatting for acronyms, dates, times, abbreviations, or text that should be censored.
Supported Voices and languages
For detailed information about all supported languages and voices with Amazon Polly and Google, please visit the following resources:
- Google supported voices and languages
Examples
SSML Markup and the Synthesized Text
<speak>
  This is a <say-as interpret-as="characters">SSML</say-as> example.
  I can pause <break time="3s"/>.
  I can play a sound <audio src="https://www.example.com/MY_MP3_FILE.mp3">didn't get your MP3 audio file</audio>.
  I can speak in cardinals. Your number is <say-as interpret-as="cardinal">10</say-as>.
  Or I can speak in ordinals. You are <say-as interpret-as="ordinal">10</say-as> in line.
  Or I can even speak in digits. The digits for ten are <say-as interpret-as="characters">10</say-as>.
  I can also substitute phrases, like the <sub alias="World Wide Web Consortium">W3C</sub>.
  Finally, I can speak a paragraph with two sentences.
  <p><s>This is sentence one.</s><s>This is sentence two.</s></p>
</speak>
Below is the synthesized text for the example SSML document:
This is an S S M L example. I can pause [3 second pause]. I can play a sound [audio file plays]. I can speak in cardinals. Your number is ten. Or I can speak in ordinals. You are tenth in line. Or I can even speak in digits. The digits for ten are one oh. I can also substitute phrases, like the World Wide Web Consortium. Finally, I can speak a paragraph with two sentences. This is sentence one. This is sentence two.
Google Cloud Text-to-Speech supports a subset of the available SSML tags.
For more information about how to create audio data from SSML input with Google Cloud Text-to-Speech, see Creating Voice Audio Files.
Google Cloud Support for SSML Elements
You can use various SSML elements and options for your actions. For more information check out Google Cloud Support for SSML elements.
Amazon Polly Support for SSML Elements
For more information about Amazon Polly supported SSML tags visit Amazon Polly Supported SSML Tags.
Using Speech Synthesis Markup Language (SSML) in Visual Designer
You can use SSML within a <Say> verb in Visual Designer as shown below.
- Click on the gear icon to expand the <Say> verb settings. You will notice a Language drop-down field. Select the desired language.
- Select the male or female icon next to the Language field to set a voice variation.
- Save your application.

Using Speech Synthesis Markup Language (SSML) in RCML
You can use SSML in your RCML applications to create pauses and to control audio formatting for acronyms, dates, times, abbreviations, or text that should be censored.
The <emphasis> element can be used to add or remove emphasis from text contained by the element as follows.
<Response>
  <Say voice="woman" language="en" loop="3">
    <speak>
      <emphasis level="moderate">This is an important announcement</emphasis>
    </speak>
  </Say>
</Response>
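The example above adds emphasis. To remove emphasis instead, the SSML specification also defines a "reduced" level. The sketch below assumes that the configured TTS provider honors that level, so check the provider documentation before relying on it. The spoken text is illustrative.

```xml
<!-- Assumption: the TTS provider supports level="reduced" -->
<Response>
  <Say voice="woman" language="en">
    <speak>
      Your order has shipped.
      <emphasis level="reduced">Standard delivery times apply.</emphasis>
    </speak>
  </Say>
</Response>
```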
The <break> element lets you control pausing or other prosodic boundaries between words. Using <break> between any pair of tokens is optional. If this element is not present between words, the break is automatically determined based on the linguistic context.
This element accepts two optional attributes:
- time: Sets the length of the break in seconds or milliseconds (e.g. "3s" or "250ms").
- strength: Sets the strength of the output’s prosodic break in relative terms. Valid values are: "x-weak", "weak", "medium", "strong", and "x-strong". The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break that the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. The stronger boundaries are typically accompanied by pauses.
The following example shows how to use the <break> element to pause between steps:
<Response>
  <Say voice="woman" language="en" loop="3">
    <speak>
      Step 1, take a deep breath. <break time="200ms"/>
      Step 2, exhale.
      Step 3, take a deep breath again. <break strength="weak"/>
      Step 4, exhale.
    </speak>
  </Say>
</Response>
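As a complementary sketch, the strength value "none" can be placed where the processor might otherwise insert a pause, and time can be given in whole seconds. The phrases are illustrative, and whether a given provider honors "none" is an assumption to verify.

```xml
<!-- Assumption: the TTS provider honors strength="none" -->
<Response>
  <Say voice="woman" language="en">
    <speak>
      Please hold <break strength="none"/> while we connect you.
      <break time="2s"/>
      Thank you for waiting.
    </speak>
  </Say>
</Response>
```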
The <say-as> element lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.
The <say-as> element has one required attribute, interpret-as, which determines how the value is spoken. The optional attributes format and detail may be used depending on the particular interpret-as value. The interpret-as attribute supports the following values:
cardinal
The following example is spoken as "Twelve thousand three hundred forty five" (for US English) or "Twelve thousand three hundred and forty five" (for UK English):
<Response>
  <Say voice="woman" language="en" loop="3">
    <speak>
      <say-as interpret-as="cardinal">12345</say-as>
    </speak>
  </Say>
</Response>
ordinal
The following example is spoken as "First":
<Response>
  <Say voice="woman" language="en" loop="3">
    <speak>
      <say-as interpret-as="ordinal">1</say-as>
    </speak>
  </Say>
</Response>
characters
The following example is spoken as "C A N":
<Response>
  <Say voice="woman" language="en" loop="3">
    <speak>
      <say-as interpret-as="characters">can</say-as>
    </speak>
  </Say>
</Response>
expletive or bleep
The following example comes out as a beep, as though it has been censored:
<Response>
  <Say voice="woman" language="en" loop="3">
    <speak>
      <say-as interpret-as="expletive">censor this</say-as>
    </speak>
  </Say>
</Response>
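The heading for this value lists bleep as an alternative to expletive. A sketch of the same request using that value, assuming the provider treats the two identically:

```xml
<!-- Assumption: interpret-as="bleep" behaves the same as "expletive" -->
<Response>
  <Say voice="woman" language="en">
    <speak>
      The caller said <say-as interpret-as="bleep">censor this</say-as> and hung up.
    </speak>
  </Say>
</Response>
```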
verbatim or spell-out
The following example is spelled out letter by letter:
<Response>
  <Say voice="woman" language="en" loop="3">
    <speak>
      <say-as interpret-as="verbatim">abcdefg</say-as>
    </speak>
  </Say>
</Response>
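Spelling out is often useful for codes that a caller needs to write down. The sketch below uses the spell-out value named in the heading for this entry, with a hypothetical booking reference. It assumes the provider accepts spell-out as an equivalent of verbatim.

```xml
<!-- Assumption: interpret-as="spell-out" is accepted like "verbatim" -->
<!-- "QX7T" is a made-up booking reference used only for illustration -->
<Response>
  <Say voice="woman" language="en">
    <speak>
      Your booking reference is <say-as interpret-as="spell-out">QX7T</say-as>.
    </speak>
  </Say>
</Response>
```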
date
The format attribute is a sequence of date field character codes. Supported field character codes in format are {y, m, d} for year, month, and day (of the month) respectively. If the field code appears once for year, month, or day, then the number of digits expected is 4, 2, and 2 respectively. If the field code is repeated, then the number of expected digits is the number of times the code is repeated. Fields in the date text may be separated by punctuation and/or spaces.
The detail attribute controls the spoken form of the date. For detail='1', only the day field and one of the month or year fields are required, although both may be supplied. This is the default when fewer than all three fields are given. The spoken form is "The {ordinal day} of {month}, {year}".
The following example is spoken as "The thirtieth of September, two thousand and nineteen":
<Response>
  <Say voice="woman" language="en" loop="3">
    <speak>
      <say-as interpret-as="date" format="yyyymmdd" detail="1">2019-09-30</say-as>
    </speak>
  </Say>
</Response>
The following example is spoken as "The thirtieth of September":
<Response>
  <Say voice="woman" language="en" loop="3">
    <speak>
      <say-as interpret-as="date" format="dm">30-9</say-as>
    </speak>
  </Say>
</Response>
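To illustrate the field-code rules described for the format attribute: when each code appears once, the year, month, and day are expected to have 4, 2, and 2 digits respectively, and the fields may be separated by punctuation. The sketch below applies that to a day-month-year order. The date value is illustrative, and the exact spoken result depends on the provider.

```xml
<!-- format="dmy": each code appears once, so day = 2 digits, month = 2 digits, year = 4 digits -->
<Response>
  <Say voice="woman" language="en">
    <speak>
      <say-as interpret-as="date" format="dmy">30.09.2019</say-as>
    </speak>
  </Say>
</Response>
```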
If you want to build more complex SSML scenarios, check out the Google Cloud and Amazon Polly documentation pages.