Input protocol

This page describes what requests from the client to the server look like, and which parameters clients can set.

Commands

Commands are request methods sent from the client to the server. They let the client set up parameters to prepare ASR, and request the transcription itself.

SET-PARAMS

Allows the client to define recognition parameters for the duration of the session, or until another SET-PARAMS command replaces them.

GET-PARAMS

Asks for the current session parameters.

DEFINE-GRAMMAR

Defines a grammar and its options, and names it so that it can be used later during recognition. For example, a spelling digits grammar with a length of 7 characters.

The definition is valid for the duration of the session.

Only builtin grammars are supported, not custom grammars sent as SRGS XML.

RECOGNIZE

The client asks the server to start speech recognition. Multiple active grammars are supported, but without weights.

The command can temporarily override any parameter previously set with SET-PARAMS. The recognition parameters (including the chosen grammars) define precisely how and when the recognition stops.

The success response to this command is a RECOGNITION-IN-PROGRESS event, which includes a unique request-id for the session's duration.
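The override behaviour of RECOGNIZE can be pictured as a simple merge. The sketch below is only an illustration, assuming parameters are held in a dictionary keyed by header name; the function name `effective_params` is hypothetical and not part of the protocol.

```python
def effective_params(session_params, recognize_overrides):
    """Parameters used for one RECOGNIZE; the session values are left untouched."""
    merged = dict(session_params)       # copy, so SET-PARAMS values survive
    merged.update(recognize_overrides)  # per-request values win for this request only
    return merged


# Values set earlier with SET-PARAMS (illustrative numbers).
session = {"No-Input-Timeout": 5000, "Confidence-Threshold": 0.5}
# One RECOGNIZE that asks for a stricter threshold.
one_shot = effective_params(session, {"Confidence-Threshold": 0.8})
print(one_shot)
print(session)
```

Because the merge works on a copy, the next RECOGNIZE without overrides falls back to the values set with SET-PARAMS.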

STOP

The client tells the server to stop the ongoing RECOGNIZE, if any.

For example, if the user hangs up, send this command, since there is no point in transcribing any further.

START-INPUT-TIMERS

With barge-in, the RECOGNIZE command is sent with Start-Input-Timers set to false and therefore does not start the timers. Sending this command then starts the no-input timer.

Resource headers

It is highly recommended that you set your own values with the SET-PARAMS command. Default values may change without notice and should be treated only as a fallback.

Confidence-Threshold

A float between 0 and 1. Default value: 0.5.

ASR and interpretation assign a confidence value to each result. You can change this threshold for your session depending on your needs. Raising it closer to 1 increases your certainty in the result, which is useful because noise can sometimes disturb ASR and therefore the grammar's result.

If the confidence value of the output is below this threshold, a no-match is returned.
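As a minimal sketch of the rule above, the following function applies the threshold to a confidence value. The function name and the returned labels are hypothetical; only the comparison itself ("below the threshold gives a no-match") comes from this page.

```python
def apply_confidence_threshold(confidence, threshold=0.5):
    """Return "no-match" when the result confidence is below the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("Confidence-Threshold must be between 0 and 1")
    # Strictly below the threshold is rejected; an equal value is kept.
    return "no-match" if confidence < threshold else "match"


print(apply_confidence_threshold(0.49))
print(apply_confidence_threshold(0.70, threshold=0.9))
```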

No-Input-Timeout

Starts after a RECOGNIZE command if the Start-Input-Timers header is true, or after a START-INPUT-TIMERS command if that header is false. It expires when no voice has been detected.

In milliseconds, an integer greater than or equal to 0. Default value: 5300.

When recognition is started and no speech is detected for a certain period of time, the recognizer can send a Recognition Complete event to the client with a Completion-Cause of "no-input-timeout" and terminate the recognition operation.

This header allows the client to set this timeout.

Recognition-Timeout

The user is still talking, but nothing matching the grammar has been found for a while.

In milliseconds, an integer greater than or equal to 0. Default value: 30000.

When recognition is started and there is no match for a certain period of time, the recognizer can send a Recognition Complete event to the client and terminate the recognition operation.

This header allows the client to set this timeout.

This timeout is useful to avoid keeping recognition going when the end user is talking to someone else, or saying things outside the expected scope of the use case.

Start-Input-Timers

Boolean. Default value: true.

With this header set to false, recognition can be started without starting the no-input timer yet. A START-INPUT-TIMERS command is then needed to start the timer.

This header, together with the START-INPUT-TIMERS command, is useful to implement barge-in.
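The interaction between Start-Input-Timers and No-Input-Timeout can be sketched as a small calculation of when the no-input timer would expire. This is an illustration only: the function name and the use of millisecond timestamps are assumptions, and the 5300 default is the one given on this page.

```python
from typing import Optional


def no_input_deadline_ms(
    recognize_at_ms: int,
    no_input_timeout_ms: int = 5300,
    start_input_timers: bool = True,
    start_timers_at_ms: Optional[int] = None,
) -> Optional[int]:
    """Time at which the no-input timer expires, or None if it is not running."""
    if start_input_timers:
        # The timer starts together with RECOGNIZE.
        return recognize_at_ms + no_input_timeout_ms
    if start_timers_at_ms is None:
        # Start-Input-Timers was false and START-INPUT-TIMERS has not been sent yet.
        return None
    # The timer starts when START-INPUT-TIMERS is received.
    return start_timers_at_ms + no_input_timeout_ms


print(no_input_deadline_ms(0))
print(no_input_deadline_ms(0, start_input_timers=False))
print(no_input_deadline_ms(0, start_input_timers=False, start_timers_at_ms=4000))
```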

Speech-Complete-Timeout

VAD detects that the user has stopped talking, and something matching the grammar was found.

In milliseconds, an integer greater than or equal to 0. Default value: 1000.

The length of silence required following user speech before the recognizer finalizes a result. This value applies when the result is a complete match against an active grammar.

If the result ends up being incomplete, Speech-Incomplete-Timeout will apply.

Speech-Incomplete-Timeout

VAD detects that the user has stopped talking, and something partially matching the grammar was found.

In milliseconds, an integer greater than or equal to 0. Default value: 2000.

The length of silence required following user speech before the recognizer finalizes a result. This value applies when the result is a partial match against all active grammars.

If the result ends up being complete, Speech-Complete-Timeout will apply.

For more details, see RFC 6787.

Make sure the value of this parameter is greater than that of Speech-Complete-Timeout.

Content-Type

A string. As only builtin grammars are supported, its value must always be text/uri-list.

The list of grammars is then passed in the body.

Only applies to the DEFINE-GRAMMAR and RECOGNIZE commands.
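As an illustration, a text/uri-list body is a list of URIs, one per line, each terminated by CRLF. The helper below is a sketch; its name is hypothetical, and `session:zip_code` is a made-up grammar reference used only for the example.

```python
def uri_list_body(grammar_uris):
    """Build a text/uri-list body: one URI per line, CRLF-terminated."""
    return "".join(uri + "\r\n" for uri in grammar_uris)


body = uri_list_body(["session:account_number", "session:zip_code"])
print(repr(body))
```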

Content-ID

A string.

Sets an ID for the defined grammar, with which it can be referenced for the duration of the session.

For example, a grammar can be defined with the ID account_number using a spelling digits builtin of length 8, because the IVR scenario asks users for their account or contract number. The Content-ID is then available for the duration of the session under the session: namespace, that is, session:account_number.

Only applies to the DEFINE-GRAMMAR command.
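The naming scheme can be pictured with a small client-side registry. This is a sketch under assumptions: the class, its methods and the dictionary used to describe the builtin are all hypothetical; only the `session:` prefix and the `account_number` example come from this page.

```python
class GrammarRegistry:
    """Keeps track of the grammars defined during one session."""

    def __init__(self):
        self._grammars = {}

    def define(self, content_id, definition):
        """Remember a grammar and return the URI it can be referenced with."""
        self._grammars[content_id] = definition
        return "session:" + content_id

    def resolve(self, uri):
        """Look up a grammar from its session: URI."""
        scheme, _, name = uri.partition(":")
        if scheme != "session" or name not in self._grammars:
            raise KeyError(uri)
        return self._grammars[name]


registry = GrammarRegistry()
# Hypothetical description of a spelling digits builtin of length 8.
uri = registry.define("account_number", {"builtin": "spelling digits", "length": 8})
print(uri)
print(registry.resolve(uri))
```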

Sensitivity-Level

Ignored since version 1.6.1.

Used to filter background noise. Following experiments and measurements, this is now handled entirely on the server side and is transparent to clients.

Speech-Language

A String following RFC 5646.

Defines the language to be used for transcription.

For the moment we support:

  • fr, fr-FR
  • en, en-US, en-GB
  • es (beta)
  • Some business domain extensions, such as banking, telecom, insurance, travel... Reach out to your account manager for more information.
  • Custom language models built with our LM Factory, that best meet your needs and use cases. Reach out to your account manager for more information.

Hotword-Min-Duration

The minimum time the user must talk for a match to be considered in hotword mode.

In milliseconds, an integer above 0. Default is undefined.

The minimum duration of an utterance that will be considered for hotword recognition.

For example, if you're expecting the user to give their phone number as a spelling of digits, it is reasonable to expect them to take more than one or two seconds to do so. Setting this duration tunes result interpretation for better performance.

Only valid for the RECOGNIZE and SET-PARAMS commands. It only applies in hotword mode.

Hotword-Max-Duration

The maximum time the user may talk for a match to be considered in hotword mode.

In milliseconds, an integer above 0. Default is undefined.

The maximum duration of an utterance that will be considered for hotword recognition.

For example, if you're expecting the user to give a keyword after offering them 3 different choices, it is reasonable to expect them to take less than five or six seconds to do so. Setting this duration tunes result interpretation for better performance.

Only valid for the RECOGNIZE and SET-PARAMS commands. It only applies in hotword mode.
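Taken together, the two hotword durations define a window of acceptable utterance lengths. The check below is a sketch, not the server's implementation: the function name is hypothetical, and treating both bounds as inclusive is an assumption.

```python
def in_hotword_window(duration_ms, min_ms=None, max_ms=None):
    """True if an utterance length is eligible for hotword recognition.

    None means the corresponding header is undefined (no bound).
    """
    if min_ms is not None and duration_ms < min_ms:
        return False  # too short, e.g. a cough instead of a phone number
    if max_ms is not None and duration_ms > max_ms:
        return False  # too long, e.g. a side conversation instead of a keyword
    return True


print(in_hotword_window(800, min_ms=2000))
print(in_hotword_window(3500, min_ms=2000, max_ms=6000))
```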

Recognition-Mode

A string, either normal or hotword. Default is normal.

Only valid for the RECOGNIZE command.

More information about both normal mode and hotword mode can be found below.
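Several headers in this section are only valid for certain commands. A client could check this before sending a request, as in the sketch below. The table only lists the restrictions stated on this page; the function name and the dictionary representation of headers are assumptions.

```python
# Header name -> commands for which this page says the header is valid.
ALLOWED_COMMANDS = {
    "Content-Type": {"DEFINE-GRAMMAR", "RECOGNIZE"},
    "Content-ID": {"DEFINE-GRAMMAR"},
    "Hotword-Min-Duration": {"RECOGNIZE", "SET-PARAMS"},
    "Hotword-Max-Duration": {"RECOGNIZE", "SET-PARAMS"},
    "Recognition-Mode": {"RECOGNIZE"},
}


def misplaced_headers(command, headers):
    """Return the headers that this page restricts to other commands."""
    return sorted(
        name
        for name in headers
        if name in ALLOWED_COMMANDS and command not in ALLOWED_COMMANDS[name]
    )


print(misplaced_headers("SET-PARAMS", {"Recognition-Mode": "hotword",
                                       "Hotword-Max-Duration": 6000}))
```

Headers without a stated restriction, such as Confidence-Threshold, are never reported by this check.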

Extended resource headers

We provide some extended parameters that can be specified in the Vendor-Specific-Parameters MRCPv2 header.

Speech-Nomatch-Timeout

VAD detects that the user has stopped talking, but nothing matching the grammar was found.

In milliseconds, an integer above 0. Default is 3000.

This parameter specifies the length of silence required after speech to declare end-of-speech when none of the grammars matched what the user said.

Make sure the value of this parameter is greater than that of Speech-Incomplete-Timeout.
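The two ordering recommendations (Speech-Complete-Timeout below Speech-Incomplete-Timeout, itself below Speech-Nomatch-Timeout) can be checked on the client before sending SET-PARAMS. The helper below is a sketch with a hypothetical name; the default values are the ones documented on this page.

```python
def check_silence_timeouts(complete_ms=1000, incomplete_ms=2000, nomatch_ms=3000):
    """Return a list of ordering problems; an empty list means the values are consistent."""
    problems = []
    if incomplete_ms <= complete_ms:
        problems.append(
            "Speech-Incomplete-Timeout should be greater than Speech-Complete-Timeout"
        )
    if nomatch_ms <= incomplete_ms:
        problems.append(
            "Speech-Nomatch-Timeout should be greater than Speech-Incomplete-Timeout"
        )
    return problems


print(check_silence_timeouts())
print(check_silence_timeouts(incomplete_ms=800))
```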

Recognize Modes

Two recognition modes are available: normal mode and hotword mode. They differ mainly in how they match grammars and how they terminate. When a client issues a RECOGNIZE request, it selects the recognition mode with the Recognition-Mode header (normal by default).

Normal mode

This is a free mode. Once a speech segment is detected, the ASR starts. It begins to transcribe what the caller says and then analyzes the content. The ASR stops automatically after a speech utterance followed by silence of a given duration, or after a timeout if that parameter has been set (see the Speech-Complete-Timeout and Speech-Incomplete-Timeout headers and the vendor-specific Speech-Nomatch-Timeout parameter, depending on the interpretation state).

There are three possible types of result in this mode:

  • Complete match: the transcribed content matches exactly what was expected.
  • Partial match: the content only partially matches what was expected (example: a zip code is expected, but only 3 digits were spoken).
  • No-match: this can have several explanations:
    • The caller's answer does not fit the expected grammar.
    • The recognition timeout was reached before the caller was able to express themselves.
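In normal mode, the silence needed to end recognition depends on the interpretation state when the caller stops talking. The sketch below maps each state to the timeout that applies; the state labels and the function name are hypothetical, while the defaults are the ones documented on this page.

```python
DEFAULTS_MS = {
    "Speech-Complete-Timeout": 1000,
    "Speech-Incomplete-Timeout": 2000,
    "Speech-Nomatch-Timeout": 3000,
}

# Interpretation state when the caller stops talking -> timeout that applies.
TIMEOUT_FOR_STATE = {
    "complete": "Speech-Complete-Timeout",
    "partial": "Speech-Incomplete-Timeout",
    "none": "Speech-Nomatch-Timeout",
}


def end_of_speech_silence_ms(state, params=DEFAULTS_MS):
    """Length of silence required before the recognizer finalizes a result."""
    return params[TIMEOUT_FOR_STATE[state]]


for state in ("complete", "partial", "none"):
    print(state, end_of_speech_silence_ms(state))
```

With the default values, the recognizer waits longer the less sure it is that the caller has finished a valid answer.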

Hotword mode

This mode can only be used with "closed" grammars.

In this mode, the ASR no longer waits for a given duration, or for the caller to stop speaking, before stopping. Instead, it stops immediately as soon as the caller's speech completes the associated grammar, that is, as soon as a complete match is obtained. This has several advantages:

  • It gives the caller more time to express themselves if they need to, without stopping the ASR at each pause for breath (example: a caller starts to speak, then pauses).
  • It improves the fluency of the dialogue, because the ASR stops automatically once the grammar has been completed.

However, in this mode, no partial match is possible. Only the following two results can occur:

  • Complete match: the caller has completed the grammar.
  • No-match: the caller has not completed the grammar, or only partially, within Recognition-Timeout or Hotword-Max-Duration (example: a zip code is expected, only 3 digits were spoken = no-match).
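The difference between the two modes' results can be summarised in a few lines. This is a sketch with hypothetical names and labels: it only encodes the rule that a partial interpretation is reported as a partial match in normal mode and as a no-match in hotword mode.

```python
def final_result(mode, interpretation):
    """Result reported to the client for a given mode and interpretation state.

    mode is "normal" or "hotword"; interpretation is "complete", "partial" or "none".
    """
    if mode not in ("normal", "hotword"):
        raise ValueError("unknown recognition mode: " + mode)
    if mode == "hotword" and interpretation == "partial":
        # Hotword mode has no partial match: an unfinished grammar is a no-match.
        return "no-match"
    return {
        "complete": "complete match",
        "partial": "partial match",
        "none": "no-match",
    }[interpretation]


# A zip code is expected, but only 3 digits were spoken.
print(final_result("normal", "partial"))
print(final_result("hotword", "partial"))
```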

To keep the scenario running smoothly, two parameters are also available:

  • Hotword-Max-Duration: maximum duration of the ASR (see the Hotword-Max-Duration header above). A result is returned even if the caller is still talking after this time.
  • Hotword-Min-Duration: minimum duration of the ASR (see the Hotword-Min-Duration header above).

User Experience and best practices

Barge-in

Barge-in means letting the end user talk over the bot. It is useful, for example, if users are expected to be able to give their answer before the bot has finished talking.

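To offer barge-in, send RECOGNIZE with the Start-Input-Timers header set to false when the prompt starts playing, then send START-INPUT-TIMERS once the prompt has finished.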

Implementing barge-in means that recognition starts when the bot starts talking, so please be cautious about the following points, as they might degrade the user experience:

  • Recognition might run for quite a long time if the end user waits for the end of the bot's prompt. This increases ASR costs (up to 3 times the cost of the same interaction without barge-in would not be surprising).
  • If end users have their phone on loudspeaker, recognition might pick up the bot talking, which disturbs the results.

Most users are not aware that the bot is listening while it talks, so they may engage in other activities (chatting, listening to the TV or radio, and so on) during this time and trigger the barge-in. Callers in busy places may also be unable to use your service because of it. In the end, this could frustrate your users, so we recommend implementing this feature in your IVR with care. In any case, if you implement it, inform users about it early.
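The order of commands for one barge-in prompt can be sketched as follows. This is an illustration under assumptions: the tuple representation (command, headers, body) and the function name are hypothetical and say nothing about the actual wire format, and sending START-INPUT-TIMERS at the end of the prompt is the usual way of combining it with Start-Input-Timers rather than something this page spells out.

```python
def barge_in_commands(grammar_uri):
    """Commands a client might send around one prompt, in order."""
    return [
        # Sent when the prompt starts playing: listen, but hold the no-input timer.
        (
            "RECOGNIZE",
            {"Content-Type": "text/uri-list", "Start-Input-Timers": False},
            grammar_uri + "\r\n",
        ),
        # Sent when the prompt has finished playing: start the no-input timer.
        ("START-INPUT-TIMERS", {}, ""),
    ]


commands = barge_in_commands("session:account_number")
for command, headers, body in commands:
    print(command, headers)
```

If the caller hangs up at any point, a STOP command ends the ongoing recognition.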