Output protocol

This page describes what responses from the server look like, and what clients can expect from the server after sending a command.

Output events

RECOGNITION-COMPLETE

An event sent by the server to the client following a RECOGNIZE command, indicating that the recognition is complete. The event's content includes the speech recognition result and its interpretation, which depends on the grammar.

Successful recognition

This is an example of a successful recognition:

MRCP/2.0 544 RECOGNITION-COMPLETE 2 COMPLETE
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Completion-Cause: 000 success
Completion-Reason: success
Content-Type: application/x-nlsml
Content-Length: 331

<?xml version="1.0" encoding="UTF-8"?>
<result version="1.25.0">
<interpretation grammar="session:demo-grammar-0" confidence="1.00">
<instance>je veux changer mon billet</instance>
<input mode="speech" timestamp-start="2020-12-22T11:11:45.620+01:00" timestamp-end="2020-12-22T11:11:47.060+01:00" confidence="1.00" asr-model="fr.basic">je veux changer mon billet</input>
</interpretation>
</result>
{
"event": "RECOGNITION-COMPLETE",
"request_id": 8,
"channel_id": "0x56432a7207d8",
"headers": {},
"completion_cause": "Success",
"completion_reason": "success",
"body": {
"asr": {
"transcript": "salut comment ça va",
"confidence": 0.842793,
"start": 1660899270646,
"end": 1660899271696
},
"nlu": {
"type": "builtin:speech/transcribe",
"value": "salut comment ça va",
"confidence": 0.842793
},
"grammar_uri": "session:transcribe",
"asr_model": "fr.basic",
"version": "1.25.0"
}
}

With MRCP, a set of headers is returned, followed by a body (much like an HTTP response). With our WebSocket protocol, a single JSON message is returned, and the equivalents of both the headers and the body are found in the same object.

Headers are described below. The body includes the raw transcription and the ASR confidence: in the input tag for MRCP, in the asr object for WebSocket. It also includes the interpretation and its own confidence: in the instance tag (with the confidence attribute carried by the enclosing interpretation tag) for MRCP, in the nlu object for WebSocket.

The confidence of the interpretation is always less than or equal to the ASR confidence.
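As an illustration of the WebSocket layout described above, here is a minimal Python sketch that pulls the transcript and the interpretation out of a RECOGNITION-COMPLETE message. The field names come from the example message; the function name `summarize_result` and the returned dictionary keys are hypothetical, chosen only for this sketch.

```python
import json


def summarize_result(raw_message):
    """Extract transcript and interpretation from a RECOGNITION-COMPLETE JSON message."""
    message = json.loads(raw_message)
    if message.get("event") != "RECOGNITION-COMPLETE":
        raise ValueError("not a RECOGNITION-COMPLETE event")
    body = message.get("body") or {}
    asr = body.get("asr") or {}  # may be null, e.g. when no speech was detected
    nlu = body.get("nlu") or {}  # may be null, e.g. when no grammar matched
    return {
        "cause": message.get("completion_cause"),
        "transcript": asr.get("transcript"),
        "asr_confidence": asr.get("confidence"),
        "interpretation": nlu.get("value"),
        "nlu_confidence": nlu.get("confidence"),
    }


# Message taken from the WebSocket example above, with some fields omitted.
example = json.dumps({
    "event": "RECOGNITION-COMPLETE",
    "completion_cause": "Success",
    "body": {
        "asr": {"transcript": "salut comment ça va", "confidence": 0.842793},
        "nlu": {"type": "builtin:speech/transcribe",
                "value": "salut comment ça va", "confidence": 0.842793},
    },
})

summary = summarize_result(example)
print(summary["transcript"], summary["asr_confidence"])
```

Treating a null asr or nlu object as an empty dictionary keeps the same code path usable for the no-match and no-input messages shown later on this page.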

If N-Best-List-Length is greater than 1, the event may contain several results: in the alternatives field with the WebSocket protocol, or as additional interpretation XML nodes with MRCP. For example:

MRCP/2.0 1465 RECOGNITION-COMPLETE 2 COMPLETE
Channel-Identifier: 3d3f47339a9444f9@speechrecog
Completion-Cause: 000 success
Completion-Reason: success
Content-Type: application/x-nlsml
Content-Length: 1250

<?xml version="1.0" encoding="UTF-8"?>
<result version="1.38.0">
<interpretation grammar="session:demo-grammar-0" confidence="0.93">
<instance>xu820856250fr</instance>
<input mode="speech" timestamp-start="2024-11-25T15:07:45.552+00:00" timestamp-end="2024-11-25T15:07:51.072+00:00" confidence="0.94" asr-model="fr.epellation.v26">x u quatre-vingt-deux zéro quatre-vingt-cinq soixante-deux cinquante f r</input>
</interpretation>
<interpretation grammar="session:demo-grammar-0" confidence="0.86">
<instance>xu820857050fr</instance>
<input mode="speech" timestamp-start="2024-11-25T15:07:45.552+00:00" timestamp-end="2024-11-25T15:07:51.072+00:00" confidence="0.88" asr-model="fr.epellation.v26">x u quatre-vingt-deux zéro quatre-vingt-cinq soixante-dix cinquante f r</input>
</interpretation>
<interpretation grammar="session:demo-grammar-0" confidence="0.85">
<instance>xu820857850fr</instance>
<input mode="speech" timestamp-start="2024-11-25T15:07:45.552+00:00" timestamp-end="2024-11-25T15:07:51.072+00:00" confidence="0.87" asr-model="fr.epellation.v26">x u quatre-vingt-deux zéro quatre-vingt-cinq soixante-dix-huit cinquante f r</input>
</interpretation>
</result>
{
"event": "RECOGNITION-COMPLETE",
"request_id": 5,
"channel_id": "627be091490a5",
"headers":
{},
"completion_cause": "Success",
"completion_reason": "success",
"body":
{
"asr":
{
"transcript": "l e cent trente-sept cent trente-sept huit cent soixante-six c n",
"confidence": 0.967846,
"start": 1732547018818,
"end": 1732547024788
},
"nlu":
{
"type": "builtin:speech/spelling/mixed",
"value": "le137137866cn",
"confidence": 0.94947165
},
"grammar_uri": "session:parcel",
"asr_model": "fr.epellation.v26",
"version": "1.38.0",
"alternatives":
[
{
"asr":
{
"transcript": "l e cent trente-sept cent trente-sept huit cent soixante-six c m",
"confidence": 0.937297,
"start": 1732547018818,
"end": 1732547024788
},
"nlu":
{
"type": "builtin:speech/spelling/mixed",
"value": "le137137866cm",
"confidence": 0.90146655
},
"grammar_uri": "session:parcel",
"asr_model": "fr.epellation.v26",
"version": "1.38.0"
},
{
"asr":
{
"transcript": "e l e cent trente-sept cent trente-sept huit cent soixante-six c n",
"confidence": 0.888096,
"start": 1732547015908,
"end": 1732547024788
},
"nlu":
{
"type": "builtin:speech/spelling/mixed",
"value": "le137137866cn",
"confidence": 0.8321435
},
"grammar_uri": "session:parcel",
"asr_model": "fr.epellation.v26",
"version": "1.38.0"
}
]
}
}
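To show how a client might walk through an N-best result over WebSocket, here is a short Python sketch that lists the top interpretation followed by the alternatives. It assumes the body layout of the example above; the helper names `ranked_interpretations` and `distinct_values` are hypothetical.

```python
def ranked_interpretations(body):
    """Return (value, confidence) pairs: the top result first, then the alternatives."""
    candidates = [body] + list(body.get("alternatives") or [])
    pairs = []
    for candidate in candidates:
        nlu = candidate.get("nlu")
        if nlu is not None:  # skip candidates that carry no interpretation
            pairs.append((nlu["value"], nlu["confidence"]))
    return pairs


def distinct_values(pairs):
    """Keep the first occurrence of each interpreted value, preserving order."""
    seen = []
    for value, _confidence in pairs:
        if value not in seen:
            seen.append(value)
    return seen


# Body reduced to the nlu objects of the WebSocket example above.
body = {
    "nlu": {"value": "le137137866cn", "confidence": 0.94947165},
    "alternatives": [
        {"nlu": {"value": "le137137866cm", "confidence": 0.90146655}},
        {"nlu": {"value": "le137137866cn", "confidence": 0.8321435}},
    ],
}

pairs = ranked_interpretations(body)
print(distinct_values(pairs))
```

De-duplicating is a design choice for this sketch: as the example shows, two different transcripts can lead to the same interpreted value.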

No match

The recognition is complete, but the input did not match any grammar, hence the no-match value of the Completion-Cause header. For example:

MRCP/2.0 658 RECOGNITION-COMPLETE 11 COMPLETE
Channel-Identifier: 1f38218a1f9d13e1@speechrecog
Completion-Cause: 001 no-match
Completion-Reason: unable to match grammar
Content-Type: application/x-nlsml
Content-Length: 427

<?xml version="1.0" encoding="UTF-8"?>
<result version="1.25.0">
<interpretation confidence="0.94">
<instance/>
<input mode="speech" timestamp-start="2022-08-19T09:13:39.310+00:00" timestamp-end="2022-08-19T09:13:40.090+00:00" confidence="0.94" asr-model="fr-basic">
<nomatch>je n'ai rien</nomatch>
</input>
</interpretation>
</result>
{
"event": "RECOGNITION-COMPLETE",
"request_id": 8,
"channel_id": "0x564765bd0cd8",
"headers": {},
"completion_cause": "NoMatch",
"completion_reason": "unable to match grammar",
"body": {
"asr": {
"transcript": "je n'ai rien",
"confidence": 0.998894,
"start": 1660918888549,
"end": 1660918889209
},
"nlu": null,
"grammar_uri": "",
"version": "1.25.0",
"asr_model": "fr.basic"
}
}

No input

If the user does not speak at all, or if the sound level is too low to trigger the voice activity detection (VAD), a no-input-timeout completion cause is returned. For example:

MRCP/2.0 365 RECOGNITION-COMPLETE 3 COMPLETE
Channel-Identifier: 8b9fca343f941dea@speechrecog
Completion-Cause: 002 no-input-timeout
Completion-Reason: no voice
Content-Type: application/x-nlsml
Content-Length: 142

<?xml version="1.0" encoding="UTF-8"?>
<result version="1.25.0">
<interpretation>
<instance/>
<input asr-model="en.basic"><noinput/></input>
</interpretation>
</result>
{
"event": "RECOGNITION-COMPLETE",
"request_id": 8,
"channel_id": "0x557de1c01d58",
"headers": {},
"completion_cause": "NoInputTimeout",
"completion_reason": "no voice",
"body": {
"asr": null,
"nlu": null,
"grammar_uri": "",
"version": "1.25.0",
"asr_model": "en.basic"
}
}
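The three kinds of MRCP bodies shown so far (match, no match, no input) can be told apart by looking at the children of the input tag. The following Python sketch does this with the standard library XML parser. It assumes the NLSML layout of the examples on this page, without XML namespaces; the function name `parse_nlsml` and the "kind" labels are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_nlsml(xml_text):
    """Return one dict per interpretation node found in an NLSML body."""
    # Parse bytes so that the UTF-8 encoding declaration is honoured.
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    results = []
    for interpretation in root.findall("interpretation"):
        instance = interpretation.find("instance")
        input_el = interpretation.find("input")
        nomatch = input_el.find("nomatch") if input_el is not None else None
        noinput = input_el.find("noinput") if input_el is not None else None
        if noinput is not None:
            kind, transcript, value = "no-input", None, None
        elif nomatch is not None:
            # The transcript is wrapped in the nomatch tag in this case.
            kind, transcript, value = "no-match", nomatch.text, None
        else:
            kind = "match"
            transcript = input_el.text if input_el is not None else None
            value = instance.text if instance is not None else None
        confidence = interpretation.get("confidence")
        results.append({
            "kind": kind,
            "transcript": transcript,
            "value": value,
            "confidence": float(confidence) if confidence is not None else None,
        })
    return results


# Bodies adapted from the examples above (timestamps omitted).
NO_MATCH = """<?xml version="1.0" encoding="UTF-8"?>
<result version="1.25.0">
<interpretation confidence="0.94">
<instance/>
<input mode="speech" confidence="0.94" asr-model="fr-basic">
<nomatch>je n'ai rien</nomatch>
</input>
</interpretation>
</result>"""

NO_INPUT = """<?xml version="1.0" encoding="UTF-8"?>
<result version="1.25.0">
<interpretation>
<instance/>
<input asr-model="en.basic"><noinput/></input>
</interpretation>
</result>"""

print(parse_nlsml(NO_MATCH)[0]["kind"], parse_nlsml(NO_INPUT)[0]["kind"])
```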

START-OF-INPUT

An event sent by the server to the client, indicating that the recognizer has detected speech. It is only sent in normal mode.

Receiving this event is useful in barge-in scenarios, where it signals that the IVR/bot prompt should be stopped.
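A barge-in handler could look like the following Python sketch. It assumes that, over WebSocket, START-OF-INPUT arrives as a JSON message whose event field holds the event name, by analogy with the RECOGNITION-COMPLETE examples above; this page does not show such a message. The `stop_prompt` callback is hypothetical and stands for whatever the client uses to interrupt its audio playback.

```python
import json


def handle_barge_in(raw_message, stop_prompt):
    """Call stop_prompt() when the server reports that speech has started.

    Returns True if the message was a START-OF-INPUT event, False otherwise.
    """
    message = json.loads(raw_message)
    # Assumption: the event name is carried in the "event" field.
    if message.get("event") == "START-OF-INPUT":
        stop_prompt()
        return True
    return False


# Usage with a stand-in callback that only records that it was called.
calls = []
handled = handle_barge_in('{"event": "START-OF-INPUT"}', lambda: calls.append("stopped"))
print(handled, calls)
```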

GET-PARAMS

Allows the client to retrieve the current session parameters from the server. The list of headers can be found on the Input documentation page.

Headers & statuses

The following sections present the most useful headers and how to interpret their values.

Completion cause

This is a header indicating the reason the recognition request completed. It is sent in DEFINE-GRAMMAR and RECOGNIZE responses.

success

If the input command was RECOGNIZE, it means that it completed with a match.

If the input command was DEFINE-GRAMMAR, it means the operation was successful.

partial-match

In response to a RECOGNIZE command. The Speech-Incomplete-Timeout expired before there was a full match, but whatever was spoken up to that point was a partial match to one or more grammars. This can only happen in normal mode.

no-match

The RECOGNIZE command completed, but no match was found.

no-input-timeout

The RECOGNIZE command completed without any speech being detected before the no-input timer expired. The timeout is set with the No-Input-Timeout header.

no-match-maxtime

In response to a RECOGNIZE command. The Recognition-Timeout expired, and whatever was spoken up to that point did not match any of the grammars. This cause can also be returned if the recognizer does not support detecting partial grammar matches.

success-maxtime

The RECOGNIZE command terminated because the speech was too long and the Recognition-Timeout timer expired, but whatever was spoken up to that point was a full match.

hotword-maxtime

The RECOGNIZE command in hotword mode completed without a match because either the Recognition-Timeout or the Hotword-Max-Duration timer expired.

partial-match-maxtime

In response to a RECOGNIZE command. The Recognition-Timeout expired before a full match was achieved, but whatever was spoken up to that point was a partial match to one or more grammars.

grammar-load-failure

In response to a DEFINE-GRAMMAR command. The grammar could not be loaded.

recognizer-error

In response to a RECOGNIZE command. Something went wrong on the server's side.

language-unsupported

In response to a RECOGNIZE command. The value of the Speech-Language input header is invalid.
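The completion causes listed above can be handled uniformly on the client side, whichever protocol is used. The Python sketch below parses the MRCP header value (for example "002 no-input-timeout") and converts the WebSocket value (for example "NoInputTimeout") to the same hyphenated name. The examples on this page only show Success, NoMatch and NoInputTimeout over WebSocket, so applying the same naming rule to the other causes is an assumption. The grouping into outcomes is also only one possible choice, and all function names are hypothetical.

```python
import re


def parse_mrcp_cause(header_value):
    """Split an MRCP Completion-Cause value such as '001 no-match' into (code, name)."""
    code, _, name = header_value.strip().partition(" ")
    return int(code), name.strip()


def normalize_ws_cause(value):
    """Turn a WebSocket completion_cause such as 'NoMatch' into 'no-match'."""
    # Insert a hyphen before every capital letter except the first one.
    return re.sub(r"(?<!^)(?=[A-Z])", "-", value).lower()


# Assumed grouping of the causes documented above into client-side outcomes.
OUTCOMES = {
    "match": {"success", "success-maxtime"},
    "partial": {"partial-match", "partial-match-maxtime"},
    "retry": {"no-match", "no-input-timeout", "no-match-maxtime", "hotword-maxtime"},
    "error": {"grammar-load-failure", "recognizer-error", "language-unsupported"},
}


def classify(cause_name):
    """Map a hyphenated completion cause to an outcome label, or 'unknown'."""
    for outcome, names in OUTCOMES.items():
        if cause_name in names:
            return outcome
    return "unknown"


code, name = parse_mrcp_cause("002 no-input-timeout")
print(code, name, classify(name))
print(classify(normalize_ws_cause("NoMatch")))
```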

Completion reason

This header, sent in response to a RECOGNIZE command, is a more human-readable version of the Completion-Cause header.

Other headers

Active-Request-Id-List

When the client asks the server to STOP the recognition, the response includes this header. It contains the request ID of the RECOGNIZE request that was actually stopped.