Output protocol
This page describes what responses from the server look like and what clients can expect after sending a command.
Output events
RECOGNITION-COMPLETE
An event sent by the server to the client following a RECOGNIZE command. It indicates that the recognition is complete. The event's content includes the speech recognition result and its interpretation, which depends on the grammar.
Successful recognition
This is an example of a successful recognition, first over MRCP and then over WebSocket:
MRCP/2.0 544 RECOGNITION-COMPLETE 2 COMPLETE
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Completion-Cause: 000 success
Completion-Reason: success
Content-Type: application/x-nlsml
Content-Length: 331
<?xml version="1.0" encoding="UTF-8"?>
<result version="1.25.0">
<interpretation grammar="session:demo-grammar-0" confidence="1.00">
<instance>je veux changer mon billet</instance>
<input mode="speech" timestamp-start="2020-12-22T11:11:45.620+01:00" timestamp-end="2020-12-22T11:11:47.060+01:00" confidence="1.00" asr-model="fr.basic">je veux changer mon billet</input>
</interpretation>
</result>
{
  "event": "RECOGNITION-COMPLETE",
  "request_id": 8,
  "channel_id": "0x56432a7207d8",
  "headers": {},
  "completion_cause": "Success",
  "completion_reason": "success",
  "body": {
    "asr": {
      "transcript": "salut comment ça va",
      "confidence": 0.842793,
      "start": 1660899270646,
      "end": 1660899271696
    },
    "nlu": {
      "type": "builtin:speech/transcribe",
      "value": "salut comment ça va",
      "confidence": 0.842793
    },
    "grammar_uri": "session:transcribe",
    "asr_model": "fr.basic",
    "version": "1.25.0"
  }
}
With MRCP, the server returns a set of headers followed by a body, much like an HTTP response. With our WebSocket protocol, it returns a single JSON message in which the header and body equivalents are found in the same object.
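As an illustration of the MRCP layout, here is a minimal Python sketch that splits a raw event into its start line, headers, and body. The function name `parse_mrcp_event` is hypothetical, and the sketch assumes that a blank line separates the headers from the body, as in HTTP; it does not validate the length fields.

```python
def parse_mrcp_event(raw: str) -> dict:
    """Split a raw MRCP event into start-line fields, headers, and body."""
    # Assumption: headers and body are separated by the first blank line.
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    start_line, *header_lines = head.split("\n")
    # Start line: version, message length, event name, request ID, request state.
    version, length, event, request_id, state = start_line.split()
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are lowercased so lookups are case-insensitive.
        headers[name.strip().lower()] = value.strip()
    return {
        "version": version,
        "length": int(length),
        "event": event,
        "request_id": int(request_id),
        "state": state,
        "headers": headers,
        "body": body,
    }


# Illustrative message: the body is shortened, so the length fields are not accurate.
message = (
    "MRCP/2.0 544 RECOGNITION-COMPLETE 2 COMPLETE\r\n"
    "Channel-Identifier: 39ac4ea9750a4790@speechrecog\r\n"
    "Completion-Cause: 000 success\r\n"
    "Completion-Reason: success\r\n"
    "Content-Type: application/x-nlsml\r\n"
    "Content-Length: 331\r\n"
    "\r\n"
    "<result/>"
)
parsed = parse_mrcp_event(message)
print(parsed["event"], parsed["headers"]["completion-cause"])
```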
Headers are described below. The body includes the raw transcription and the ASR confidence (in the input tag for MRCP, in the asr object for WebSocket), as well as the interpretation and its own confidence (in the instance tag and the confidence attribute of the enclosing interpretation tag for MRCP, in the nlu object for WebSocket).
The interpretation confidence is always less than or equal to the ASR confidence.
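For MRCP clients, the tag layout described above can be read with a standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree`; the function name `parse_nlsml` and the output keys are hypothetical, and the sketch assumes the NLSML body has no XML namespaces and that the instance holds plain text, as in the example on this page.

```python
import xml.etree.ElementTree as ET


def _confidence(element):
    """Return the confidence attribute as a float, or None when absent."""
    value = element.get("confidence") if element is not None else None
    return float(value) if value is not None else None


def parse_nlsml(body: str) -> list[dict]:
    """Extract one dict per <interpretation> node of an NLSML body."""
    root = ET.fromstring(body.encode("utf-8"))
    results = []
    for interp in root.findall("interpretation"):
        instance = interp.find("instance")
        inp = interp.find("input")
        results.append({
            "grammar": interp.get("grammar"),
            # Interpretation text and its confidence.
            "interpretation": (instance.text or "") if instance is not None else "",
            "nlu_confidence": _confidence(interp),
            # Raw transcription and ASR confidence; itertext() also covers nested tags.
            "transcript": "".join(inp.itertext()).strip() if inp is not None else "",
            "asr_confidence": _confidence(inp),
        })
    return results


SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<result version="1.25.0">
<interpretation grammar="session:demo-grammar-0" confidence="1.00">
<instance>je veux changer mon billet</instance>
<input mode="speech" confidence="1.00" asr-model="fr.basic">je veux changer mon billet</input>
</interpretation>
</result>"""

results = parse_nlsml(SAMPLE)
best = results[0]
print(best["interpretation"], best["nlu_confidence"], best["asr_confidence"])
```

Because the function iterates over every interpretation node, the same sketch should also cover responses that carry several results.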
When N-Best-List-Length is greater than 1, the event may contain several results: in the alternatives field if the WebSocket protocol is used, or as additional interpretation XML nodes if the MRCP protocol is used.
MRCP/2.0 1465 RECOGNITION-COMPLETE 2 COMPLETE
Channel-Identifier: 3d3f47339a9444f9@speechrecog
Completion-Cause: 000 success
Completion-Reason: success
Content-Type: application/x-nlsml
Content-Length: 1250
<?xml version="1.0" encoding="UTF-8"?>
<result version="1.38.0">
<interpretation grammar="session:demo-grammar-0" confidence="0.93">
<instance>xu820856250fr</instance>
<input mode="speech" timestamp-start="2024-11-25T15:07:45.552+00:00" timestamp-end="2024-11-25T15:07:51.072+00:00" confidence="0.94" asr-model="fr.epellation.v26">x u quatre-vingt-deux zéro quatre-vingt-cinq soixante-deux cinquante f r</input>
</interpretation>
<interpretation grammar="session:demo-grammar-0" confidence="0.86">
<instance>xu820857050fr</instance>
<input mode="speech" timestamp-start="2024-11-25T15:07:45.552+00:00" timestamp-end="2024-11-25T15:07:51.072+00:00" confidence="0.88" asr-model="fr.epellation.v26">x u quatre-vingt-deux zéro quatre-vingt-cinq soixante-dix cinquante f r</input>
</interpretation>
<interpretation grammar="session:demo-grammar-0" confidence="0.85">
<instance>xu820857850fr</instance>
<input mode="speech" timestamp-start="2024-11-25T15:07:45.552+00:00" timestamp-end="2024-11-25T15:07:51.072+00:00" confidence="0.87" asr-model="fr.epellation.v26">x u quatre-vingt-deux zéro quatre-vingt-cinq soixante-dix-huit cinquante f r</input>
</interpretation>
</result>
{
  "event": "RECOGNITION-COMPLETE",
  "request_id": 5,
  "channel_id": "627be091490a5",
  "headers": {},
  "completion_cause": "Success",
  "completion_reason": "success",
  "body": {
    "asr": {
      "transcript": "l e cent trente-sept cent trente-sept huit cent soixante-six c n",
      "confidence": 0.967846,
      "start": 1732547018818,
      "end": 1732547024788
    },
    "nlu": {
      "type": "builtin:speech/spelling/mixed",
      "value": "le137137866cn",
      "confidence": 0.94947165
    },
    "grammar_uri": "session:parcel",
    "asr_model": "fr.epellation.v26",
    "version": "1.38.0",
    "alternatives": [
      {
        "asr": {
          "transcript": "l e cent trente-sept cent trente-sept huit cent soixante-six c m",
          "confidence": 0.937297,
          "start": 1732547018818,
          "end": 1732547024788
        },
        "nlu": {
          "type": "builtin:speech/spelling/mixed",
          "value": "le137137866cm",
          "confidence": 0.90146655
        },
        "grammar_uri": "session:parcel",
        "asr_model": "fr.epellation.v26",
        "version": "1.38.0"
      },
      {
        "asr": {
          "transcript": "e l e cent trente-sept cent trente-sept huit cent soixante-six c n",
          "confidence": 0.888096,
          "start": 1732547015908,
          "end": 1732547024788
        },
        "nlu": {
          "type": "builtin:speech/spelling/mixed",
          "value": "le137137866cn",
          "confidence": 0.8321435
        },
        "grammar_uri": "session:parcel",
        "asr_model": "fr.epellation.v26",
        "version": "1.38.0"
      }
    ]
  }
}
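On the WebSocket side, a client may want to treat the best result and the alternatives as one ranked list. The sketch below shows one possible way to do so in Python; the helper names `n_best` and `distinct_values` are hypothetical, and the message is a trimmed copy of the example above that keeps only the fields the sketch reads.

```python
def n_best(message: dict) -> list[dict]:
    """Return the best result followed by the alternatives, in server order."""
    body = message["body"]
    candidates = [body] + list(body.get("alternatives") or [])
    return [
        {
            "transcript": c["asr"]["transcript"],
            "value": c["nlu"]["value"],
            "confidence": c["nlu"]["confidence"],
        }
        for c in candidates
        # Skip entries without a transcription or an interpretation.
        if c.get("asr") and c.get("nlu")
    ]


def distinct_values(results: list[dict]) -> list[str]:
    """Keep the first occurrence of each interpreted value, preserving order."""
    seen = []
    for result in results:
        if result["value"] not in seen:
            seen.append(result["value"])
    return seen


message = {
    "event": "RECOGNITION-COMPLETE",
    "body": {
        "asr": {"transcript": "l e cent trente-sept cent trente-sept huit cent soixante-six c n", "confidence": 0.967846},
        "nlu": {"value": "le137137866cn", "confidence": 0.94947165},
        "alternatives": [
            {
                "asr": {"transcript": "l e cent trente-sept cent trente-sept huit cent soixante-six c m", "confidence": 0.937297},
                "nlu": {"value": "le137137866cm", "confidence": 0.90146655},
            },
            {
                "asr": {"transcript": "e l e cent trente-sept cent trente-sept huit cent soixante-six c n", "confidence": 0.888096},
                "nlu": {"value": "le137137866cn", "confidence": 0.8321435},
            },
        ],
    },
}

ranked = n_best(message)
print(distinct_values(ranked))
```

In this example, two different transcriptions lead to the same interpreted value, which is why deduplicating on the value can be useful before asking the caller to confirm.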
No match
The recognition is complete, but the input did not match any grammar, hence the no-match value of the completion-cause header. For example, over MRCP and then over WebSocket:
MRCP/2.0 658 RECOGNITION-COMPLETE 11 COMPLETE
Channel-Identifier: 1f38218a1f9d13e1@speechrecog
Completion-Cause: 001 no-match
Completion-Reason: unable to match grammar
Content-Type: application/x-nlsml
Content-Length: 427
<?xml version="1.0" encoding="UTF-8"?>
<result version="1.25.0">
<interpretation confidence="0.94">
<instance/>
<input mode="speech" timestamp-start="2022-08-19T09:13:39.310+00:00" timestamp-end="2022-08-19T09:13:40.090+00:00" confidence="0.94" asr-model="fr-basic">
<nomatch>je n'ai rien</nomatch>
</input>
</interpretation>
</result>
{
  "event": "RECOGNITION-COMPLETE",
  "request_id": 8,
  "channel_id": "0x564765bd0cd8",
  "headers": {},
  "completion_cause": "NoMatch",
  "completion_reason": "unable to match grammar",
  "body": {
    "asr": {
      "transcript": "je n'ai rien",
      "confidence": 0.998894,
      "start": 1660918888549,
      "end": 1660918889209
    },
    "nlu": null,
    "grammar_uri": "",
    "version": "1.25.0",
    "asr_model": "fr.basic"
  }
}
No input
If the user does not speak at all, or the sound level is too low to trigger the VAD, a No Input completion cause is returned. For example, over MRCP and then over WebSocket:
MRCP/2.0 365 RECOGNITION-COMPLETE 3 COMPLETE
Channel-Identifier: 8b9fca343f941dea@speechrecog
Completion-Cause: 002 no-input-timeout
Completion-Reason: no voice
Content-Type: application/x-nlsml
Content-Length: 142
<?xml version="1.0" encoding="UTF-8"?>
<result version="1.25.0">
<interpretation>
<instance/>
<input asr-model="en.basic"><noinput/></input>
</interpretation>
</result>
{
  "event": "RECOGNITION-COMPLETE",
  "request_id": 8,
  "channel_id": "0x557de1c01d58",
  "headers": {},
  "completion_cause": "NoInputTimeout",
  "completion_reason": "no voice",
  "body": {
    "asr": null,
    "nlu": null,
    "grammar_uri": "",
    "version": "1.25.0",
    "asr_model": "en.basic"
  }
}
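Since asr and nlu can be null in the no-match and no-input cases, WebSocket clients should check them before reading nested fields. Here is a minimal Python sketch of such handling; the function name `summarize` is hypothetical, and only the three completion_cause values shown on this page are handled explicitly.

```python
def summarize(message: dict) -> str:
    """Turn a RECOGNITION-COMPLETE message into a short outcome string."""
    cause = message["completion_cause"]
    body = message.get("body") or {}
    asr = body.get("asr")  # None when nothing was transcribed
    nlu = body.get("nlu")  # None when no grammar matched
    if cause == "Success" and nlu is not None:
        return "match: " + nlu["value"]
    if cause == "NoMatch":
        heard = asr["transcript"] if asr else ""
        return "no match, heard: " + heard
    if cause == "NoInputTimeout":
        return "no input"
    # Other causes are not covered by this sketch.
    return "other: " + cause


success = {
    "completion_cause": "Success",
    "body": {
        "asr": {"transcript": "salut comment ça va"},
        "nlu": {"value": "salut comment ça va"},
    },
}
no_match = {
    "completion_cause": "NoMatch",
    "body": {"asr": {"transcript": "je n'ai rien"}, "nlu": None},
}
no_input = {"completion_cause": "NoInputTimeout", "body": {"asr": None, "nlu": None}}

for msg in (success, no_match, no_input):
    print(summarize(msg))
```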
START-OF-INPUT
An event sent by the server to the client, indicating that the recognizer has detected speech. It is sent only in normal mode.
Receiving this event is useful in barge-in scenarios, where it can be used to stop the IVR/bot prompt.
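A barge-in handler on a WebSocket client could look like the following Python sketch. It assumes that START-OF-INPUT arrives as a JSON message whose event field holds the event name, by analogy with the RECOGNITION-COMPLETE messages shown on this page; the exact payload is not documented here. The names `handle_event` and `stop_prompt` are hypothetical.

```python
import json
from typing import Callable, Optional


def handle_event(raw: str, stop_prompt: Callable[[], None]) -> Optional[str]:
    """Decode one server message and stop the prompt on START-OF-INPUT."""
    # Assumption: every server event is a JSON object with an "event" field.
    event = json.loads(raw).get("event")
    if event == "START-OF-INPUT":
        # The caller started speaking: cut the IVR/bot prompt.
        stop_prompt()
    return event


stopped = []
handle_event('{"event": "START-OF-INPUT"}', lambda: stopped.append(True))
print(stopped)
```

Passing the stop action as a callback keeps the sketch independent of any particular audio or telephony library.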
GET-PARAMS
Allows the client to retrieve the current session parameters from the server. The list of headers can be found on the Input documentation page.
Headers & statuses
The following sections present the most useful headers and how to interpret their values.
Completion cause
This header indicates the reason the recognition request completed. It is sent in DEFINE-GRAMMAR and RECOGNIZE responses.
success
If the input command was RECOGNIZE, it means that it completed with a match.
If the input command was DEFINE-GRAMMAR, it means the operation was successful.
partial-match
In response to a RECOGNIZE command. The Speech Incomplete Timeout expired before there was a full match, but whatever was spoken up to that point was a partial match to one or more grammars. It can only happen in normal mode.
no-match
The RECOGNIZE command completed, but no match was found.
no-input-timeout
The RECOGNIZE command completed without any speech detected before the no-input timer expired. The timeout is set with the No-Input-Timeout header.
no-match-maxtime
In response to a RECOGNIZE command. The Recognition-Timeout expired, and whatever was spoken up to that point did not match any of the grammars. This cause can also be returned if the recognizer does not support detecting partial grammar matches.
success-maxtime
The RECOGNIZE command terminated because the speech was too long and the recognition-timeout timer expired, but whatever was spoken up to that point was a full match.
hotword-maxtime
The RECOGNIZE command in hotword mode completed without a match because either the recognition-timeout or the hotword-max-duration timer expired.
partial-match-maxtime
In response to a RECOGNIZE command. The Recognition-Timeout expired before a full match was achieved, but whatever was spoken up to that point was a partial match to one or more grammars.
grammar-load-failure
In response to a DEFINE-GRAMMAR command. The grammar could not be loaded.
recognizer-error
In response to a RECOGNIZE command. Something went wrong on the server's side.
language-unsupported
In response to a RECOGNIZE command. The value of the Speech-Language input header is invalid.
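To act on these causes in an MRCP client, the header value can be split into its numeric code and its name, then mapped to a coarse outcome. The sketch below is one possible approach in Python: the four groups and the function names are hypothetical and are not defined by the server.

```python
# Hypothetical grouping of the causes listed above, for client-side routing.
MATCH = {"success", "success-maxtime"}
PARTIAL = {"partial-match", "partial-match-maxtime"}
NO_RESULT = {"no-match", "no-input-timeout", "no-match-maxtime", "hotword-maxtime"}
FAILURE = {"grammar-load-failure", "recognizer-error", "language-unsupported"}


def parse_completion_cause(value: str) -> tuple[int, str]:
    """Split a header value such as '000 success' into (code, name)."""
    code, _, name = value.strip().partition(" ")
    return int(code), name.strip()


def classify(name: str) -> str:
    """Map a completion cause name to a coarse outcome."""
    if name in MATCH:
        return "match"
    if name in PARTIAL:
        return "partial"
    if name in NO_RESULT:
        return "no_result"
    if name in FAILURE:
        return "failure"
    return "unknown"


code, name = parse_completion_cause("001 no-match")
print(code, name, classify(name))  # prints: 1 no-match no_result
```

A dialog manager could, for instance, re-prompt on "no_result", ask for confirmation on "partial", and escalate on "failure".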
Completion reason
This header, sent in response to a RECOGNIZE command, is a more human-readable version of the completion-cause header.
Other headers
Active-Request-Id-List
When the client asks the server to STOP the recognition, the response includes this header. It contains the request ID of the RECOGNIZE request that was actually stopped.
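A client can use this header to check which request was stopped. The Python sketch below assumes the header value is a comma-separated list of numeric request IDs, as in the MRCPv2 specification; a single ID is simply a list of one. The function name `parse_active_request_ids` is hypothetical.

```python
def parse_active_request_ids(value: str) -> list[int]:
    """Parse an Active-Request-Id-List header value into request IDs."""
    # Assumption: IDs are separated by commas, with optional surrounding spaces.
    return [int(part) for part in value.split(",") if part.strip()]


stopped_ids = parse_active_request_ids("2")
print(stopped_ids)  # prints: [2]
```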