Getting started

We only support the MRCP recognizer service, as a gateway to our ASR, for users who need to leverage their legacy MRCPv2 tooling.
Please keep in mind that this gateway only implements the subset of the MRCPv2 specification that is compatible with our native ASR API: it is not, in any way, a port of our ASR engine covering all of MRCPv2.

Input protocol

Commands

  • SET-PARAMS
  • GET-PARAMS
  • DEFINE-GRAMMAR (only builtin grammars are supported, not custom grammars sent as SRGS XML; see the example after this list)
  • RECOGNIZE (multiple active grammars are supported, without weights)
  • STOP
  • START-INPUT-TIMERS
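
To illustrate, a typical exchange first binds a builtin grammar to a Content-ID with DEFINE-GRAMMAR, then activates it from a RECOGNIZE request. The message lengths, request IDs, channel identifier, builtin grammar URI and the use of a text/uri-list body are illustrative assumptions here, not values mandated by the gateway:

MRCP/2.0 190 DEFINE-GRAMMAR 1
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Content-Type: text/uri-list
Content-ID: demo-grammar-0
Content-Length: 23

builtin:grammar/zipcode

The grammar can then be referenced through its Content-ID; to activate several grammars at once, list several session: URIs in the body, one per line:

MRCP/2.0 170 RECOGNIZE 2
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Content-Type: text/uri-list
Content-Length: 22

session:demo-grammar-0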

Resource headers

  • Confidence-Threshold
  • No-Input-Timeout
  • Recognition-Timeout
  • Start-Input-Timers
  • Speech-Complete-Timeout
  • Speech-Incomplete-Timeout
  • Content-Type
  • Content-ID
  • Sensitivity-Level (ignored since version 1.6.1)
  • Speech-Language (2-letter ISO 639-1 codes; only "fr" and "en" for the moment)
  • Hotword-Min-Duration (RECOGNIZE method only)
  • Hotword-Max-Duration (RECOGNIZE method only)
  • Recognition-Mode (RECOGNIZE method only)
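
The resource headers above can be set per request or as session defaults via SET-PARAMS. A sketch of such a request follows; the message length, request ID, channel identifier and header values are illustrative only:

MRCP/2.0 186 SET-PARAMS 3
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Confidence-Threshold: 0.5
No-Input-Timeout: 5000
Recognition-Timeout: 15000
Speech-Language: fr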

Extensions

We provide some extended parameters that can be specified in the Vendor-Specific-Parameters MRCPv2 header:

  • Speech-Nomatch-Timeout: the required length (in milliseconds) of silence after speech before end-of-speech is declared when none of the grammars matched what the user said. The default value is 5000 ms.
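
As a sketch, the extension can be carried on a RECOGNIZE request like this; apart from the Speech-Nomatch-Timeout parameter itself, the surrounding values (message length, request ID, channel identifier, grammar URI) are illustrative assumptions:

MRCP/2.0 204 RECOGNIZE 4
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Vendor-Specific-Parameters: Speech-Nomatch-Timeout=3000
Content-Type: text/uri-list
Content-Length: 22

session:demo-grammar-0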

Recognize Modes

Normal mode

This is a free mode: once a speech segment is detected, the ASR starts, transcribing what the caller says and then analyzing the content. The ASR stops automatically after a breath segment that is not followed by any further speech segment (or after a timeout, if that parameter has been set).

There are three possible result types in this mode:

  • Complete match: the transcribed content matches exactly what was expected.
  • Partial match: the content only partially matches what was expected (example: a zip code is expected, but only 3 digits were spoken).
  • Nomatch: this can have several explanations:
    • The caller's answer does not fit into the expected framework of any of the grammars used
    • The timeout was reached before the caller was able to express themselves

Hotword mode

This mode can only be used with "closed" grammars.

We speak of a closed grammar when the possible set of results is finite. This concerns the following grammars: spelling + length, spelling + regex, boolean, keyword, zipcode.

In this mode, we no longer wait for a given duration, or simply for the caller to stop speaking, in order to stop the ASR. When the caller speaks, if the ASR transcribes and completes the associated grammar (that is, when a complete match is obtained), it stops immediately. This has several advantages:

  • Allowing more time for the caller to express themselves, if they need to, without stopping the ASR at each breath segment (example: a caller starts to speak, then pauses)
  • Improving the fluidity of the dialogue, since the ASR stops automatically as soon as the grammar has been completed.

However, in this mode there is no partial match possible, only the two following results:

  • If the caller has completed the grammar: complete-match
  • If the caller has not completed the grammar, or has only partially completed it, within Recognition-Timeout or Hotword-Max-Duration: nomatch (example: a zip code is expected but only 3 digits were spoken = nomatch).

In order not to disturb the smooth progress of the scenario, two parameters are also available (see the example after this list):

  • Hotword-Max-Duration: maximum duration of the ASR. A result is returned even if the caller is still talking after this time.
  • Hotword-Min-Duration: minimum duration of the ASR.
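
A hotword-mode RECOGNIZE therefore combines Recognition-Mode with these two headers. In the sketch below, the durations (in milliseconds), message length, request ID, channel identifier and grammar URI are illustrative assumptions:

MRCP/2.0 232 RECOGNIZE 5
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Recognition-Mode: hotword
Hotword-Min-Duration: 1000
Hotword-Max-Duration: 20000
Content-Type: text/uri-list
Content-Length: 22

session:demo-grammar-0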

Output protocol

Output sample

MRCP/2.0 544 RECOGNITION-COMPLETE 2 COMPLETE
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Completion-Cause: 000 success
Completion-Reason: success
Content-Type: application/x-nlsml
Content-Length: 331

<?xml version="1.0" encoding="UTF-8"?>
<result>
<interpretation grammar="session:demo-grammar-0" confidence="1.00">
<instance>je veux changer mon billet</instance>
<input mode="speech" timestamp-start="2020-12-22T11:11:45.620+01:00" timestamp-end="2020-12-22T11:11:47.060+01:00" confidence="1.00">je veux changer mon billet</input>
</interpretation>
</result>

The interpretation score is provided via the confidence attribute on both <interpretation> and <input>.

The output will contain a nomatch or noinput element depending on the audio stream and the RECOGNIZE parameters, for example:

<result>
<interpretation grammar="session:demo-grammar-0" confidence="1.00">
<instance/>
<input mode="speech" confidence="1.00"><nomatch>alors t n c</nomatch></input>
</interpretation>
</result>

<result>
<interpretation>
<instance/>
<input><noinput/></input>
</interpretation>
</result>

Events

  • START-OF-INPUT
  • RECOGNITION-COMPLETE

Headers & statuses

  • Completion-Cause:

    • success
    • no-input-timeout
    • partial-match
    • no-match-maxtime
    • no-match
    • success-maxtime
    • hotword-maxtime
    • partial-match-maxtime
  • Completion-Reason

For GET-PARAMS replies:

  • Confidence-Threshold
  • No-Input-Timeout
  • Recognition-Timeout
  • Start-Input-Timers
  • Speech-Complete-Timeout
  • Speech-Incomplete-Timeout
  • Sensitivity-Level
  • Speech-Language
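
For illustration, a GET-PARAMS reply could look like the following; the status line framing and the header values shown are illustrative, not guaranteed defaults:

MRCP/2.0 263 6 200 COMPLETE
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Confidence-Threshold: 0.5
No-Input-Timeout: 5000
Recognition-Timeout: 15000
Start-Input-Timers: true
Speech-Complete-Timeout: 800
Speech-Incomplete-Timeout: 1500
Sensitivity-Level: 0.5
Speech-Language: fr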

For STOP replies:

  • Active-Request-Id-List
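
For illustration, a STOP reply echoes the identifiers of the requests it cancelled; the values below are illustrative:

MRCP/2.0 112 7 200 COMPLETE
Channel-Identifier: 39ac4ea9750a4790@speechrecog
Active-Request-Id-List: 5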