Protocol reference

Protocol flow overview

  • WS handshake: the client initiates the connection, passing its authentication token (JWT) in the Authorization HTTP header.
  • The server verifies the token and grants access if it is valid; otherwise it refuses the connection.

Once the client is granted access, it can run several successive sessions without disconnecting and reconnecting. Quota management is left out of this blueprint, as it is a wider concern than just this service.

A session is both a context to which attributes (called "session attributes") can be attached and a unit of work: it scopes the "global" (default) recognition parameters the client may set, and it is the frame in which recognition requests are issued. In general, we recommend that a session match a human-bot interview or "conversation" (e.g. a phone call): the session is started when the call begins and stopped at hang-up.

Contrary to the conversation-based API, only one speaker's audio is streamed during a session and the ASR doesn't run continuously. The ASR+NLU is started by a recognition request when the client needs it, and it stops when the expected information is found or when a timer expires. The client describes the information it expects with one or more grammars. A grammar is a kind of preset, identified by a URI, that sets up both our ASR and NLU engines for the task. Some of these presets can be further customized by the client by passing parameters to them in a "query" string.

The client can issue several recognition requests during a session, but at any given time at most one request may be active: concurrent recognition requests cannot run in parallel. To try several possible interpretations at the same time, use one request with several grammars; see the recognition request below.
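Client side, the one-active-request rule is easy to enforce with a small guard. The sketch below is a hypothetical Python helper (RecognitionGuard is not part of the protocol); it tracks the request_id of the in-flight RECOGNIZE and refuses to start a second one:

```python
class RecognitionGuard:
    """Client-side guard enforcing at most one active RECOGNIZE request."""

    def __init__(self):
        self.active_request_id = None

    def start(self, request_id):
        """Register a new RECOGNIZE; fail if one is already in flight."""
        if self.active_request_id is not None:
            raise RuntimeError(
                f"request {self.active_request_id} is still active; "
                "use several grammars in one RECOGNIZE instead"
            )
        self.active_request_id = request_id

    def complete(self, request_id):
        """Called on RECOGNITION-COMPLETE or STOPPED for this request."""
        if request_id == self.active_request_id:
            self.active_request_id = None
```

The guard mirrors the server-side rule only; the server remains authoritative and will reject a second RECOGNIZE anyway.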

  • The client opens a new session, specifying session attributes that will then appear on invoices. Session attributes are useful for the client's accounting purposes.
  • Once a session is opened, the client can start streaming audio. No transcription will occur yet (and hence, no fees!).
  • While streaming, the client can send different commands (in the same connection) to the server:
    • SET-PARAMS — set default recognition parameters that will be valid for the rest of the session or until another SET-PARAMS command replaces them;
    • DEFINE-GRAMMAR — define a convenient alias for a grammar, valid for the rest of the session (there is no point in redefining an existing alias);
    • RECOGNIZE — start the ASR and interpretation process; the command can temporarily override any parameter previously set with SET-PARAMS; the recognition parameters (including the chosen grammars) define precisely how and when the recognition stops;
    • START-INPUT-TIMERS — among the termination conditions are timeouts; if the RECOGNIZE didn't start the timers already, the client can start them later with this command;
    • STOP — stop the ongoing RECOGNIZE process before it is due to terminate;
    • GET-PARAMS — get the current session parameter values.
  • The client can also close the session, at which point it must stop streaming.

The server responds to each message with a success or error message.

Any audio packet sent outside of a session is ignored instead of raising an error, because in practice the client won't always be able to perfectly synchronize the audio and the commands (this was observed in real life with MRCP). Indeed, a typical client architecture involves different threads or subsystems for bot orchestration and audio streaming.

Besides the responses to the requests, the server emits some events as the recognition process progresses:

  • START-OF-INPUT — in normal mode (see below), the server emits this event when voice is first detected;
  • RECOGNITION-COMPLETE — when the recognition process is complete, whether it succeeded or failed.

Recognition modes

Two recognition modes are available: normal mode and hotword mode. They differ mainly in how they match grammars and how they terminate. When a client issues a RECOGNIZE request, it must specify the recognition mode.

The semantics are the same as those of MRCP.

Messages and events

Here are the precise definitions of the different messages and events.

Except for the audio packets, which are sent as binary WS messages, all commands from the client (requests) are sent as WS text messages containing UTF-8 encoded JSON objects of the form:

{
  "command": "some-command-name",
  "request_id": 0,
  "channel_id": "uie46e4ui6",
  "headers": {},
  "body": "unicode text"
}

where:

Field | Type | Possible values | Description
command | str | see above | the command name
request_id | int | monotonic counter | unique request identifier, set by the client and referenced by responses and events from the server
channel_id | str | unique identifier | session identifier, set by the server and repeated by the client
headers | map | command dependent | command-specific parameters
body | text | command dependent | command-specific payload

The command used to open a session doesn't need to provide a channel_id since it is not set yet. If that command provides a value in that field, it will be used as a prefix for the actual channel_id returned by the server. The channel_id doesn't change during the session.

The channel_id is currently only useful for debugging, as the pair channel_id + request_id uniquely identifies a request (and its follow-up responses) across sessions and clients. In the future, it may be used to multiplex several parallel sessions in the same connection.
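Assuming a Python client, building the envelope above can be wrapped in a small helper; make_command and the module-level counter are illustrative names, not part of the protocol:

```python
import itertools
import json

# Monotonic request_id counter, as the envelope requires.
_request_ids = itertools.count()

def make_command(command, channel_id="", headers=None, body=""):
    """Build the JSON text frame for a client command.

    channel_id is left empty for the initial OPEN command, since the
    server has not assigned one yet.
    """
    return json.dumps({
        "command": command,
        "request_id": next(_request_ids),
        "channel_id": channel_id,
        "headers": headers or {},
        "body": body,
    })
```

For example, make_command("OPEN", headers={"custom_id": "blueprint"}) yields the OPEN frame shown in the complete example at the end of this document.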

All responses to commands, as well as events, are also WS text messages containing UTF-8 encoded JSON objects of the form:

{
  "event": "some-event-name",
  "request_id": 0,
  "channel_id": "uie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": {}
}

where:

Field | Type | Possible values | Description
event | str | see each command below | the event name
request_id | int | any request_id already received from the client | reference to the request this event responds to
channel_id | str | unique identifier | session identifier set by the server
completion_cause | str / null | event dependent | optional complementary status of the event
completion_reason | str / null | free-form text | optional explanation message
headers | map | event dependent | event-specific attributes
body | object | event dependent | the event payload

In the definitions below, we won't repeat request_id or channel_id, and empty fields are omitted.

Open a session

  • command: OPEN
  • headers:
    • custom_id: string, freely set by the client (to identify its own client, for example)

The custom_id sent in the header is reproduced unchanged on usage reports.

In the future we may add more session attributes like, for example, the audio codec.

Success response

  • event: OPENED
  • channel_id: string, the identifier given to this session by the server.

Error responses

  • event = METHOD-NOT-VALID when a session is already opened (this event has no channel_id)
  • event = INVALID-PARAM-VALUE in case of JSON schema error (this event has no request_id, no channel_id)
  • event = METHOD-FAILED for other errors with:
    • completion_cause = Error
    • completion_reason = the actual error explanation

In case of a METHOD-FAILED response, the channel must still be closed by the client.

Stream audio

Audio samples are sent as WS binary messages. Unless a codec is set when opening a session (in future iterations), the audio must be in raw PCM format (no headers, no attributes, just raw audio): frames must be 16-bit signed little-endian integers sampled at 8 kHz, mono only.

For efficiency on WS, each packet should contain at least 50 milliseconds of audio, i.e. 400 frames or 800 bytes (but less than 100 ms to keep latency low). So if you are converting an RTP stream that transmits 10 to 20 ms packets, you'll have to buffer them (or, better yet, use the MRCP API instead of this WS API, since WS, being TCP only, has more overhead than datagram-based protocols).
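At 8 kHz, 16-bit mono, one millisecond of audio is 16 bytes, so the 50 ms minimum is 800 bytes. A minimal rebuffering sketch in Python (AudioRebuffer is a hypothetical helper), which also rejects odd byte counts since those would make the server close the session:

```python
BYTES_PER_MS = 8000 * 2 // 1000  # 8 kHz, 16-bit mono => 16 bytes/ms
MIN_PACKET = 50 * BYTES_PER_MS   # 50 ms => 800 bytes

class AudioRebuffer:
    """Accumulate small PCM payloads into >= 50 ms WS binary messages."""

    def __init__(self):
        self.buf = bytearray()

    def push(self, payload: bytes):
        """Add one RTP-sized payload; return a WS message once full, else None."""
        if len(payload) % 2 != 0:
            # A truncated 16-bit frame would make the server close the session.
            raise ValueError("truncated frame in audio packet")
        self.buf.extend(payload)
        if len(self.buf) >= MIN_PACKET:
            out = bytes(self.buf)
            self.buf.clear()
            return out
        return None
```

With 20 ms RTP payloads, the third push crosses the 800-byte threshold and flushes one 60 ms WS message.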

The server won't return any acknowledgement upon receiving audio packets.

If you send truncated frames, i.e. an odd number of bytes, the server will close the session:

  • event: CLOSED
  • completion_cause: Error
  • completion_reason: truncated frame in audio packet

Set session defaults

Set session default values for the recognition parameters.

This command is optional. Use it if you don't want to set and repeat the recognition parameters on each individual RECOGNIZE request. It's a matter of taste and client implementation.

  • command: SET-PARAMS
  • headers are any subset (all optional) of:
Field | Type | Possible values or unit | Description
no_input_timeout | int | milliseconds | when recognition is started and no speech is detected for this period, the recognizer sends a RECOGNITION-COMPLETE event to the client with a completion_cause of NoInputTimeout
speech_complete_timeout | int | ms (normal mode only) | applies when the recognizer currently has a complete match against an active grammar; specifies how long the recognizer MUST wait for more input before declaring a match
speech_incomplete_timeout | int | ms (normal mode only) | applies when the speech prior to the silence is an incomplete match of all active grammars; once the timeout is triggered, the partial result is returned with a completion_cause of PartialMatch
speech_nomatch_timeout | int | ms (normal mode only) | applies when the speech prior to the silence matches none of the active grammars; once the timeout is triggered, the transcript is returned without interpretation and with a completion_cause of NoMatch
hotword_min_duration | int | ms (hotword only) | minimum duration of an utterance considered for hotword recognition
hotword_max_duration | int | ms (hotword only) | maximum duration of an utterance considered for hotword recognition
recognition_timeout | int | ms | when recognition is started and there is no match for this period, the recognizer sends a RECOGNITION-COMPLETE event to the client and terminates the recognition operation
confidence_threshold | float | 0 ≤ c ≤ 1 | the confidence level the client considers a successful match
speech_language | str | RFC 1766 tag | only fr, fr-FR, en, en-US, en-GB are supported for the moment

Unknown headers are just ignored without error.
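Because unknown headers are silently dropped, a misspelled parameter name fails without any error. A cheap client-side check against the known names catches this early; KNOWN_PARAMS and check_params below are illustrative helpers, with the names taken from the table above:

```python
# Recognition parameter names accepted by SET-PARAMS (see the table above).
KNOWN_PARAMS = {
    "no_input_timeout", "speech_complete_timeout", "speech_incomplete_timeout",
    "speech_nomatch_timeout", "hotword_min_duration", "hotword_max_duration",
    "recognition_timeout", "confidence_threshold", "speech_language",
}

def check_params(headers: dict) -> set:
    """Return the header keys the server would silently ignore."""
    return set(headers) - KNOWN_PARAMS
```

For instance, check_params({"no_input_timout": 5000}) flags the misspelled key instead of letting the default timeout silently apply.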

In hotword mode, there is a subtle difference between recognition_timeout and hotword_max_duration: the recognition_timeout timer is reset on each silence, whereas the hotword_max_duration timer is not.

In normal mode, the recognition_timeout timer is not reset on silences!

Success response

  • event: PARAMS-SET

Error responses

  • event: METHOD-FAILED

    • completion_cause: LanguageUnsupported
    • completion_reason contains details
  • event: INVALID-PARAM-VALUE

    • we were unable to parse the JSON payload (wrong types or syntax error)
    • completion_cause: Error,
    • completion_reason: detailed error message
    • as the command was not decoded, the request_id header of this response may be 0, and the channel_id could be empty.

Why two different events? In the first case, the value (a language tag) is valid but not (currently) supported, as opposed to invalid values that will never be correct. For example, "speech_language": "ar-SA" would raise a METHOD-FAILED whereas "speech_language": 78.6 would raise an INVALID-PARAM-VALUE.

Get session defaults

Get session default values of the recognition parameters.

  • command GET-PARAMS

Success response

  • event: DEFAULT-PARAMS

Error responses

None, this command never fails.

Define a grammar alias

  • command: DEFINE-GRAMMAR
  • headers:
Field | Type | Possible values or unit | Description
content_id | str | ASCII string | the alias id you want to define (without the session: prefix)
content_type | str | "text/uri-list" | the type of the body; only "text/uri-list" is supported for the moment
  • body: the grammar you want to alias. Examples:
    • builtin:speech/address
    • builtin:speech/keywords?alternatives=facture|commande|compte|conseiller
    • builtin:speech/spelling/mixed?regex=[a-z]{2}[0-9]{9}[a-z]{2}
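Such grammar URIs can be assembled with a small helper. builtin_grammar below is a hypothetical sketch; note it applies no percent-encoding, since the documented examples use literal characters such as | and {} in the query string:

```python
def builtin_grammar(path: str, **params) -> str:
    """Build a builtin grammar URI with an optional query string.

    Parameter names (regex, alternatives, length) come from the
    examples in this document; values are used verbatim, unencoded.
    """
    uri = f"builtin:speech/{path}"
    if params:
        uri += "?" + "&".join(f"{k}={v}" for k, v in params.items())
    return uri
```

E.g. builtin_grammar("keywords", alternatives="facture|commande") reproduces the second example above.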

Success response

  • event: GRAMMAR-DEFINED

Error responses

  • event: METHOD-FAILED

    • completion_cause: GramDefinitionFailure or GramLoadFailure
    • completion_reason: the explanatory error message
  • event: METHOD-NOT-VALID if a recognition process is in progress

  • event: MISSING-PARAM if content_id is missing

Start recognition

  • command: RECOGNIZE
  • headers: the same as SET-PARAMS plus:
Field | Type | Possible values or unit | Description
recognition_mode | str | "normal" / "hotword" | the recognition mode
start_input_timers | bool | true / false | false tells the recognizer to start recognition but not to start the no-input timer yet; default is false
content_type | str | "text/uri-list" | the type of the body; only "text/uri-list" is supported for the moment

If the client chooses not to start the timers immediately, it should issue a START-INPUT-TIMERS command later.

  • body (required): multi-line string containing the grammars to use, one per line (order is significant for priorities).

The grammar references may be builtin grammars (builtin:…) or aliases prefixed by session:.

If more than one grammar is given, the server will try to match them all at the same time. The first matching grammar "wins". If several grammars match at once, the one that is earlier in the list has priority.

Future recognition events will reference this request in their request_id.
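Putting the pieces together, a RECOGNIZE frame could be built like this (make_recognize is an illustrative helper; extra keyword arguments stand in for the per-request parameter overrides):

```python
import json

def make_recognize(request_id, channel_id, grammars, mode="normal", **params):
    """Build a RECOGNIZE text frame.

    grammars: grammar references (builtin:... or session:... aliases),
    highest priority first; they are joined one per line in the body.
    Extra keyword params override the SET-PARAMS defaults for this request.
    """
    headers = {"recognition_mode": mode, "content_type": "text/uri-list"}
    headers.update(params)
    return json.dumps({
        "command": "RECOGNIZE",
        "request_id": request_id,
        "channel_id": channel_id,
        "headers": headers,
        "body": "\n".join(grammars),
    })
```

Passing two grammars, e.g. ["session:immat", "builtin:speech/transcribe"], gives the alias priority while keeping a plain transcription fallback.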

Success response

  • event: RECOGNITION-IN-PROGRESS

Error responses

  • event: METHOD-FAILED
    • completion_cause: GramLoadFailure or Error (for ASR errors or when a recognition request is already in progress)
    • completion_reason: the explanatory error message

Asynchronous Recognition events

"start of input" event

In normal mode only, when the recognition process starts hearing a voice, this event is fired:

  • event: START-OF-INPUT

Outcome event

Some time later, when the recognition completes, a RECOGNITION-COMPLETE event is fired, with a completion_cause header that may be one of:

  • Success: at least one of the grammars matched
  • NoInputTimeout: no voice was heard before no_input_timeout expired
  • NoMatch: in normal mode, what was "heard" contradicts all grammars, or the confidence was too low to accept the match
  • NoMatchMaxtime: no match was found before recognition_timeout was reached in normal mode
  • HotwordMaxtime: no match was found before recognition_timeout or hotword_max_duration was reached in hotword mode
  • TooMuchSpeechTimeout: a match was found, but new speech was still matching when recognition_timeout or hotword_max_duration expired (the match is returned)
  • PartialMatch: (normal mode only) only a partial match was found
  • PartialMatchMaxtime: (normal mode only) only a partial match was found and new speech continued to partially match the grammar until recognition_timeout or hotword_max_duration expired (the partial match is returned)

The completion_reason header may give additional justification.

The body contains the Recognition Result JSON object.
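A bot typically branches on completion_cause before looking at the body. The sketch below is our own simplification (handle_complete and the grouping into RESULT_CAUSES are illustrative); causes such as NoMatch and PartialMatch still carry a usable transcript per the list above:

```python
# Causes whose body may still carry a (possibly partial) result.
RESULT_CAUSES = {"Success", "NoMatch", "NoMatchMaxtime", "HotwordMaxtime",
                 "TooMuchSpeechTimeout", "PartialMatch", "PartialMatchMaxtime"}

def handle_complete(event: dict):
    """Classify a RECOGNITION-COMPLETE event for the bot logic.

    Returns (status, body) where body is kept whenever the cause may
    carry a result, else None.
    """
    cause = event.get("completion_cause")
    if cause == "Success":
        return "ok", event.get("body")
    if cause in RESULT_CAUSES:
        return "partial", event.get("body")
    return "failed", None  # e.g. NoInputTimeout or Error
```

A real bot would likely branch per cause (reprompt on NoInputTimeout, confirm on PartialMatch, and so on); this only shows the coarse split.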

Start input timers

  • command: START-INPUT-TIMERS

Success response

  • event: INPUT-TIMERS-STARTED

Error responses

None

Stop ongoing recognition

  • command: STOP

Success response

  • event: STOPPED
  • headers:
    • active_request_id: the id of the RECOGNIZE request that was cancelled

Error responses

If there is no recognition process in progress, the command is simply ignored.

Close session

  • command: CLOSE

Success response

  • event: CLOSED

Error responses

  • event: METHOD-NOT-VALID if no session was open.

Grammars

We support the following builtin grammars:

  • builtin:grammar/none
  • builtin:speech/none
  • builtin:speech/address
  • builtin:speech/address?struct: return a structured address as XML
  • builtin:speech/boolean to match whether the speaker agrees or not. The interpretation returns "yes" or "no".
  • builtin:speech/transcribe
  • builtin:speech/text2num
  • builtin:speech/spelling/mixed
  • builtin:speech/spelling/digits
  • builtin:speech/spelling/letters
  • builtin:speech/spelling/mixed?regex= + pattern (the interpretation returns the match as a single word) — partial matches are not supported
  • builtin:speech/spelling/digits?regex= + pattern (the interpretation returns the match as a single word) — partial matches are not supported
  • builtin:speech/spelling/letters?regex= + pattern (the interpretation returns the match as a single word) — partial matches are not supported
  • builtin:speech/spelling/mixed?length= + integer (forces interpretation as a single word of the given length)
  • builtin:speech/spelling/digits?length= + integer (forces interpretation as a single word of the given length)
  • builtin:speech/spelling/letters?length= + integer (forces interpretation as a single word of the given length)
  • builtin:speech/spelling/zipcode
  • builtin:speech/zipcode (alias of builtin:speech/spelling/zipcode) — beware that this builtin is not universal: it only recognizes 5-digit zipcodes that are splittable as 2+3 digits. We recommend you use builtin:speech/spelling/digits?length= + integer for such applications instead.
  • builtin:speech/keywords?alternatives= + <alternatives> where <alternatives> is a list of words or expressions separated by |. The grammar matches when one of the listed keywords is found, and the returned interpretation is the found keyword.

You'll find more details on the MRCP page related to grammars.

See the MRCP Recognizer documentation. Instead of XML we return a recognition result object:

Recognition result

Field | Type | Possible values or unit | Description
asr | object | Transcript object | the result of the speech-to-text transcription
nlu | object | Interpretation object | the interpreted result (what the engine understood) as structured data
grammar_uri | str | grammar URI | the grammar that matched (as specified in the RECOGNIZE request)

Depending on the completion_cause, some or all fields may be empty (null or "").

Transcript

Field | Type | Possible values or unit | Description
transcript | str | UTF-8 | the raw ASR transcript
confidence | float | 0 ≤ c ≤ 1 | the ASR transcript confidence
start | int | Unix timestamp in ms | timestamp of the start of the transcript
end | int | Unix timestamp in ms | timestamp of the end of the transcript

Interpretation

Field | Type | Possible values or unit | Description
type | URI | builtin grammar URI | the actual builtin grammar that matched (once all aliases are resolved, without the query part)
value | grammar dependent | grammar dependent | the actual semantic interpretation
confidence | float | 0 ≤ c ≤ 1 | the interpretation confidence

All currently available grammars return a string as value except:

  • the boolean grammar that returns a boolean value;
  • the address grammar that returns an address object:
{
  "type": "builtin:speech/address",
  "value": {
    "number": "37",
    "street": "rue du docteur leroy",
    "zipcode": "72000",
    "city": "le mans"
  },
  "confidence": 0.9
}

In any case, the client should check the type to know how to handle the interpretation, even if it is a plain string.
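A small dispatcher on the resolved type might look like this; interpretation_value is a hypothetical helper, and the "yes"/"no" fallback for the boolean grammar covers both spellings of its result mentioned in this document:

```python
def interpretation_value(nlu: dict):
    """Decode the NLU value according to its resolved builtin type.

    Booleans and addresses are structured; everything else is returned
    as a plain string.
    """
    t = nlu["type"]
    v = nlu["value"]
    if t == "builtin:speech/boolean":
        # Accept either a JSON boolean or the "yes"/"no" string form.
        return v if isinstance(v, bool) else v == "yes"
    if t == "builtin:speech/address":
        # v is an object with number, street, zipcode, city.
        return dict(v)
    return str(v)
```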

Complete example

The connection is already granted.

Session initiation

A new user is calling the bot, so the bot opens a new Samosa session and sends:

{
  "command": "OPEN",
  "request_id": 0,
  "channel_id": "test",
  "headers": {
    "custom_id": "blueprint"
  },
  "body": ""
}

The command is successful and the server responds:

{
  "event": "OPENED",
  "request_id": 0,
  "channel_id": "testuie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": ""
}

The client starts streaming. Audio packets are omitted here.

Prepare the grammar (the NLU) to interpret the first user answer

The bot expects a French car plate number. For clarity and reusability, the bot developers define an alias for it, creating a custom grammar called immat out of the builtin builtin:speech/spelling/mixed grammar, configured with a custom regex telling the NLU what a French car plate looks like:

{
  "command": "DEFINE-GRAMMAR",
  "request_id": 1,
  "channel_id": "testuie46e4ui6",
  "headers": {
    "content_id": "immat",
    "content_type": "text/uri-list"
  },
  "body": "builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2})"
}

The command is successful and the server confirms:

{
  "event": "GRAMMAR-DEFINED",
  "request_id": 1,
  "channel_id": "testuie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": ""
}

Set some default values

To keep the recognition command "light", the bot developers set the recognition language and confidence threshold once for the duration of the session:

{
  "command": "SET-PARAMS",
  "request_id": 2,
  "channel_id": "testuie46e4ui6",
  "headers": {
    "speech_language": "fr",
    "confidence_threshold": 0.7
  },
  "body": ""
}

The command is successful and the server responds:

{
  "event": "PARAMS-SET",
  "request_id": 2,
  "channel_id": "testuie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": ""
}

Start the recognition

The bot asks the user to spell their car plate number and then instructs the Samosa session to listen to the user and interpret what they are saying:

{
  "command": "RECOGNIZE",
  "request_id": 3,
  "channel_id": "testuie46e4ui6",
  "headers": {
    "recognition_mode": "normal",
    "no_input_timeout": 5000,
    "recognition_timeout": 30000,
    "speech_complete_timeout": 800,
    "speech_incomplete_timeout": 1500,
    "speech_nomatch_timeout": 3000,
    "content_type": "text/uri-list"
  },
  "body": "session:immat"
}

Instead of session:immat, the bot devs could have decided not to define the alias and to directly use the builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2}) expression here.

The process is successfully started and the server responds:

{
  "event": "RECOGNITION-IN-PROGRESS",
  "request_id": 3,
  "channel_id": "testuie46e4ui6",
  "completion_cause": "Success",
  "completion_reason": null,
  "headers": {},
  "body": ""
}

Receive results

The bot now waits for more events from the server while the user is speaking, hopefully spelling their car plate number, among other things (the NLU is quite robust).

As we are using normal mode, as soon as the user's voice is detected, the server sends the START-OF-INPUT event:

{
  "event": "START-OF-INPUT",
  "request_id": 3,
  "channel_id": "testuie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": ""
}

Some time later, when the user is done speaking, the server returns what it has understood (in this case, a success):

{
  "event": "RECOGNITION-COMPLETE",
  "request_id": 3,
  "channel_id": "testuie46e4ui6",
  "completion_cause": "Success",
  "completion_reason": null,
  "headers": {},
  "body": {
    "asr": {
      "transcript": "attendez alors voilà baissé trois cent cinq f z",
      "confidence": 0.9,
      "start": 1629453934909,
      "end": 1629453944833
    },
    "nlu": {
      "type": "builtin:speech/spelling/mixed",
      "value": "bc305fz",
      "confidence": 0.86
    },
    "grammar_uri": "session:immat"
  }
}
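As a sanity check, the value returned in the nlu field does match the regex the bot registered for the immat alias; a quick verification in Python:

```python
import re

# The regex passed to DEFINE-GRAMMAR for the "immat" alias above.
IMMAT = re.compile(r"([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2})")

def is_plate(value: str) -> bool:
    """Check a recognized value against the French car plate pattern."""
    return IMMAT.fullmatch(value) is not None
```

Here is_plate("bc305fz") holds (2 letters + 3 digits + 2 letters), while the raw transcript obviously would not.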