WebSocket for voicebots

All int types are unsigned 64-bit integers.

Messages and events

Here are the precise definitions of the different messages and events.

Except for audio packets, which are sent as binary WS messages, all commands from the client (requests) are sent as WS text messages containing UTF-8 encoded JSON objects of the form:

{
  "command": "some-command-name",
  "request_id": 0,
  "channel_id": "uie46e4ui6",
  "headers": {},
  "body": "unicode text"
}

where:

| Field | Type | Possible Values | Description |
| --- | --- | --- | --- |
| command | str | see overview | the command name |
| request_id | int | monotonic counter | unique request identifier set by the client, for reference by responses or events from the server |
| channel_id | str | unique identifier | session identifier set by the server and repeated by the client |
| headers | map | command dependent | command-specific parameters |
| body | text | command dependent | command-specific payload |

The command used to open a session doesn't need to provide a channel_id since it is not set yet. If that command provides a value in that field, it will be used as a prefix for the actual channel_id returned by the server. The channel_id doesn't change during the session.

The channel_id is currently only useful for debugging, as the pair channel_id + request_id uniquely identifies a request (and its follow-up responses) across sessions and clients. In the future, it may be used to multiplex several parallel sessions in the same connection.
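As a sketch, a client can centralize this envelope in a small helper that also maintains the monotonic request_id counter (the function name is ours; the field names follow the table above):

```python
import itertools
import json

# Monotonic counter for request_id, as required by the protocol.
_request_ids = itertools.count()

def build_command(command, channel_id="", headers=None, body=""):
    """Serialize a client request envelope as a UTF-8 JSON text message."""
    return json.dumps({
        "command": command,
        "request_id": next(_request_ids),
        "channel_id": channel_id,   # empty until the server assigns one
        "headers": headers or {},
        "body": body,
    })
```

The first OPEN command can leave channel_id empty (or pass a prefix); every later command repeats the server-assigned value.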

All responses to commands, as well as server events, are also WS text messages containing UTF-8 encoded JSON objects of the form:

{
  "event": "some-event-name",
  "request_id": 0,
  "channel_id": "uie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": {}
}

where:

| Field | Type | Possible Values | Description |
| --- | --- | --- | --- |
| event | str | see each command below | the event name |
| request_id | int | any request_id already received from the client | reference to the request this event responds to |
| channel_id | str | unique identifier | session identifier set by the server |
| completion_cause | str / null | event dependent | optional: complementary status of the event |
| completion_reason | str / null | free-form | optional explanation message |
| headers | map | event dependent | event-specific attributes |
| body | object | event dependent | event payload |

In the definitions below, we won't repeat the request_id or channel_id, and empty fields are omitted.

Open a session

  • command: OPEN
  • headers:
    • custom_id: string, optional, freely set by the client (to identify its own customer, for example)
    • session_id: string, optional, freely set by the client to identify the session on its side; it will be visible in the server logs and in the Dev Console.
    • audio_codec: string, to set the codec of the streamed audio:
      • "linear": raw PCM, 16-bit signed little-endian samples at an 8 kHz rate;
      • "g711a": G.711 A-law at an 8 kHz sample rate;
      • "g711u": G.711 μ-law at an 8 kHz sample rate.

The custom_id sent in the header is reproduced unchanged on usage reports and invoices.
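As an illustration, an OPEN request with these headers might be assembled like this (the helper is ours; only the command, header, and codec names come from this spec):

```python
import json

# Codec identifiers accepted by the OPEN command.
SUPPORTED_CODECS = {"linear", "g711a", "g711u"}

def build_open(custom_id=None, session_id=None, audio_codec="linear",
               channel_id_prefix=""):
    """Build the OPEN command; channel_id here is only a prefix hint."""
    if audio_codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported audio_codec: {audio_codec}")
    headers = {"audio_codec": audio_codec}
    if custom_id is not None:
        headers["custom_id"] = custom_id      # reproduced on invoices
    if session_id is not None:
        headers["session_id"] = session_id    # visible in logs / Dev Console
    return json.dumps({
        "command": "OPEN",
        "request_id": 0,
        "channel_id": channel_id_prefix,
        "headers": headers,
        "body": "",
    })
```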

Success response

  • event: OPENED
  • channel_id: string, the identifier given to this session by the server.

Error responses

  • event: METHOD-NOT-VALID when a session is already opened (this event has no channel_id)
  • event: INVALID-PARAM-VALUE in case of a JSON schema error (this event has no request_id and no channel_id)
  • event: METHOD-FAILED for other errors, with:
    • completion_cause: Error
    • completion_reason: the actual error explanation

In case of a METHOD-FAILED response, the channel must still be closed by the client.

All opened sessions must be closed properly.

Close a session

  • command: CLOSE

Response

  • event: CLOSED

After a session is closed, you can either start another session in the same websocket connection or close the websocket connection.

If you choose to close the websocket connection, the client should initiate the websocket close handshake and wait for the server to close the connection.

Stream audio

Audio samples are sent as WS binary messages. Unless a codec was set when opening the session, the audio must be in raw PCM format (no headers, no attributes, just raw audio): frames must be 16-bit signed little-endian integers sampled at 8 kHz, mono only.

For efficiency over WS, packets should carry at least 50 milliseconds of audio, i.e. 400 frames (800 bytes in linear, 400 bytes in G.711), but less than 100 ms to keep latency low. So if you are converting an RTP stream that transmits 10 to 20 ms packets, you'll have to buffer them (or better yet, use the MRCP API instead of this WS API: WS, being TCP-based, has more overhead than datagram-based protocols).

The server won't return any acknowledgement upon receiving audio packets.

If you send truncated frames, i.e. an odd number of bytes, the server will close the session:

  • event: CLOSED
  • completion_cause: Error
  • completion_reason: truncated frame in audio packet
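The 50-100 ms guideline and the even-byte-count rule above can be enforced with a small re-chunking helper, sketched here for the linear codec (the sizes follow from 16-bit mono at 8 kHz; the function name is ours):

```python
# Linear codec: 8 kHz * 2 bytes per frame = 16 bytes per millisecond.
BYTES_PER_MS = 16

def chunk_pcm(pcm: bytes, chunk_ms: int = 50):
    """Split a linear PCM buffer into chunks suitable for WS binary messages.

    Yields chunk_ms-sized chunks; the final chunk may be shorter but is
    always an even number of bytes, so no 16-bit frame is truncated.
    """
    if len(pcm) % 2:
        raise ValueError("odd byte count: truncated 16-bit frame")
    step = chunk_ms * BYTES_PER_MS
    for i in range(0, len(pcm), step):
        yield pcm[i:i + step]
```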

Set session defaults

Set session default values for the recognition parameters.

This command is optional. Use it if you don't want to set and repeat the recognition parameters on each individual RECOGNIZE request. It's a matter of taste and client implementation.

  • command: SET-PARAMS
  • headers are any subset (all optional) of:
| Field | Type | Possible Values or unit | Description |
| --- | --- | --- | --- |
| no_input_timeout | int | milliseconds | when recognition has started and no speech is detected for this period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client with a completion_cause of NoInputTimeout |
| speech_complete_timeout | int | (normal mode only) ms | applies when the recognizer currently has a complete match against an active grammar; specifies how long the recognizer MUST wait for more input before declaring a match |
| speech_incomplete_timeout | int | (normal mode only) ms | applies when the speech prior to the silence is an incomplete match of all active grammars; once the timeout is triggered, the partial result is returned with a completion_cause of PartialMatch |
| speech_nomatch_timeout | int | (normal mode only) ms | applies when the speech prior to the silence doesn't match any of the active grammars; once the timeout is triggered, the transcribed speech input is returned without interpretation and with a completion_cause of NoMatch |
| hotword_min_duration | int | (hotword only) ms | the minimum duration of an utterance that will be considered for hotword recognition |
| hotword_max_duration | int | (hotword only) ms | the maximum duration of an utterance that will be considered for hotword recognition |
| recognition_timeout | int | ms | when recognition has started and there is no match for this period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client and terminate the recognition operation |
| confidence_threshold | float | 0 ≤ v ≤ 1 | the confidence level the client considers a successful match (default is 0.5) |
| sensitivity_level | float | 0 ≤ v ≤ 1 | the Voice Activity Detection (VAD) sensitivity (default is 0.5) |
| speech_language | str | RFC 5646 | only fr, fr-FR, en, en-US, en-GB are supported for the moment, plus some business domain extensions |
| logging_tag | str | any | tags all the interactions following the SET-PARAMS command, so they can be tracked in the logs and the Dev Console |

Unknown headers are just ignored without error.

In hotword mode, there is a subtle difference between recognition_timeout and hotword_max_duration: The recognition_timeout timer is reset on each silence whereas the hotword_max_duration timer is not.

In Normal mode, the recognition_timeout timer is not reset on silences!
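A little client-side validation before sending SET-PARAMS can catch mistakes early. A minimal sketch (header names and ranges come from the table above; the helper and its strictness are our own, since the server itself silently ignores unknown headers):

```python
import json

# Header names from the SET-PARAMS table; int headers are in milliseconds.
_INT_HEADERS = {
    "no_input_timeout", "speech_complete_timeout", "speech_incomplete_timeout",
    "speech_nomatch_timeout", "hotword_min_duration", "hotword_max_duration",
    "recognition_timeout",
}
_FLOAT_HEADERS = {"confidence_threshold", "sensitivity_level"}  # 0 <= v <= 1
_STR_HEADERS = {"speech_language", "logging_tag"}

def build_set_params(channel_id, request_id, **headers):
    """Build a SET-PARAMS command, rejecting out-of-range or unknown headers."""
    for name, value in headers.items():
        if name in _INT_HEADERS and (not isinstance(value, int) or value < 0):
            raise ValueError(f"{name} must be a non-negative int (ms)")
        elif name in _FLOAT_HEADERS and not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
        elif name not in _INT_HEADERS | _FLOAT_HEADERS | _STR_HEADERS:
            # The server would silently ignore this; failing here catches typos.
            raise ValueError(f"unknown header: {name}")
    return json.dumps({
        "command": "SET-PARAMS",
        "request_id": request_id,
        "channel_id": channel_id,
        "headers": headers,
        "body": "",
    })
```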

Success response

  • event: PARAMS-SET

Error responses

  • event: METHOD-FAILED

    • completion_cause: LanguageUnsupported
    • completion_reason contains details
  • event: INVALID-PARAM-VALUE

    • the server was unable to parse the JSON payload (wrong types or syntax error)
    • completion_cause: Error
    • completion_reason: a detailed error message
    • as the command was not decoded, the request_id of this response may be 0 and the channel_id may be empty.

Why two different events? In the first case, the value (a language tag) is valid but not (currently) supported, versus invalid values that will never be correct. For example, "speech_language": "ar-SA" would raise a METHOD-FAILED, but "speech_language": 78.6 would raise an INVALID-PARAM-VALUE.

Get session defaults

Get session default values of the recognition parameters.

  • command GET-PARAMS

Success response

  • event: DEFAULT-PARAMS

Error responses

None, this command never fails.

Define a grammar alias

  • command: DEFINE-GRAMMAR
  • headers:
| Field | Type | Possible Values or unit | Description |
| --- | --- | --- | --- |
| content_id | str | ASCII string | the alias id you want to define (without the session: prefix) |
| content_type | str | "text/uri-list" | the type of the body; only "text/uri-list" is supported for the moment |
  • body: the grammar you want to alias. Examples:
    • builtin:speech/address
    • builtin:speech/keywords?alternatives=facture|commande|compte|conseiller
    • builtin:speech/spelling/mixed?regex=[a-z]{2}[0-9]{9}[a-z]{2}
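As an illustration, the alias definition above can be wrapped in a small helper (the function name is ours; the command, header, and field names follow this section):

```python
import json

def build_define_grammar(channel_id, request_id, content_id, grammar_uri):
    """Alias grammar_uri as session:<content_id> for later RECOGNIZE calls."""
    return json.dumps({
        "command": "DEFINE-GRAMMAR",
        "request_id": request_id,
        "channel_id": channel_id,
        "headers": {
            "content_id": content_id,         # alias without "session:" prefix
            "content_type": "text/uri-list",  # only supported type for now
        },
        "body": grammar_uri,
    })
```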

Success response

  • event: GRAMMAR-DEFINED

Error responses

  • event: METHOD-FAILED

    • completion_cause: GramDefinitionFailure or GramLoadFailure
    • completion_reason: the explanatory error message
  • event: METHOD-NOT-VALID if a recognition process is in progress

  • event: MISSING-PARAM if content_id is missing

Start recognition

  • command: RECOGNIZE
  • headers: the same as SET-PARAMS, except logging_tag, plus:
| Field | Type | Possible Values or unit | Description |
| --- | --- | --- | --- |
| recognition_mode | str | "normal" / "hotword" | the recognition mode |
| start_input_timers | bool | true / false | a value of false tells the recognizer to start recognition but not to start the no-input timer yet (default is false) |
| content_type | str | "text/uri-list" | the type of the body; only "text/uri-list" is supported for the moment |

If the client chooses not to start the timers immediately, it should issue a START-INPUT-TIMERS command later.

  • body (required): multi-line string containing the grammars to use, one per line (order is significant for priorities).

The grammar references may be builtin grammars (builtin:…) or aliases prefixed by session:.

If more than one grammar is given, the server will try to match them all at the same time. The first matching grammar "wins". If several grammars match at once, the one that is earlier in the list has priority.

Future recognition events will reference this request in their request_id.
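As a sketch, a RECOGNIZE request with several grammars in priority order could be assembled like this (the helper name and keyword arguments are ours; the header and field names follow this section):

```python
import json

def build_recognize(channel_id, request_id, grammars, mode="normal",
                    start_input_timers=False, **params):
    """Build a RECOGNIZE command; grammars is a list of URIs, highest priority first."""
    headers = {
        "recognition_mode": mode,
        "start_input_timers": start_input_timers,
        "content_type": "text/uri-list",
        **params,  # per-request overrides of the SET-PARAMS defaults
    }
    return json.dumps({
        "command": "RECOGNIZE",
        "request_id": request_id,
        "channel_id": channel_id,
        "headers": headers,
        "body": "\n".join(grammars),  # one grammar per line, order matters
    })
```

If start_input_timers is left false, remember to issue START-INPUT-TIMERS later.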

Success response

  • event: RECOGNITION-IN-PROGRESS

Error responses

  • event: METHOD-FAILED
    • completion_cause: GramLoadFailure or Error (for ASR errors, or when a recognition request is already in progress)
    • completion_reason: the explanatory error message

Asynchronous Recognition events

"start of input" event

In normal mode only, when the recognition process starts hearing a voice, this event is fired:

  • event: START-OF-INPUT

Outcome event

Some time later, when the recognition completes, a RECOGNITION-COMPLETE event is fired, with a completion_cause header that may be one of:

  • Success: at least one of the grammars matched
  • NoInputTimeout: no voice was heard before no_input_timeout expired
  • NoMatch: in normal mode, what was "heard" matched none of the grammars, or confidence was too low to accept a match
  • NoMatchMaxtime: no match was found before recognition_timeout was reached in normal mode
  • HotwordMaxtime: no match was found before recognition_timeout or hotword_max_duration was reached in hotword mode
  • TooMuchSpeechTimeout: a match was found, but new speech was still matching when recognition_timeout or hotword_max_duration expired (the match is returned)
  • PartialMatch: (normal mode only) only a partial match was found
  • PartialMatchMaxtime: (normal mode only) only a partial match was found, and new speech kept partially matching the grammar until recognition_timeout or hotword_max_duration expired (the partial match is returned)

The completion_reason header may give additional justification.

The body contains the Recognition Result JSON object.
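A client event loop might classify these outcomes like this (the cause names come from the list above; the grouping into match/partial/no-result buckets is our own reading):

```python
import json

# Causes that still carry a usable result body, per the list above.
MATCH_CAUSES = {"Success", "TooMuchSpeechTimeout"}
PARTIAL_CAUSES = {"PartialMatch", "PartialMatchMaxtime"}

def handle_recognition_complete(message: str):
    """Classify a RECOGNITION-COMPLETE event into (kind, result_body)."""
    event = json.loads(message)
    assert event["event"] == "RECOGNITION-COMPLETE"
    cause = event["completion_cause"]
    if cause in MATCH_CAUSES:
        return "match", event["body"]
    if cause in PARTIAL_CAUSES:
        return "partial", event["body"]   # the partial result is returned
    if cause == "NoMatch":
        return "nomatch", event["body"]   # transcript without interpretation
    return "no_result", None              # NoInputTimeout, *Maxtime
```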

Start input timers

  • command: START-INPUT-TIMERS

Please be aware that the time spent between the RECOGNIZE and START-INPUT-TIMERS commands (to allow barge-in, for example) holds a transcription worker for your stream. This time is therefore accounted for on your invoice.

Success response

  • event: INPUT-TIMERS-STARTED

Error responses

None

Stop ongoing recognition

  • command: STOP

Success response

  • event: STOPPED
  • headers:
    • active_request_id: the id of the RECOGNIZE request that was cancelled

Error responses

If there is no recognition process in progress, the command is simply ignored.

Close session

  • command: CLOSE

Success response

  • event: CLOSED

Error responses

  • event: METHOD-NOT-VALID if no session was open.

Recognition result

| Field | Type | Possible Values or unit | Description |
| --- | --- | --- | --- |
| asr | object | Transcript object | the result of the speech-to-text transcription |
| nlu | object | Interpretation object | the interpreted result (what the engine understood) as structured data |
| grammar_uri | str | grammar URI | the grammar that matched (as specified in the RECOGNIZE request) |
| version | str | | the API version |

Depending on the completion_cause, some or all fields may be empty (null or "").

Transcript

| Field | Type | Possible Values or unit | Description |
| --- | --- | --- | --- |
| transcript | str | UTF-8 | the raw ASR transcript |
| confidence | float | 0 ≤ c ≤ 1 | the ASR transcript confidence |
| start | int | Unix timestamp in ms | the transcript start timestamp |
| end | int | Unix timestamp in ms | the transcript end timestamp |

Interpretation

| Field | Type | Possible Values or unit | Description |
| --- | --- | --- | --- |
| type | URI | builtin grammar URI | the actual builtin grammar that matched (once all aliases are resolved, without the query part) |
| value | grammar dependent | grammar dependent | the actual semantic interpretation |
| confidence | float | 0 ≤ c ≤ 1 | the interpretation confidence |

All currently available grammars return a string as value except:

  • the boolean grammar that returns a boolean value;
  • the address grammar that returns an address object:
{
  "type": "builtin:speech/address",
  "value": {
    "number": "37",
    "street": "rue du docteur leroy",
    "zipcode": "72000",
    "city": "le mans"
  },
  "confidence": 0.9
}

In any case, the client should check the type to know how to handle the interpretation, even if it is a plain string.
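As a sketch, a client could branch on the type like this (the builtin:speech/address URI comes from the example above; builtin:speech/boolean is our guess at the boolean grammar's URI, which this document doesn't spell out):

```python
def format_interpretation(nlu: dict) -> str:
    """Render an Interpretation object; the value shape depends on nlu['type']."""
    # The server already strips the query part; the split is just defensive.
    kind = nlu["type"].split("?")[0]
    value = nlu["value"]
    if kind == "builtin:speech/address":
        # The address grammar returns an object.
        return f"{value['number']} {value['street']}, {value['zipcode']} {value['city']}"
    if kind == "builtin:speech/boolean":   # assumed URI for the boolean grammar
        return "yes" if value else "no"
    return str(value)   # all other current grammars return a string
```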

Complete example

The WebSocket connection is assumed to be already established and authorized.

Session initiation

A new user is calling the bot, so the bot opens a new Samosa session and sends:

{
  "command": "OPEN",
  "request_id": 0,
  "channel_id": "test",
  "headers": {
    "custom_id": "blueprint"
  },
  "body": ""
}

The command is successful and the server responds:

{
  "event": "OPENED",
  "request_id": 0,
  "channel_id": "testuie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": ""
}

The client starts streaming. Audio packets are omitted here.

Prepare the grammar (the NLU) to interpret the first user answer

The bot expects a French car plate number. For clarity and reusability, the bot developers define an alias for this, creating a custom grammar called immat out of the builtin builtin:speech/spelling/mixed grammar, configured with a custom regex telling the NLU what a French car plate looks like:

{
  "command": "DEFINE-GRAMMAR",
  "request_id": 1,
  "channel_id": "testuie46e4ui6",
  "headers": {
    "content_id": "immat",
    "content_type": "text/uri-list"
  },
  "body": "builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2})"
}

The command is successful and the server confirms:

{
  "event": "GRAMMAR-DEFINED",
  "request_id": 1,
  "channel_id": "testuie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": ""
}

Set some default values

To keep the recognition command "light", the bot developers set the recognition language and confidence threshold once for the duration of the session:

{
  "command": "SET-PARAMS",
  "request_id": 2,
  "channel_id": "testuie46e4ui6",
  "headers": {
    "speech_language": "fr",
    "confidence_threshold": 0.7
  },
  "body": ""
}

The command is successful and the server responds:

{
  "event": "PARAMS-SET",
  "request_id": 2,
  "channel_id": "testuie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": ""
}

Start the recognition

The bot asks the user to spell their car plate number and then instructs the Samosa session to listen to the user and interpret what they are saying:

{
  "command": "RECOGNIZE",
  "request_id": 3,
  "channel_id": "testuie46e4ui6",
  "headers": {
    "recognition_mode": "normal",
    "no_input_timeout": 5000,
    "recognition_timeout": 30000,
    "speech_complete_timeout": 800,
    "speech_incomplete_timeout": 1500,
    "speech_nomatch_timeout": 3000,
    "content_type": "text/uri-list"
  },
  "body": "session:immat"
}

Instead of session:immat, the bot developers could have decided not to define the alias and to use the builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2}) expression directly here.

The process is successfully started and the server responds:

{
  "event": "RECOGNITION-IN-PROGRESS",
  "request_id": 3,
  "channel_id": "testuie46e4ui6",
  "completion_cause": "Success",
  "completion_reason": null,
  "headers": {},
  "body": ""
}

Receive results

The bot now waits for more events from the server while the user is speaking, hopefully spelling their car plate number among other things (the NLU is quite robust).

As we are using normal mode, as soon as the user's voice is detected, the server sends the START-OF-INPUT event:

{
  "event": "START-OF-INPUT",
  "request_id": 3,
  "channel_id": "testuie46e4ui6",
  "completion_cause": null,
  "completion_reason": null,
  "headers": {},
  "body": ""
}

Some time later, when the user is done speaking, the server returns what it has understood (in this case, a success):

{
  "event": "RECOGNITION-COMPLETE",
  "request_id": 3,
  "channel_id": "testuie46e4ui6",
  "completion_cause": "Success",
  "completion_reason": null,
  "headers": {},
  "body": {
    "asr": {
      "transcript": "attendez alors voilà baissé trois cent cinq f z",
      "confidence": 0.9,
      "start": 1629453934909,
      "end": 1629453944833
    },
    "nlu": {
      "type": "builtin:speech/spelling/mixed",
      "value": "bc305fz",
      "confidence": 0.86
    },
    "grammar_uri": "session:immat",
    "version": "1.25.0"
  }
}
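For reference, the client side of this whole walkthrough can be condensed into a short script that yields the four commands in order (a sketch only: the transport, the audio stream, and the handling of server events are omitted):

```python
import json

def example_dialog(channel_id=""):
    """Yield the client messages from the walkthrough above, in order."""
    regex = "([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2})"
    commands = [
        ("OPEN", {"custom_id": "blueprint"}, ""),
        ("DEFINE-GRAMMAR",
         {"content_id": "immat", "content_type": "text/uri-list"},
         "builtin:speech/spelling/mixed?regex=" + regex),
        ("SET-PARAMS",
         {"speech_language": "fr", "confidence_threshold": 0.7}, ""),
        ("RECOGNIZE",
         {"recognition_mode": "normal", "no_input_timeout": 5000,
          "recognition_timeout": 30000, "content_type": "text/uri-list"},
         "session:immat"),
    ]
    for request_id, (command, headers, body) in enumerate(commands):
        yield json.dumps({
            "command": command,
            "request_id": request_id,
            "channel_id": channel_id,   # fill in once OPENED arrives
            "headers": headers,
            "body": body,
        })
```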