Protocol reference
Protocol flow overview
- WS handshake: the client initiates the connection, passing its authentication token (JWT) in the Authorization HTTP header.
- The server verifies the token and grants access if the token is valid, or refuses the connection otherwise.
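For illustration, here is a minimal connection sketch using Node's ws package (the endpoint URL, token variable, and Bearer scheme are placeholder assumptions, not part of this spec):

import WebSocket from "ws";

// Hypothetical endpoint; replace with the actual service URL.
const url = "wss://example.com/asr";
const jwt = process.env.ASR_TOKEN ?? ""; // the client's JWT

// The token travels in the Authorization HTTP header of the WS handshake.
const ws = new WebSocket(url, {
  headers: { Authorization: `Bearer ${jwt}` },
});

ws.on("open", () => console.log("token accepted, ready to open sessions"));
ws.on("close", (code, reason) => console.log("refused or closed:", code, reason.toString()));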
Once the client is granted access, it can run several successive sessions without disconnecting and reconnecting — quota management is left out of this blueprint as it is a wider concern than just this service.
A session is a context to which attributes (called "session" attributes) can be attached. It is also a unit of work that scopes the "global" or default recognition parameters the client may set, and in which recognition requests are issued. In general, we recommend that a session match a human-bot interview or "conversation" (e.g. a phone call): the session is started at the beginning of the phone call and stopped at hang-up.
Contrary to the conversation-based API, only one speaker's audio is streamed during a session, and the ASR doesn't run continuously. The ASR+NLU is started by a recognition request when the client needs it, and it stops when the expected information is found or when some timer expires. The client describes the information it expects with one or more grammars. A grammar is a kind of preset, identified by a URI, that will set up both our ASR and NLU engines for the task. Some of those presets can be further customized by the client by passing parameters to them in a "query" string.
The client can issue several recognition requests during a session, but at any given time, at most one request may be active: concurrent recognition requests cannot run in parallel. If you want to try several possible interpretations at the same time, use a single request with several grammars; see the recognition request below.
- The client opens a new session, specifying session attributes that will then appear on invoices. Session attributes are useful for the client's accounting purposes.
- Once a session is opened, the client can start streaming audio. No transcription will occur yet (and hence, no fees!).
- While streaming, the client can send different commands (in the same connection) to the server:
  - SET-PARAMS — set default recognition parameters that will be valid for the rest of the session or until another SET-PARAMS command replaces them;
  - DEFINE-GRAMMAR — define a convenient alias for a grammar, valid for the rest of the session (there is no point in redefining an existing alias);
  - RECOGNIZE — start the ASR and interpretation process; the command can temporarily override any parameter previously set with SET-PARAMS; the recognition parameters (including the chosen grammars) define precisely how and when the recognition stops;
  - START-INPUT-TIMERS — among the termination conditions are timeouts; if the RECOGNIZE didn't start the timers already, the client can start them later with this command;
  - STOP — stop the ongoing RECOGNIZE process before it is due to terminate;
  - GET-PARAMS — get the current session parameter values.
- The client can also close the session. It must stop streaming then.
The server responds to each message with a success or error message.
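As a sketch of this request/response cycle (the helper name is ours; field names follow the message format defined below). It is deliberately simplified: asynchronous recognition events reuse the request_id of their RECOGNIZE command, so a real client would keep listening after the first response:

const pending = new Map<number, (event: any) => void>();
let nextRequestId = 0;

function sendCommand(command: string, channelId: string,
                     headers: object = {}, body = ""): Promise<any> {
  const request_id = nextRequestId++;
  ws.send(JSON.stringify({ command, request_id, channel_id: channelId, headers, body }));
  // Resolve when the server's success or error message arrives.
  return new Promise((resolve) => pending.set(request_id, resolve));
}

ws.on("message", (data, isBinary) => {
  if (isBinary) return; // the server only sends text (JSON) messages
  const event = JSON.parse(data.toString());
  pending.get(event.request_id)?.(event);
});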
Any audio packet sent outside of a session is ignored instead of raising an error, because in practice the client won't always be able to perfectly synchronize the audio and the commands (this was noticed in real life with MRCP). Indeed, a typical client architecture involves different threads or subsystems for bot orchestration and audio streaming.
Besides the responses to the requests, the server emits some events as the recognition process progresses:
- START-OF-INPUT — in normal mode (see below), the server emits this event when voice is first detected;
- RECOGNITION-COMPLETE — emitted when the recognition process is complete, whether it succeeded or failed.
Recognition modes
Two recognition modes are proposed: normal mode and hotword mode. They differ mainly in how they match grammars and how they terminate. When a client issues a RECOGNIZE request, it must specify the recognition mode.
The semantics are the same as those of MRCP.
Messages and events
Here are the precise definitions of the different messages and events.
Except for the audio packets, which are sent as binary WS messages, all commands from the client (requests) are sent as WS text messages containing UTF-8 encoded JSON objects of the form:
{
"command": "some-command-name",
"request_id": 0,
"channel_id": "uie46e4ui6",
"headers": {},
"body": "unicode text"
}
where:
Field | Type | Possible Values | Description |
---|---|---|---|
command | str | see above | The command name |
request_id | int | monotonic counter | unique request identifier set by the client for reference by responses or events from the server |
channel_id | str | unique identifier | session identifier set by the server and repeated by the client |
headers | map | command dependent | command specific parameters |
body | text | command dependent | command specific payload |
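For reference, the same shape as a TypeScript type (a sketch derived from the table above):

interface Command {
  command: string;                  // e.g. "OPEN", "SET-PARAMS", "RECOGNIZE"
  request_id: number;               // monotonic counter chosen by the client
  channel_id: string;               // server-assigned session id (see below for OPEN)
  headers: Record<string, unknown>; // command specific parameters
  body: string;                     // command specific payload
}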
The command used to open a session doesn't need to provide a channel_id since it is not set yet. If that command provides a value in that field, it will be used as a prefix for the actual channel_id returned by the server. The channel_id doesn't change during the session.
The channel_id is only useful for debugging for now, as the pair channel_id + request_id uniquely identifies a request (and its follow-up responses) across sessions and clients. In the future, it may be used to multiplex several parallel sessions in the same connection.
All responses to commands, as well as server events, are also WS text messages containing UTF-8 encoded JSON objects of the form:
{
"event": "some-event-name",
"request_id": 0,
"channel_id": "uie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": {}
}
where:
Field | Type | Possible Values | Description |
---|---|---|---|
event | str | see for each command below | event name |
request_id | int | any request_id already received from the client | reference to the request this event responds to |
channel_id | str | unique identifier | session identifier set by the server |
completion_cause | str / null | event dependent | optional: complementary status of event |
completion_reason | str / null | free-form explanation message | optional: complementary explanation of the completion_cause |
headers | map | event dependent | event specific attributes |
body | object | event dependent | event payload |
In the definitions below, we won't repeat the request_id or channel_id, and empty fields are omitted.
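Again as a TypeScript type (a sketch derived from the table above):

interface ServerEvent {
  event: string;                    // e.g. "OPENED", "RECOGNITION-COMPLETE"
  request_id: number;               // the request this event responds to
  channel_id: string;               // session identifier set by the server
  completion_cause: string | null;  // optional complementary status
  completion_reason: string | null; // optional free-form explanation
  headers: Record<string, unknown>; // event specific attributes
  body: unknown;                    // event payload (object or string)
}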
Open a session
- command: OPEN
- headers:
  - custom_id: string, freely set by the client (to identify its own client, for example)
  - audio_codec: string, to set the codec of the streamed audio:
    - "linear": raw PCM 16-bit signed little-endian samples at 8 kHz rate;
    - "g711a": G711 A-law at 8 kHz sample rate;
    - "g711u": G711 μ-law at 8 kHz sample rate.
The custom_id sent in the header is reproduced unchanged on usage reports.
In the future, we may add more session attributes.
Success response
- event: OPENED
- channel_id: string, the identifier given to this session by the server.
Error responses
- event = METHOD-NOT-VALID when a session is already opened (this event has no channel_id)
- event = INVALID-PARAM-VALUE in case of a JSON schema error (this event has no request_id and no channel_id)
- event = METHOD-FAILED for other errors, with:
  - completion_cause = Error
  - completion_reason = the actual error explanation
In case of a METHOD-FAILED response, the channel must still be closed by the client.
Stream audio
Audio samples are sent as WS binary messages. Unless a codec is set when opening a session (in future iterations), the audio must be in raw PCM format (no headers, no attributes, just raw audio): frames must be 16-bit signed little-endian integers sampled at 8 kHz, mono only.
For efficiency on WS, the packets should carry at least 50 milliseconds of audio, i.e. 400 frames or 800 bytes (in linear; 400 bytes in G711), but less than 100 ms to keep latency low. So if you are converting an RTP stream that transmits 10 to 20 ms packets, you'll have to buffer them — or better yet, use the MRCP API instead of this WS API, because WS (TCP only) has more overhead than datagram-based protocols.
The server won't return any acknowledgement upon receiving audio packets.
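For example, a sketch that repackages 20 ms RTP payloads into 60 ms linear-PCM packets, following the arithmetic above (60 ms × 8000 Hz × 2 bytes = 960 bytes; the function name is ours):

const PACKET_BYTES = 960; // 60 ms of linear PCM, inside the 50-100 ms window
let chunks: Buffer[] = [];
let chunkBytes = 0;

function onRtpPayload(pcm: Buffer) { // 20 ms = 320 bytes of linear PCM
  chunks.push(pcm);
  chunkBytes += pcm.length;
  if (chunkBytes >= PACKET_BYTES) {
    const packet = Buffer.concat(chunks);
    // An odd byte count would be a truncated frame and close the session (see below).
    if (packet.length % 2 !== 0) throw new Error("truncated frame");
    ws.send(packet, { binary: true }); // no acknowledgement is expected
    chunks = [];
    chunkBytes = 0;
  }
}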
If you send truncated frames, i.e. an odd number of bytes, the server will close the session:
- event: CLOSED
- completion_cause: Error
- completion_reason: truncated frame in audio packet
Set session defaults
Set session default values for the recognition parameters.
This command is optional. Use it if you don't want to set and repeat the recognition parameters on each individual RECOGNIZE request. It's a matter of taste and client implementation.
- command: SET-PARAMS
- headers are any subset (all optional) of:
Field | Type | Possible Values or unit | Description |
---|---|---|---|
no_input_timeout | int | milliseconds | When recognition is started and there is no speech detected for this period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client with a completion_cause of NoInputTimeout |
speech_complete_timeout | int | (normal mode only) ms | The speech-complete-timeout value applies when the recognizer currently has a complete match against an active grammar, and specifies how long the recognizer MUST wait for more input before declaring a match. |
speech_incomplete_timeout | int | (normal mode only) ms | The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is returned with a completion_cause of PartialMatch . |
speech_nomatch_timeout | int | (normal mode only) ms | The nomatch timeout applies when the speech prior to the silence doesn't match any of the active grammars. In this case, once the timeout is triggered, the transcript speech input is returned without interpretation and with a completion_cause of NoMatch . |
hotword_min_duration | int | (hotword only) ms | It specifies the minimum duration of an utterance that will be considered for hotword recognition |
hotword_max_duration | int | (hotword only) ms | It specifies the maximum duration of an utterance that will be considered for hotword recognition |
recognition_timeout | int | ms | when recognition is started and there is no match for a certain period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client and terminate the recognition operation |
confidence_threshold | float | 0 ≤ t ≤ 1 | this field tells the recognizer resource what confidence level the client considers a successful match |
speech_language | str | RFC 5646 | we only support fr, fr-FR, en, en-US, en-GB for the moment, plus some business domain extensions |
Unknown headers are just ignored without error.
In hotword mode, there is a subtle difference between recognition_timeout and hotword_max_duration: the recognition_timeout timer is reset on each silence, whereas the hotword_max_duration timer is not.
In normal mode, the recognition_timeout timer is not reset on silences!
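The full parameter set as a TypeScript type, mirroring the table above (a sketch; all fields optional, durations in milliseconds):

interface RecognitionParams {
  no_input_timeout?: number;
  speech_complete_timeout?: number;   // normal mode only
  speech_incomplete_timeout?: number; // normal mode only
  speech_nomatch_timeout?: number;    // normal mode only
  hotword_min_duration?: number;      // hotword mode only
  hotword_max_duration?: number;      // hotword mode only
  recognition_timeout?: number;
  confidence_threshold?: number;      // 0 ≤ t ≤ 1
  speech_language?: string;           // RFC 5646 tag, e.g. "fr" or "en-US"
}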
Success response
- event: PARAMS-SET
Error responses
- event: METHOD-FAILED, with:
  - completion_cause: LanguageUnsupported
  - completion_reason contains details
- event: INVALID-PARAM-VALUE when we were unable to parse the JSON payload (wrong types or syntax error), with:
  - completion_cause: Error
  - completion_reason: detailed error message
  - as the command was not decoded, the request_id header of this response may be 0, and the channel_id could be empty
Why two different events? In the first case, the value (a language tag) is valid but not (currently) supported, as opposed to invalid values that will never be correct. For example, "speech_language": "ar-SA" would raise a METHOD-FAILED, but "speech_language": 78.6 would raise an INVALID-PARAM-VALUE.
Get session defaults
Get session default values of the recognition parameters.
- command: GET-PARAMS
Success response
- event: DEFAULT-PARAMS
- headers: all headers from SET-PARAMS at once
Error responses
None, this command never fails.
Define a grammar alias
- command: DEFINE-GRAMMAR
- headers:
Field | Type | Possible Values or unit | Description |
---|---|---|---|
content_id | str | ASCII string | the alias id you want to define (without the session: prefix) |
content_type | str | "text/uri-list" | The type of the body. Only "text/uri-list" is supported for the moment |
- body: the grammar you want to alias. Examples:
builtin:speech/address
builtin:speech/keywords?alternatives=facture|commande|compte|conseiller
builtin:speech/spelling/mixed?regex=[a-z]{2}[0-9]{9}[a-z]{2}
Success response
- event: GRAMMAR-DEFINED
Error responses
- event: METHOD-FAILED, with:
  - completion_cause: GramDefinitionFailure or GramLoadFailure
  - completion_reason: the explanatory error message
- event: METHOD-NOT-VALID if a recognition process is in progress
- event: MISSING-PARAM if content_id is missing
Start recognition
- command: RECOGNIZE
- headers: the same as SET-PARAMS, plus:
Field | Type | Possible Values or unit | Description |
---|---|---|---|
recognition_mode | str | "normal" / "hotword" | the recognition mode |
start_input_timers | bool | true / false | a value of false tells the recognizer to start recognition but not to start the no-input timer yet; default is false |
content_type | str | "text/uri-list" | The type of the body. Only "text/uri-list" is supported for the moment |
If the client chooses not to start the timers immediately, it should issue a START-INPUT-TIMERS command later.
- body (required): multi-line string containing the grammars to use, one per line (order is significant for priorities). The grammar references may be builtin grammars (builtin:…) or aliases prefixed by session:.
If more than one grammar is given, the server will try to match them all at the same time. The first matching grammar "wins". If several grammars match at once, the one that is earlier in the list has priority.
Future recognition events will reference this request in their request_id.
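For instance, a RECOGNIZE that tries a session alias first and falls back to plain transcription, reusing the sendCommand sketch from earlier (the immat alias is the one defined in the complete example below):

declare const channelId: string; // obtained from the OPENED event

const response = await sendCommand("RECOGNIZE", channelId,
  { recognition_mode: "normal", content_type: "text/uri-list" },
  // One grammar per line; session:immat has priority over the fallback.
  "session:immat\nbuiltin:speech/transcribe");
// Expect "RECOGNITION-IN-PROGRESS" here, then asynchronous events later.
console.log(response.event);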
Success response
- event: RECOGNITION-IN-PROGRESS
Error responses
- event: METHOD-FAILED, with:
  - completion_cause: GramLoadFailure or Error (for ASR errors or when a recognition request is already in progress)
  - completion_reason: the explanatory error message
Asynchronous Recognition events
"start of input" event
In normal mode only, when the recognition process starts hearing a voice, this event is fired:
- event: START-OF-INPUT
Outcome event
Some time later, when the recognition completes, a RECOGNITION-COMPLETE event is fired, with a completion_cause header that may be one of:
- Success: at least one of the grammars matched
- NoInputTimeout: no voice was heard before no_input_timeout expired
- NoMatch: in normal mode, what was "heard" contradicts all grammars, or the confidence was too low to accept a match
- NoMatchMaxtime: no match was found before recognition_timeout was reached, in normal mode
- HotwordMaxtime: no match was found before recognition_timeout or hotword_max_duration was reached, in hotword mode
- TooMuchSpeechTimeout: a match was found, but new speech was still matching when recognition_timeout or hotword_max_duration expired (the match is returned)
- PartialMatch: (normal mode only) only a partial match was found
- PartialMatchMaxtime: (normal mode only) only a partial match was found, and new speech continued to partially match the grammar until recognition_timeout or hotword_max_duration expired (the partial match is returned)
The completion_reason header may give additional justification.
The body contains the Recognition Result JSON object.
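A client would typically branch on the completion cause; a sketch (handleResult, handlePartial and reprompt are hypothetical application callbacks):

declare function handleResult(body: unknown): void;  // hypothetical
declare function handlePartial(body: unknown): void; // hypothetical
declare function reprompt(text: string): void;       // hypothetical

function onRecognitionComplete(ev: ServerEvent) {
  switch (ev.completion_cause) {
    case "Success":
    case "TooMuchSpeechTimeout":  // a match is still returned
      return handleResult(ev.body);
    case "PartialMatch":
    case "PartialMatchMaxtime":   // normal mode only, partial result returned
      return handlePartial(ev.body);
    case "NoInputTimeout":
      return reprompt("I didn't hear anything.");
    default:                      // NoMatch, NoMatchMaxtime, HotwordMaxtime, ...
      return reprompt("Sorry, could you say that again?");
  }
}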
Start input timers
- command: START-INPUT-TIMERS
Success response
- event: INPUT-TIMERS-STARTED
Error responses
None
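A typical use is prompt playback: start the recognizer early with start_input_timers: false, and only start the no-input timer once the prompt has finished. A sketch reusing the sendCommand helper (playPrompt is a hypothetical bot function):

declare function playPrompt(text: string): Promise<void>; // hypothetical

await sendCommand("RECOGNIZE", channelId,
  { recognition_mode: "normal", start_input_timers: false, content_type: "text/uri-list" },
  "session:immat");
await playPrompt("Please spell your plate number.");
// Only now does the no_input_timeout clock start ticking.
await sendCommand("START-INPUT-TIMERS", channelId);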
Stop ongoing recognition
- command: STOP
Success response
- event: STOPPED
- headers:
  - active_request_id: the id of the RECOGNIZE request that was cancelled
Error responses
If there is no recognition process in progress, the command is simply ignored.
Close session
- command: CLOSE
Success response
- event: CLOSED
Error responses
- event: METHOD-NOT-VALID if no session was open.
Grammars
We support the following builtin grammars:
- builtin:grammar/none
- builtin:speech/none
- builtin:speech/address
- builtin:speech/address?struct: returns a structured address as XML
- builtin:speech/boolean: matches whether the speaker agrees or not; the interpretation returns "yes" or "no"
- builtin:speech/transcribe
- builtin:speech/text2num
- builtin:speech/spelling/mixed
- builtin:speech/spelling/digits
- builtin:speech/spelling/letters
- builtin:speech/spelling/mixed?regex= + pattern (the interpretation returns the match as a single word) — partial matches are not supported
- builtin:speech/spelling/digits?regex= + pattern (the interpretation returns the match as a single word) — partial matches are not supported
- builtin:speech/spelling/letters?regex= + pattern (the interpretation returns the match as a single word) — partial matches are not supported
- builtin:speech/spelling/mixed?length= + integer (forces interpretation as a single word of the given length)
- builtin:speech/spelling/digits?length= + integer (forces interpretation as a single word of the given length)
- builtin:speech/spelling/letters?length= + integer (forces interpretation as a single word of the given length)
- builtin:speech/spelling/zipcode
- builtin:speech/zipcode (alias to builtin:speech/spelling/zipcode) — beware that this builtin is not universal: it only recognizes 5-digit zipcodes that are splittable as 2+3 digits. We recommend you use builtin:speech/spelling/digits?length= + integer for such applications instead.
- builtin:speech/keywords?alternatives= + <alternatives>, where <alternatives> is a list of words or expressions separated by |. The grammar matches when one of the listed keywords is found, and the returned interpretation is the found keyword.
You'll find more details on the MRCP page related to grammars.
See the MRCP Recognizer documentation. Instead of XML, we return a recognition result object:
Recognition result
Field | Type | Possible Values or unit | Description |
---|---|---|---|
asr | object | Transcript object | The result of the speech to text transcription |
nlu | object | Interpretation object | Interpreted result (what the engine understood) as structured data |
grammar_uri | str | grammar URI | the grammar that matched (as specified in the RECOGNIZE request) |
Depending on the completion_cause, some or all fields may be empty (null or "").
Transcript
Field | Type | Possible Values or unit | Description |
---|---|---|---|
transcript | str | UTF-8 | The raw ASR transcript |
confidence | float | 0 ≤ c ≤ 1 | The ASR transcript confidence |
start | int | unix timestamp in ms | the timestamp of the start of the transcript |
end | int | unix timestamp in ms | the timestamp of the end of the transcript |
Interpretation
Field | Type | Possible Values or unit | Description |
---|---|---|---|
type | URI | builtin grammar URI | The actual builtin grammar that matched (once all aliases are resolved, without the query part) |
value | grammar dependent | grammar dependent | the actual semantic interpretation |
confidence | float | 0 ≤ c ≤ 1 | The interpretation confidence |
All currently available grammars return a string as value except:
- the boolean grammar that returns a boolean value;
- the address grammar that returns an address object:
{
"type": "builtin:speech/address",
"value":
{
"number": "37",
"street": "rue du docteur leroy",
"zipcode": "72000",
"city": "le mans"
},
"confidence": 0.9
}
In any case, the client should check the type to know how to handle the interpretation, even if it is a plain string.
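Dispatching on the interpretation type might look like this (a sketch; the Address shape follows the example above):

interface Address { number: string; street: string; zipcode: string; city: string; }

function handleInterpretation(nlu: { type: string; value: unknown; confidence: number }) {
  switch (nlu.type) {
    case "builtin:speech/boolean":
      return nlu.value as boolean; // agreement, already decoded as a boolean
    case "builtin:speech/address":
      return nlu.value as Address; // structured address object
    default:
      return nlu.value as string;  // all other grammars return a string
  }
}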
Complete example
We assume the connection has already been established and access granted.
Session initiation
A new user is calling the bot, so the bot opens a new Samosa session and sends:
{
"command": "OPEN",
"request_id": 0,
"channel_id": "test",
"headers": {
"custom_id": "blueprint"
},
"body": ""
}
The command is successful and the server responds:
{
"event": "OPENED",
"request_id": 0,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
The client starts streaming. Audio packets are omitted here.
Prepare the grammar (the NLU) to interpret the first user answer
The bot expects a French car plate number. For clarity and re-usability, the bot developers define an alias for this, creating a custom grammar called immat out of the builtin builtin:speech/spelling/mixed grammar, configured with a custom regex telling the NLU what a French car plate looks like:
{
"command": "DEFINE-GRAMMAR",
"request_id": 1,
"channel_id": "testuie46e4ui6",
"headers": {
"content_id": "immat",
"content_type": "text/uri-list"
},
"body": "builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2})"
}
The command is successful and the server confirms:
{
"event": "GRAMMAR-DEFINED",
"request_id": 1,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
Set some default values
To keep the recognition command "light", the bot developers set the recognition language and confidence threshold once for the duration of the session:
{
"command": "SET-PARAMS",
"request_id": 2,
"channel_id": "testuie46e4ui6",
"headers": {
"speech_language": "fr",
"confidence_threshold": 0.7
},
"body": ""
}
The command is successful, the server responds
{
"event": "PARAMS-SET",
"request_id": 2,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
Start the recognition
The bot asks the user to spell their car plate number and then instructs the Samosa session to listen to the user and interpret what they are saying:
{
"command": "RECOGNIZE",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"headers": {
"recognition_mode": "normal",
"no_input_timout": 5000,
"recognition_timeout": 30000,
"speech_complete_timeout": 800,
"speech_incomplete_timeout": 1500,
"speech_nomatch_timeout": 3000,
"content_type": "text/uri-list"
},
"body": "session:immat"
}
Instead of session:immat, the bot devs could have decided not to define the alias and to directly use the builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2}) expression here.
The process is successfully started and the server responds:
{
"event": "RECOGNITION-IN-PROGRESS",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"completion_cause": "Success",
"completion_reason": null,
"headers": {},
"body": ""
}
Receive results
The bot now waits for more events from the server while the user is speaking — hopefully spelling their car plate number, among other things (the NLU is quite robust).
As we are using normal mode, as soon as the user's voice is detected, the server sends the START-OF-INPUT event:
{
"event": "START-OF-INPUT",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
Some time later, when the user is done speaking, the server returns what it has understood (in this case, a success):
{
"event": "RECOGNITION-COMPLETE",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"completion_cause": "Success",
"completion_reason": null,
"headers": {},
"body": {
"asr": {
"transcript": "attendez alors voilà baissé trois cent cinq f z",
"confidence": 0.9,
"start": 1629453934909,
"end": 1629453944833
},
"nlu": {
"type": "builtin:speech/spelling/mixed",
"value": "bc305fz",
"confidence": 0.86
},
"grammar_uri": "session:immat"
}
}