WebSocket for voicebots
All int types are unsigned 64-bit integers.
Messages and events
Here are the precise definitions of the different messages and events.
Except the audio packets that are sent as binary WS messages, all commands from the client (requests) are sent as WS text messages containing UTF-8 encoded JSON objects in the form:
{
"command": "some-command-name",
"request_id": 0,
"channel_id": "uie46e4ui6",
"headers": {},
"body": "unicode text"
}
where:
Field | Type | Possible Values | Description |
---|---|---|---|
command | str | see overview | The command name |
request_id | int | monotonic counter | unique request identifier set by the client for reference by responses or events from the server |
channel_id | str | unique identifier | session identifier set by the server and repeated by the client |
headers | map | command dependent | command specific parameters |
body | text | command dependent | command specific payload |
The command used to open a session doesn't need to provide a channel_id since it is not set yet. If that command provides a value in that field, it will be used as a prefix for the actual channel_id returned by the server. The channel_id doesn't change during the session.
For now, the channel_id is only useful for debugging, as the pair channel_id + request_id uniquely identifies a request (and its follow-up responses) across sessions and clients. In the future, it may be used to multiplex several parallel sessions in the same connection.
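As an illustration of the request envelope described above, here is a minimal Python sketch that builds command messages with a monotonic request_id. The `CommandBuilder` class and its method names are hypothetical helpers, not part of the API; only the JSON field names come from this document.

```python
import itertools
import json
from typing import Dict, Optional

MAX_UINT64 = 2**64 - 1  # this document states that int fields are unsigned 64-bit


class CommandBuilder:
    """Hypothetical helper: serializes request envelopes for WS text messages."""

    def __init__(self, channel_id: str = "") -> None:
        # Empty until the server returns the real channel_id in OPENED.
        self.channel_id = channel_id
        self._counter = itertools.count(0)  # monotonic request_id source

    def build(self, command: str, headers: Optional[Dict] = None, body: str = "") -> str:
        request_id = next(self._counter)
        if request_id > MAX_UINT64:
            raise OverflowError("request_id exceeds the unsigned 64-bit range")
        message = {
            "command": command,
            "request_id": request_id,
            "channel_id": self.channel_id,
            "headers": headers or {},
            "body": body,
        }
        # ensure_ascii=False keeps non-ASCII text readable; the WS library is
        # expected to encode the text message as UTF-8.
        return json.dumps(message, ensure_ascii=False)
```

After the OPENED event arrives, the caller would copy the server-provided identifier into `builder.channel_id` so that subsequent commands repeat it.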
All responses to commands or events are also text WS messages containing UTF-8 encoded JSON objects in the form:
{
"event": "some-event-name",
"request_id": 0,
"channel_id": "uie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": {}
}
where:
Field | Type | Possible Values | Description |
---|---|---|---|
event | str | see for each command below | event name |
request_id | int | any request_id already received from the client | reference to the request this event responds to |
channel_id | str | unique identifier | session identifier set by the server |
completion_cause | str / null | event dependent | optional: complementary status of event |
completion_reason | str / null | free-form explanation message | |
headers | map | event dependent | event specific attributes |
body | object | event dependent | event payload |
In the definitions below, we won't repeat the request_id or channel_id, and empty fields are omitted.
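On the receiving side, a client could decode each text message into a small typed record before dispatching on the event name. The sketch below assumes nothing beyond the field names listed in the table above; `ServerEvent` and `parse_event` are hypothetical names. Missing fields are tolerated because some error events are documented as lacking a request_id or channel_id.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ServerEvent:
    """Hypothetical client-side view of a response or event envelope."""
    event: str
    request_id: Optional[int] = None
    channel_id: Optional[str] = None
    completion_cause: Optional[str] = None
    completion_reason: Optional[str] = None
    headers: Dict[str, Any] = field(default_factory=dict)
    body: Any = None


def parse_event(text: str) -> ServerEvent:
    """Decode one WS text message into a ServerEvent."""
    raw = json.loads(text)
    return ServerEvent(
        event=raw["event"],  # the only field this sketch treats as mandatory
        request_id=raw.get("request_id"),
        channel_id=raw.get("channel_id"),
        completion_cause=raw.get("completion_cause"),
        completion_reason=raw.get("completion_reason"),
        headers=raw.get("headers") or {},
        body=raw.get("body"),
    )
```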
Open a session
- command: OPEN
- headers:
  - custom_id: string, optional, freely set by the client (to identify its own customer, for example)
  - session_id: string, optional, freely set by the client to identify the session on its side, and that will be visible in the server logs and in the Dev Console.
  - audio_codec: string, to set the codec of the streamed audio:
    - "linear": raw PCM 16-bit signed little-endian samples at 8 kHz rate;
    - "g711a": G711 a-law at 8 kHz sample rate;
    - "g711u": G711 μ-law at 8 kHz sample rate.
The custom_id sent in the header is reproduced unchanged on usage reports and invoices.
Success response
- event: OPENED
- channel_id: string, the identifier given to this session by the server.
Error responses
- event = METHOD-NOT-VALID when a session is already opened (this event has no channel_id)
- event = INVALID-PARAM-VALUE in case of JSON schema error (this event has no request_id, no channel_id)
- event = METHOD-FAILED for other errors with:
  - completion_cause = Error
  - completion_reason = the actual error explanation
In case of a METHOD-FAILED response, the channel must still be closed by the client.
All opened sessions must be closed properly.
Close a session
- command: CLOSE
Response
- event: CLOSED
After a session is closed, you can either start another session in the same websocket connection or close the websocket connection.
If you choose to close the websocket connection, the client should initiate the websocket close handshake and wait for the server to close the connection.
Stream audio
Audio samples are sent as WS binary messages. Unless a codec is set when opening a session (in future iterations), the audio must be in raw PCM format (no headers, no attributes, just raw audio): frames must be 16-bit signed little-endian integers sampled at 8 kHz, mono only.
For efficiency on WS, each audio packet should be at least 50 milliseconds long, i.e. 400 frames (800 bytes in linear, 400 bytes in G711), but less than 100 ms to keep latency low. So if you are converting an RTP stream which transmits 10 to 20 ms packets, you'll have to buffer them, or better yet, use the MRCP API instead of this WS API, because WS (TCP only) has more overhead than datagram-based protocols.
The server won't return any acknowledgement upon receiving audio packets.
If you send truncated frames, i.e. an odd number of bytes, the server will close the session:
- event: CLOSED
- completion_cause: Error
- completion_reason: truncated frame in audio packet
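The packet sizing rules above can be sketched as a small chunking helper. This is only an illustration: `chunk_audio` and its parameters are hypothetical names, and the 60 ms default is an arbitrary choice inside the recommended 50 to 100 ms window. The helper rejects input that would end with a truncated frame, since the server closes the session in that case.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 8000  # all codecs listed for OPEN use an 8 kHz rate

# Bytes per frame (one mono sample) for each audio_codec value.
BYTES_PER_FRAME = {"linear": 2, "g711a": 1, "g711u": 1}


def chunk_audio(audio: bytes, codec: str = "linear", packet_ms: int = 60) -> Iterator[bytes]:
    """Split a raw audio buffer into packets suitable for WS binary messages."""
    if not 50 <= packet_ms < 100:
        raise ValueError("packet_ms should be at least 50 and less than 100")
    frame_size = BYTES_PER_FRAME[codec]
    if len(audio) % frame_size:
        raise ValueError("audio buffer ends with a truncated frame")
    frames_per_packet = SAMPLE_RATE_HZ * packet_ms // 1000
    packet_bytes = frames_per_packet * frame_size
    for offset in range(0, len(audio), packet_bytes):
        # The final packet may be shorter than packet_ms but stays frame-aligned.
        yield audio[offset:offset + packet_bytes]
```

With `packet_ms=50`, linear audio is cut into 800-byte packets and G711 audio into 400-byte packets, matching the figures given above.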
Set session defaults
Set session default values for the recognition parameters.
This command is optional. Use it if you don't want to set and repeat the recognition parameters on each individual RECOGNIZE request. It's a matter of taste and client implementation.
- command: SET-PARAMS
- headers are any subset (all optional) of:
Field | Type | Possible Values or unit | Description |
---|---|---|---|
no_input_timeout | int | milliseconds | When recognition is started and there is no speech detected for this period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client with a completion_cause of NoInputTimeout |
speech_complete_timeout | int | (normal mode only) ms | The speech-complete-timeout value applies when the recognizer currently has a complete match against an active grammar, and specifies how long the recognizer MUST wait for more input before declaring a match. |
speech_incomplete_timeout | int | (normal mode only) ms | The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is returned with a completion_cause of PartialMatch . |
speech_nomatch_timeout | int | (normal mode only) ms | The nomatch timeout applies when the speech prior to the silence doesn't match any of the active grammars. In this case, once the timeout is triggered, the transcript speech input is returned without interpretation and with a completion_cause of NoMatch . |
hotword_min_duration | int | (hotword only) ms | It specifies the minimum duration of an utterance that will be considered for hotword recognition |
hotword_max_duration | int | (hotword only) ms | It specifies the maximum duration of an utterance that will be considered for hotword recognition |
recognition_timeout | int | ms | when recognition is started and there is no match for a certain period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client and terminate the recognition operation |
confidence_threshold | float | 0 ≤ l ≤ 1 | this field tells the recognizer resource what confidence level the client considers a successful match (default is 0.5) |
sensitivity_level | float | 0 ≤ l ≤ 1 | this field sets the Voice Activity Detection (VAD) sensitivity (default is 0.5) |
speech_language | str | RFC 5646 | we only support fr, fr-FR, en, en-US, en-GB for the moment, plus some business domain extensions |
logging_tag | str | any | This string will be used to tag all the interactions following the SET-PARAMS command, so they can be tracked in the logs and Dev Console |
Unknown headers are just ignored without error.
In hotword mode, there is a subtle difference between recognition_timeout and hotword_max_duration: the recognition_timeout timer is reset on each silence whereas the hotword_max_duration timer is not.
In normal mode, the recognition_timeout timer is not reset on silences!
Success response
- event: PARAMS-SET
Error responses
- event: METHOD-FAILED
  - completion_cause: LanguageUnsupported
  - completion_reason contains details
- event: INVALID-PARAM-VALUE
  - we were unable to parse the JSON payload (wrong types or syntax error)
  - completion_cause: Error
  - completion_reason: detailed error message
  - as the command was not decoded, the request_id field of this response may be 0, and the channel_id could be empty.
Why two different events? In the first case, the value (language tag) is valid but not (currently) supported, whereas in the second case the value is invalid and will never be correct. For example, "speech_language": "ar-SA" would raise a METHOD-FAILED but "speech_language": 78.6 would raise an INVALID-PARAM-VALUE.
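Since type errors can never succeed, a client may want to catch them before sending. The following sketch checks header types against the SET-PARAMS table; it is a hypothetical client-side helper (`find_type_errors` is not part of the API) and it deliberately does not check whether a language is supported, because only the server can decide that.

```python
from typing import Any, Dict, List

INT_HEADERS = {
    "no_input_timeout", "speech_complete_timeout", "speech_incomplete_timeout",
    "speech_nomatch_timeout", "hotword_min_duration", "hotword_max_duration",
    "recognition_timeout",
}
UNIT_FLOAT_HEADERS = {"confidence_threshold", "sensitivity_level"}
STR_HEADERS = {"speech_language", "logging_tag"}


def _is_number(value: Any, kinds) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, kinds) and not isinstance(value, bool)


def find_type_errors(headers: Dict[str, Any]) -> List[str]:
    """Return one message per header whose type or range looks invalid."""
    problems = []
    for name, value in headers.items():
        if name in INT_HEADERS:
            if not _is_number(value, int) or value < 0:
                problems.append(f"{name}: expected a non-negative integer")
        elif name in UNIT_FLOAT_HEADERS:
            if not _is_number(value, (int, float)) or not 0 <= value <= 1:
                problems.append(f"{name}: expected a number between 0 and 1")
        elif name in STR_HEADERS:
            if not isinstance(value, str):
                problems.append(f"{name}: expected a string")
        # Unknown headers are skipped: the server ignores them without error.
    return problems
```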
Get session defaults
Get session default values of the recognition parameters.
- command: GET-PARAMS
Success response
- event: DEFAULT-PARAMS
- headers: all headers from SET-PARAMS at once
Error responses
None, this command never fails.
Define a grammar alias
- command: DEFINE-GRAMMAR
- headers:
Field | Type | Possible Values or unit | Description |
---|---|---|---|
content_id | str | ASCII string | the alias id you want to define (without the session: prefix) |
content_type | str | "text/uri-list" | The type of the body. Only "text/uri-list" is supported for the moment |
- body: the grammar you want to alias. Examples:
builtin:speech/address
builtin:speech/keywords?alternatives=facture|commande|compte|conseiller
builtin:speech/spelling/mixed?regex=[a-z]{2}[0-9]{9}[a-z]{2}
Success response
- event: GRAMMAR-DEFINED
Error responses
- event: METHOD_FAILED
  - completion_cause: GramDefinitionFailure or GramLoadFailure
  - completion_reason: the explanatory error message
- event: METHOD_NOT_VALID if a recognition process is in progress
- event: MISSING_PARAM if content_id is missing
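To make the alias mechanism concrete, here is a hedged sketch that assembles the headers and body of a DEFINE-GRAMMAR command and the reference later used in RECOGNIZE. The function names are hypothetical. The grammar URI is concatenated without percent-encoding because the examples above show unencoded characters; whether the server also accepts encoded URIs is not stated here.

```python
from typing import Dict, Iterable, Tuple


def keywords_grammar(alternatives: Iterable[str]) -> str:
    """Build a builtin keywords grammar URI from a list of alternatives."""
    words = list(alternatives)
    if not words:
        raise ValueError("at least one alternative is required")
    return "builtin:speech/keywords?alternatives=" + "|".join(words)


def define_grammar_parts(alias: str, grammar_uri: str) -> Tuple[Dict[str, str], str]:
    """Return the (headers, body) pair for a DEFINE-GRAMMAR command."""
    if not alias or not alias.isascii():
        raise ValueError("content_id must be a non-empty ASCII string")
    if alias.startswith("session:"):
        raise ValueError("content_id is given without the session: prefix")
    headers = {"content_id": alias, "content_type": "text/uri-list"}
    return headers, grammar_uri


def alias_reference(alias: str) -> str:
    """Return the form used to reference the alias in a RECOGNIZE body."""
    return "session:" + alias
```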
Start recognition
- command: RECOGNIZE
- headers: the same as SET-PARAMS, except logging_tag, plus:
Field | Type | Possible Values or unit | Description |
---|---|---|---|
recognition_mode | str | "normal" / "hotword" | the recognition mode |
start_input_timers | bool | true / false | A value of false tells the recognizer to start recognition but not to start the no-input timer yet. Default is false. |
content_type | str | "text/uri-list" | The type of the body. Only "text/uri-list" is supported for the moment |
If the client chooses not to start the timers immediately, it should issue a START-INPUT-TIMERS command later.
- body (required): multi-line string, contains the grammars to use, one per line (order is significant for priorities).
The grammar references may be builtin grammars (builtin:…) or aliases prefixed by session:.
If more than one grammar is given, the server will try to match them all at the same time. The first matching grammar "wins". If several grammars match at once, the one that is earlier in the list has priority.
Future recognition events will reference this request in their request_id.
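As a small illustration of the body format, the sketch below joins grammar references in priority order and, once a result comes back, looks up the priority rank of the grammar that matched. Both function names are hypothetical; the lookup relies on the result's grammar_uri repeating the reference as it was written in the RECOGNIZE request.

```python
from typing import Sequence


def build_recognize_body(grammars: Sequence[str]) -> str:
    """Join grammar references, one per line, highest priority first."""
    if not grammars:
        raise ValueError("RECOGNIZE requires at least one grammar")
    for uri in grammars:
        if not uri.startswith(("builtin:", "session:")):
            raise ValueError(f"not a builtin grammar or session alias: {uri!r}")
        if "\n" in uri:
            raise ValueError("a grammar reference cannot span several lines")
    return "\n".join(grammars)


def priority_of(grammars: Sequence[str], grammar_uri: str) -> int:
    """Return the rank (0 = highest priority) of the grammar that matched."""
    return list(grammars).index(grammar_uri)
```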
Success response
- event: RECOGNITION-IN-PROGRESS
Error responses
- event: METHOD_FAILED
  - completion_cause: GramLoadFailure or Error (for ASR errors or when a recognition request is already in progress)
  - completion_reason: the explanatory error message
Asynchronous Recognition events
"start of input" event
In normal mode only, when the recognition process starts hearing a voice, this event is fired:
- event: START-OF-INPUT
Outcome event
Some time later, when the recognition completes, a RECOGNITION-COMPLETE event is fired, with a completion_cause field that may be one of:
- Success: at least one of the grammars matched
- NoInputTimeout: no voice was heard until no_input_timeout expired
- NoMatch: in normal mode, what was "heard" contradicts all grammars or confidence was too low to accept the match
- NoMatchMaxtime: no match was found before recognition_timeout was reached in normal mode
- HotwordMaxtime: no match was found before recognition_timeout or hotword_max_duration was reached in hotword mode
- TooMuchSpeechTimeout: a match was found, but new speech was still matching when recognition_timeout or hotword_max_duration expired (the match is returned)
- PartialMatch: (normal mode only) only a partial match was found
- PartialMatchMaxtime: (normal mode only) only a partial match was found and new speech continued to partially match the grammar until recognition_timeout or hotword_max_duration expired (the partial match is returned)
The completion_reason field may give additional justification.
The body contains the Recognition Result JSON object.
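A client typically needs to decide, from the completion_cause alone, whether the body carries a full match, a partial match, or nothing usable. The grouping below is one possible reading of the list above, not an official classification; `classify_outcome` and the three category names are hypothetical.

```python
# Causes for which a complete match is returned.
FULL_MATCH = {"Success", "TooMuchSpeechTimeout"}
# Causes for which only a partial match is returned (normal mode only).
PARTIAL_MATCH = {"PartialMatch", "PartialMatchMaxtime"}
# Causes without any interpretation. Note that NoMatch may still carry the
# raw transcript, according to the speech_nomatch_timeout description.
NO_MATCH = {"NoInputTimeout", "NoMatch", "NoMatchMaxtime", "HotwordMaxtime"}


def classify_outcome(completion_cause: str) -> str:
    """Map a RECOGNITION-COMPLETE completion_cause to 'match', 'partial' or 'none'."""
    if completion_cause in FULL_MATCH:
        return "match"
    if completion_cause in PARTIAL_MATCH:
        return "partial"
    if completion_cause in NO_MATCH:
        return "none"
    # Fail loudly on causes this sketch does not know about.
    raise ValueError(f"unexpected completion_cause: {completion_cause!r}")
```

A voicebot could, for instance, re-prompt the user on "none", ask for confirmation on "partial", and proceed directly on "match".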
Start input timers
- command: START-INPUT-TIMERS
Please be aware that time spent between RECOGNIZE and START-INPUT-TIMERS commands, to allow barge-in for example, holds a transcription worker for your stream. This time is therefore accounted for on your invoice.
Success response
- event: INPUT-TIMERS-STARTED
Error responses
None
Stop ongoing recognition
- command: STOP
Success response
- event: STOPPED
- headers:
  - active_request_id: the id of the RECOGNIZE request that was cancelled
Error responses
If there is no recognition process in progress, the command is simply ignored.
Close session
- command: CLOSE
Success response
- event: CLOSED
Error responses
- event: METHOD_INVALID if no session was open.
Recognition result
Field | Type | Possible Values or unit | Description |
---|---|---|---|
asr | object | Transcript object | The result of the speech to text transcription |
nlu | object | Interpretation object | Interpreted result (what the engine understood) as structured data |
grammar_uri | str | grammar URI | the grammar that matched (as specified in the RECOGNIZE request) |
version | str | | API version |
Depending on the completion_cause, some or all fields may be empty (null or "").
Transcript
Field | Type | Possible Values or unit | Description |
---|---|---|---|
transcript | str | UTF-8 | The raw ASR transcript |
confidence | float | 0 ≤ c ≤ 1 | The ASR transcript confidence |
start | int | unix timestamp in ms | the start of the transcript timestamp |
end | int | unix timestamp in ms | the end of the transcript timestamp |
Interpretation
Field | Type | Possible Values or unit | Description |
---|---|---|---|
type | URI | builtin Grammar URI | The actual builtin grammar that matched (once all aliases were resolved, without the query part) |
value | grammar dependent | grammar dependent | the actual semantic interpretation |
confidence | float | 0 ≤ c ≤ 1 | The interpretation confidence |
All currently available grammars return a string as value except:
- the boolean grammar that returns a boolean value;
- the address grammar that returns an address object:
{
"type": "builtin:speech/address",
"value":
{
"number": "37",
"street": "rue du docteur leroy",
"zipcode": "72000",
"city": "le mans"
},
"confidence": 0.9
}
In any case, the client should check the type to know how to handle the interpretation, even if it is a plain string.
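The type check recommended above might look like the following sketch. `describe_interpretation` is a hypothetical helper. The address grammar URI is taken from the example above; the URI of the boolean grammar is not given in this document, so the sketch falls back on the Python type of the value for that case.

```python
from typing import Any, Dict


def describe_interpretation(nlu: Dict[str, Any]) -> str:
    """Turn an Interpretation object into a display string, based on its type."""
    kind = nlu["type"]
    value = nlu["value"]
    if kind == "builtin:speech/address":
        # Address objects carry number, street, zipcode and city fields.
        return "{number} {street}, {zipcode} {city}".format(**value)
    if isinstance(value, bool):
        # Assumption: the boolean grammar is detected by its value type,
        # because its URI is not documented here.
        return "yes" if value else "no"
    # All other currently available grammars return a plain string.
    return str(value)
```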
Complete example
The connection is already granted.
Session initiation
A new user is calling the bot, so the bot opens a new Samosa session and sends:
{
"command": "OPEN",
"request_id": 0,
"channel_id": "test",
"headers": {
"custom_id": "blueprint"
},
"body": ""
}
The command is successful and the server responds:
{
"event": "OPENED",
"request_id": 0,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
The client starts streaming. Audio packets are omitted here.
Prepare the grammar (the NLU) to interpret the first user answer
The bot expects a French car plate number. For clarity and re-usability, the bot developers define an alias for this, creating a custom grammar, called immat, out of the builtin builtin:speech/spelling/mixed grammar configured with a custom regex telling the NLU what a French car plate looks like:
{
"command": "DEFINE-GRAMMAR",
"request_id": 1,
"channel_id": "testuie46e4ui6",
"headers": {
"content_id": "immat",
"content_type": "text/uri-list"
},
"body": "builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2})"
}
The command is successful and the server confirms:
{
"event": "GRAMMAR-DEFINED",
"request_id": 1,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
Set some default values
To keep the recognition command "light", the bot developers set the recognition language and confidence threshold once for the duration of the session:
{
"command": "SET-PARAMS",
"request_id": 2,
"channel_id": "testuie46e4ui6",
"headers": {
"speech_language": "fr",
"confidence_threshold": 0.7
},
"body": ""
}
The command is successful and the server responds:
{
"event": "PARAMS-SET",
"request_id": 2,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
Start the recognition
The bot asks the user to spell their car plate number and then instructs the Samosa session to listen to the user and interpret what they are saying:
{
"command": "RECOGNIZE",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"headers": {
"recognition_mode": "normal",
"no_input_timeout": 5000,
"recognition_timeout": 30000,
"speech_complete_timeout": 800,
"speech_incomplete_timeout": 1500,
"speech_nomatch_timeout": 3000,
"content_type": "text/uri-list"
},
"body": "session:immat"
}
Instead of session:immat, the bot devs could have decided not to define the alias and to directly use the builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2}) expression here.
The process is successfully started and the server responds:
{
"event": "RECOGNITION-IN-PROGRESS",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"completion_cause": "Success",
"completion_reason": null,
"headers": {},
"body": ""
}
Receive results
The bot now waits for more events from the server while the user is speaking, hopefully spelling their car plate number, among other things (the NLU is quite robust).
As we are using normal mode, as soon as the user's voice is detected, the server sends the START-OF-INPUT event:
{
"event": "START-OF-INPUT",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
Some time later, when the user is done speaking, the server returns what it has understood (in this case, a success):
{
"event": "RECOGNITION-COMPLETE",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"completion_cause": "Success",
"completion_reason": null,
"headers": {},
"body": {
"asr": {
"transcript": "attendez alors voilà baissé trois cent cinq f z",
"confidence": 0.9,
"start": 1629453934909,
"end": 1629453944833
},
"nlu": {
"type": "builtin:speech/spelling/mixed",
"value": "bc305fz",
"confidence": 0.86
},
"grammar_uri": "session:immat",
"version": "1.25.0"
}
}
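To close the loop, here is a sketch of how the bot could consume this final event. The helper names are hypothetical, and the extra confidence and regex checks are defensive assumptions on the client side: the server already applies the confidence threshold and the grammar regex.

```python
import re
from typing import Any, Dict, Optional

# Same pattern as the one passed to the immat grammar in this example.
PLATE_RE = re.compile(r"([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2})")


def extract_plate(event: Dict[str, Any], min_confidence: float = 0.7) -> Optional[str]:
    """Return the recognized plate, or None if the event is not a usable match."""
    if event.get("event") != "RECOGNITION-COMPLETE":
        return None
    if event.get("completion_cause") != "Success":
        return None
    nlu = event["body"]["nlu"]
    # Check the resolved builtin type, as recommended in the Interpretation section.
    if nlu["type"] != "builtin:speech/spelling/mixed":
        return None
    if nlu["confidence"] < min_confidence:
        return None
    value = nlu["value"]
    return value if PLATE_RE.fullmatch(value) else None


def speech_duration_ms(event: Dict[str, Any]) -> int:
    """Return the length of the transcribed speech, from the asr timestamps."""
    asr = event["body"]["asr"]
    return asr["end"] - asr["start"]
```

Applied to the event above, `extract_plate` yields "bc305fz" and `speech_duration_ms` yields 9924, i.e. a little under ten seconds of speech.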