Protocol reference
Protocol flow overview
- WS handshake: the client initiates the connection, passing its authentication token (JWT) in the Authorization HTTP header.
- The server verifies the token and grants access if the token is valid, or refuses the connection otherwise.
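For illustration, here is a minimal connection sketch using Node's ws package (the endpoint URL, token variable, and Bearer scheme are placeholder assumptions, not part of this spec):

import WebSocket from "ws";

// Hypothetical endpoint; replace with the actual service URL.
const url = "wss://example.com/asr";
const jwt = process.env.ASR_TOKEN ?? ""; // the client's JWT

// The token travels in the Authorization HTTP header of the WS handshake.
const ws = new WebSocket(url, {
  headers: { Authorization: `Bearer ${jwt}` },
});

ws.on("open", () => console.log("token accepted, ready to open sessions"));
ws.on("close", (code, reason) => console.log("refused or closed:", code, reason.toString()));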
Once the client is granted access, it can run several successive sessions without disconnecting and reconnecting — quota management is left out of this blueprint as it is a wider concern than just this service.
A session is a context to which attributes (called "session" attributes) can be attached. It is also a unit of work that scopes the "global" or default recognition parameters the client may set, and in which recognition requests are issued. In general, we recommend that a session match a human-bot interview or "conversation" (e.g. a phone call): the session is started at the beginning of the phone call and stopped at hang-up.
Contrary to the conversation-based API, only one speaker's audio is streamed during a session, and the ASR doesn't run continuously. The ASR+NLU is started by a recognition request when the client needs it, and it stops when the expected information is found or when some timer expires. The client describes the information it expects with one or more grammars. A grammar is a kind of preset, identified by a URI, that will set up both our ASR and NLU engines for the task. Some of those presets can be further customized by the client by passing parameters to them in a "query" string.
The client can issue several recognition requests during a session, but at any given time, at most one request may be active: concurrent recognition requests cannot run in parallel. If you want to try several possible interpretations at the same time, use a single request with several grammars; see the recognition request below.
- The client opens a new session, specifying session attributes that will then appear on invoices. Session attributes are useful for the client's accounting purposes.
- Once a session is opened, the client can start streaming audio. No transcription will occur yet (and hence, no fees!).
- While streaming, the client can send different commands (in the same connection) to the server:
  - SET-PARAMS — set default recognition parameters that will be valid for the rest of the session or until another SET-PARAMS command replaces them;
  - DEFINE-GRAMMAR — define a convenient alias for a grammar, valid for the rest of the session (there is no point in redefining an existing alias);
  - RECOGNIZE — start the ASR and interpretation process; the command can temporarily override any parameter previously set with SET-PARAMS; the recognition parameters (including the chosen grammars) define precisely how and when the recognition stops;
  - START-INPUT-TIMERS — among the termination conditions are timeouts; if the RECOGNIZE didn't start the timers already, the client can start them later with this command;
  - STOP — stop the ongoing RECOGNIZE process before it is due to terminate;
  - GET-PARAMS — get the current session parameter values.
- The client can also close the session. It must stop streaming then.
The server responds to each message with a success or error message.
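As a sketch of this request/response cycle (the helper name is ours; field names follow the message format defined below). It is deliberately simplified: asynchronous recognition events reuse the request_id of their RECOGNIZE command, so a real client would keep listening after the first response:

const pending = new Map<number, (event: any) => void>();
let nextRequestId = 0;

function sendCommand(command: string, channelId: string,
                     headers: object = {}, body = ""): Promise<any> {
  const request_id = nextRequestId++;
  ws.send(JSON.stringify({ command, request_id, channel_id: channelId, headers, body }));
  // Resolve when the server's success or error message arrives.
  return new Promise((resolve) => pending.set(request_id, resolve));
}

ws.on("message", (data, isBinary) => {
  if (isBinary) return; // the server only sends text (JSON) messages
  const event = JSON.parse(data.toString());
  pending.get(event.request_id)?.(event);
});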
Any audio packet sent outside of a session is ignored instead of raising an error, because in practice the client won't always be able to perfectly synchronize the audio and the commands (this was noticed in real life with MRCP). Indeed, a typical client architecture involves different threads or subsystems for bot orchestration and audio streaming.
Besides the responses to the requests, the server emits some events as the recognition process progresses:
- START-OF-INPUT — in normal mode (see below), the server emits this event when voice is first detected;
- RECOGNITION-COMPLETE — emitted when the recognition process is complete, whether it succeeded or failed.
Recognition modes
Two recognition modes are proposed: normal mode and hotword mode. They differ mainly in how they match grammars and how they terminate. When a client issues a RECOGNIZE request, it must specify the recognition mode.
The semantics are the same as those of MRCP.
Messages and events
Here are the precise definitions of the different messages and events.
Except for the audio packets, which are sent as binary WS messages, all commands from the client (requests) are sent as WS text messages containing UTF-8 encoded JSON objects of the form:
{
"command": "some-command-name",
"request_id": 0,
"channel_id": "uie46e4ui6",
"headers": {},
"body": "unicode text"
}
where:
Field | Type | Possible Values | Description |
---|---|---|---|
command | str | see above | The command name |
request_id | int | monotonic counter | unique request identifier set by the client for reference by responses or events from the server |
channel_id | str | unique identifier | session identifier set by the server and repeated by the client |
headers | map | command dependent | command specific parameters |
body | text | command dependent | command specific payload |
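For reference, the same shape as a TypeScript type (a sketch derived from the table above):

interface Command {
  command: string;                  // e.g. "OPEN", "SET-PARAMS", "RECOGNIZE"
  request_id: number;               // monotonic counter chosen by the client
  channel_id: string;               // server-assigned session id (see below for OPEN)
  headers: Record<string, unknown>; // command specific parameters
  body: string;                     // command specific payload
}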
The command used to open a session doesn't need to provide a channel_id since it is not set yet. If that command provides a value in that field, it will be used as a prefix for the actual channel_id returned by the server. The channel_id doesn't change during the session.
The channel_id is only useful for debugging for now, as the pair channel_id + request_id uniquely identifies a request (and its follow-up responses) across sessions and clients. In the future, it may be used to multiplex several parallel sessions in the same connection.
All responses to commands, as well as server events, are also WS text messages containing UTF-8 encoded JSON objects of the form:
{
"event": "some-event-name",
"request_id": 0,
"channel_id": "uie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": {}
}
where:
Field | Type | Possible Values | Description |
---|---|---|---|
event | str | see for each command below | event name |
request_id | int | any request_id already received from the client | reference to the request this event responds to |
channel_id | str | unique identifier | session identifier set by the server |
completion_cause | str / null | event dependent | optional: complementary status of event |
completion_reason | str / null | free-form explanation message | optional: complementary explanation of the completion_cause |
headers | map | event dependent | event specific attributes |
body | object | event dependent | event payload |
In the definitions below, we won't repeat the request_id or channel_id, and empty fields are omitted.
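Again as a TypeScript type (a sketch derived from the table above):

interface ServerEvent {
  event: string;                    // e.g. "OPENED", "RECOGNITION-COMPLETE"
  request_id: number;               // the request this event responds to
  channel_id: string;               // session identifier set by the server
  completion_cause: string | null;  // optional complementary status
  completion_reason: string | null; // optional free-form explanation
  headers: Record<string, unknown>; // event specific attributes
  body: unknown;                    // event payload (object or string)
}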
Open a session
- command: OPEN
- headers:
  - custom_id: string, freely set by the client (to identify its own client, for example)
  - audio_codec: string, to set the codec of the streamed audio:
    - "linear": raw PCM 16-bit signed little-endian samples at 8 kHz rate;
    - "g711a": G711 A-law at 8 kHz sample rate;
    - "g711u": G711 μ-law at 8 kHz sample rate.
The custom_id sent in the header is reproduced unchanged on usage reports.
In the future, we may add more session attributes.
Success response
- event: OPENED
- channel_id: string, the identifier given to this session by the server.
Error responses
- event = METHOD-NOT-VALID when a session is already opened (this event has no channel_id)
- event = INVALID-PARAM-VALUE in case of a JSON schema error (this event has no request_id and no channel_id)
- event = METHOD-FAILED for other errors, with:
  - completion_cause = Error
  - completion_reason = the actual error explanation
In case of a METHOD-FAILED response, the channel must still be closed by the client.
Stream audio
Audio samples are sent as WS binary messages. Unless a codec is set when opening a session (in future iterations), the audio must be in raw PCM format (no headers, no attributes, just raw audio): frames must be 16-bit signed little-endian integers sampled at 8 kHz, mono only.
For efficiency on WS, the packets should carry at least 50 milliseconds of audio, i.e. 400 frames or 800 bytes (in linear; 400 bytes in G711), but less than 100 ms to keep latency low. So if you are converting an RTP stream that transmits 10 to 20 ms packets, you'll have to buffer them — or better yet, use the MRCP API instead of this WS API, because WS (TCP only) has more overhead than datagram-based protocols.
The server won't return any acknowledgement upon receiving audio packets.
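For example, a sketch that repackages 20 ms RTP payloads into 60 ms linear-PCM packets, following the arithmetic above (60 ms × 8000 Hz × 2 bytes = 960 bytes; the function name is ours):

const PACKET_BYTES = 960; // 60 ms of linear PCM, inside the 50-100 ms window
let chunks: Buffer[] = [];
let chunkBytes = 0;

function onRtpPayload(pcm: Buffer) { // 20 ms = 320 bytes of linear PCM
  chunks.push(pcm);
  chunkBytes += pcm.length;
  if (chunkBytes >= PACKET_BYTES) {
    const packet = Buffer.concat(chunks);
    // An odd byte count would be a truncated frame and close the session (see below).
    if (packet.length % 2 !== 0) throw new Error("truncated frame");
    ws.send(packet, { binary: true }); // no acknowledgement is expected
    chunks = [];
    chunkBytes = 0;
  }
}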
If you send truncated frames, i.e. an odd number of bytes, the server will close the session:
- event: CLOSED
- completion_cause: Error
- completion_reason: truncated frame in audio packet
Set session defaults
Set session default values for the recognition parameters.
This command is optional. Use it if you don't want to set and repeat the recognition parameters on each individual RECOGNIZE request. It's a matter of taste and client implementation.
- command: SET-PARAMS
- headers are any subset (all optional) of:
Field | Type | Possible Values or unit | Description |
---|---|---|---|
no_input_timeout | int | milliseconds | When recognition is started and there is no speech detected for this period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client with a completion_cause of NoInputTimeout |
speech_complete_timeout | int | (normal mode only) ms | The speech-complete-timeout value applies when the recognizer currently has a complete match against an active grammar, and specifies how long the recognizer MUST wait for more input before declaring a match. |
speech_incomplete_timeout | int | (normal mode only) ms | The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is returned with a completion_cause of PartialMatch . |
speech_nomatch_timeout | int | (normal mode only) ms | The nomatch timeout applies when the speech prior to the silence doesn't match any of the active grammars. In this case, once the timeout is triggered, the transcript speech input is returned without interpretation and with a completion_cause of NoMatch . |
hotword_min_duration | int | (hotword only) ms | It specifies the minimum duration of an utterance that will be considered for hotword recognition |
hotword_max_duration | int | (hotword only) ms | It specifies the maximum duration of an utterance that will be considered for hotword recognition |
recognition_timeout | int | ms | when recognition is started and there is no match for a certain period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client and terminate the recognition operation |
confidence_threshold | float | 0 ≤ t ≤ 1 | this field tells the recognizer resource what confidence level the client considers a successful match |
speech_language | str | RFC 5646 | we only support fr, fr-FR, en, en-US, en-GB for the moment, plus some business domain extensions |
Unknown headers are just ignored without error.
In hotword mode, there is a subtle difference between recognition_timeout and hotword_max_duration: the recognition_timeout timer is reset on each silence, whereas the hotword_max_duration timer is not.
In normal mode, the recognition_timeout timer is not reset on silences!
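The full parameter set as a TypeScript type, mirroring the table above (a sketch; all fields optional, durations in milliseconds):

interface RecognitionParams {
  no_input_timeout?: number;
  speech_complete_timeout?: number;   // normal mode only
  speech_incomplete_timeout?: number; // normal mode only
  speech_nomatch_timeout?: number;    // normal mode only
  hotword_min_duration?: number;      // hotword mode only
  hotword_max_duration?: number;      // hotword mode only
  recognition_timeout?: number;
  confidence_threshold?: number;      // 0 ≤ t ≤ 1
  speech_language?: string;           // RFC 5646 tag, e.g. "fr" or "en-US"
}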
Success response
- event: PARAMS-SET
Error responses
- event: METHOD-FAILED, with:
  - completion_cause: LanguageUnsupported
  - completion_reason contains details
- event: INVALID-PARAM-VALUE when we were unable to parse the JSON payload (wrong types or syntax error), with:
  - completion_cause: Error
  - completion_reason: detailed error message
  - as the command was not decoded, the request_id header of this response may be 0, and the channel_id could be empty
Why two different events? In the first case, the value (a language tag) is valid but not (currently) supported, as opposed to invalid values that will never be correct. For example, "speech_language": "ar-SA" would raise a METHOD-FAILED, but "speech_language": 78.6 would raise an INVALID-PARAM-VALUE.
Get session defaults
Get session default values of the recognition parameters.
- command: GET-PARAMS
Success response
- event: DEFAULT-PARAMS
- headers: all headers from SET-PARAMS at once
Error responses
None, this command never fails.
Define a grammar alias
- command: DEFINE-GRAMMAR
- headers:
Field | Type | Possible Values or unit | Description |
---|---|---|---|
content_id | str | ASCII string | the alias id you want to define (without the session: prefix) |
content_type | str | "text/uri-list" | The type of the body. Only "text/uri-list" is supported for the moment |
- body: the grammar you want to alias. Examples:
builtin:speech/address
builtin:speech/keywords?alternatives=facture|commande|compte|conseiller
builtin:speech/spelling/mixed?regex=[a-z]{2}[0-9]{9}[a-z]{2}
Success response
- event: GRAMMAR-DEFINED
Error responses
- event: METHOD-FAILED, with:
  - completion_cause: GramDefinitionFailure or GramLoadFailure
  - completion_reason: the explanatory error message
- event: METHOD-NOT-VALID if a recognition process is in progress
- event: MISSING-PARAM if content_id is missing
Start recognition
- command: RECOGNIZE
- headers: the same as SET-PARAMS, plus:
Field | Type | Possible Values or unit | Description |
---|---|---|---|
recognition_mode | str | "normal" / "hotword" | the recognition mode |
start_input_timers | bool | true / false | a value of false tells the recognizer to start recognition but not to start the no-input timer yet; default is false |
content_type | str | "text/uri-list" | The type of the body. Only "text/uri-list" is supported for the moment |
If the client chooses not to start the timers immediately, it should issue a START-INPUT-TIMERS command later.
- body (required): multi-line string containing the grammars to use, one per line (order is significant for priorities). The grammar references may be builtin grammars (builtin:…) or aliases prefixed by session:.
If more than one grammar is given, the server will try to match them all at the same time. The first matching grammar "wins". If several grammars match at once, the one that is earlier in the list has priority.
Future recognition events will reference this request in their request_id.
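For instance, a RECOGNIZE that tries a session alias first and falls back to plain transcription, reusing the sendCommand sketch from earlier (the immat alias is the one defined in the complete example below):

declare const channelId: string; // obtained from the OPENED event

const response = await sendCommand("RECOGNIZE", channelId,
  { recognition_mode: "normal", content_type: "text/uri-list" },
  // One grammar per line; session:immat has priority over the fallback.
  "session:immat\nbuiltin:speech/transcribe");
// Expect "RECOGNITION-IN-PROGRESS" here, then asynchronous events later.
console.log(response.event);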
Success response
- event: RECOGNITION-IN-PROGRESS
Error responses
- event: METHOD-FAILED, with:
  - completion_cause: GramLoadFailure or Error (for ASR errors or when a recognition request is already in progress)
  - completion_reason: the explanatory error message
Asynchronous Recognition events
"start of input" event
In normal mode only, when the recognition process starts hearing a voice, this event is fired:
- event: START-OF-INPUT
Outcome event
Some time later, when the recognition completes, a RECOGNITION-COMPLETE event is fired, with a completion_cause header that may be one of:
- Success: at least one of the grammars matched
- NoInputTimeout: no voice was heard before no_input_timeout expired
- NoMatch: in normal mode, what was "heard" contradicts all grammars, or the confidence was too low to accept a match
- NoMatchMaxtime: no match was found before recognition_timeout was reached, in normal mode
- HotwordMaxtime: no match was found before recognition_timeout or hotword_max_duration was reached, in hotword mode
- TooMuchSpeechTimeout: a match was found, but new speech was still matching when recognition_timeout or hotword_max_duration expired (the match is returned)
- PartialMatch: (normal mode only) only a partial match was found
- PartialMatchMaxtime: (normal mode only) only a partial match was found, and new speech continued to partially match the grammar until recognition_timeout or hotword_max_duration expired (the partial match is returned)
The completion_reason header may give additional justification.
The body contains the Recognition Result JSON object.
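A client would typically branch on the completion cause; a sketch (handleResult, handlePartial and reprompt are hypothetical application callbacks):

declare function handleResult(body: unknown): void;  // hypothetical
declare function handlePartial(body: unknown): void; // hypothetical
declare function reprompt(text: string): void;       // hypothetical

function onRecognitionComplete(ev: ServerEvent) {
  switch (ev.completion_cause) {
    case "Success":
    case "TooMuchSpeechTimeout":  // a match is still returned
      return handleResult(ev.body);
    case "PartialMatch":
    case "PartialMatchMaxtime":   // normal mode only, partial result returned
      return handlePartial(ev.body);
    case "NoInputTimeout":
      return reprompt("I didn't hear anything.");
    default:                      // NoMatch, NoMatchMaxtime, HotwordMaxtime, ...
      return reprompt("Sorry, could you say that again?");
  }
}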
Start input timers
- command: START-INPUT-TIMERS
Success response
- event: INPUT-TIMERS-STARTED
Error responses
None
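A typical use is prompt playback: start the recognizer early with start_input_timers: false, and only start the no-input timer once the prompt has finished. A sketch reusing the sendCommand helper (playPrompt is a hypothetical bot function):

declare function playPrompt(text: string): Promise<void>; // hypothetical

await sendCommand("RECOGNIZE", channelId,
  { recognition_mode: "normal", start_input_timers: false, content_type: "text/uri-list" },
  "session:immat");
await playPrompt("Please spell your plate number.");
// Only now does the no_input_timeout clock start ticking.
await sendCommand("START-INPUT-TIMERS", channelId);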
Stop ongoing recognition
- command: STOP
Success response
- event: STOPPED
- headers:
  - active_request_id: the id of the RECOGNIZE request that was cancelled
Error responses
If there is no recognition process in progress, the command is simply ignored.
Close session
- command: CLOSE
Success response
- event: CLOSED
Error responses
- event: METHOD-NOT-VALID if no session was open.
Grammars
We support the following builtin grammars:
- builtin:grammar/none
- builtin:speech/none
- builtin:speech/address
- builtin:speech/address?struct: returns a structured address as XML
- builtin:speech/boolean: matches whether the speaker agrees or not; the interpretation returns "yes" or "no"
- builtin:speech/transcribe
- builtin:speech/text2num
- builtin:speech/spelling/mixed
- builtin:speech/spelling/digits
- builtin:speech/spelling/letters
- builtin:speech/spelling/mixed?regex= + pattern (the interpretation returns the match as a single word) — partial matches are not supported
- builtin:speech/spelling/digits?regex= + pattern (the interpretation returns the match as a single word) — partial matches are not supported
- builtin:speech/spelling/letters?regex= + pattern (the interpretation returns the match as a single word) — partial matches are not supported
- builtin:speech/spelling/mixed?length= + integer (forces interpretation as a single word of the given length)
- builtin:speech/spelling/digits?length= + integer (forces interpretation as a single word of the given length)
- builtin:speech/spelling/letters?length= + integer (forces interpretation as a single word of the given length)
- builtin:speech/spelling/zipcode
- builtin:speech/zipcode (alias to builtin:speech/spelling/zipcode) — beware that this builtin is not universal: it only recognizes 5-digit zipcodes that are splittable as 2+3 digits. We recommend you use builtin:speech/spelling/digits?length= + integer for such applications instead.
- builtin:speech/keywords?alternatives= + <alternatives>, where <alternatives> is a list of words or expressions separated by |. The grammar matches when one of the listed keywords is found, and the returned interpretation is the found keyword.
You'll find more details on the MRCP page related to grammars.
See the MRCP Recognizer documentation. Instead of XML, we return a recognition result object:
Recognition result
Field | Type | Possible Values or unit | Description |
---|---|---|---|
asr | object | Transcript object | The result of the speech to text transcription |
nlu | object | Interpretation object | Interpreted result (what the engine understood) as structured data |
grammar_uri | str | grammar URI | the grammar that matched (as specified in the RECOGNIZE request) |
Depending on the completion_cause, some or all fields may be empty (null or "").
Transcript
Field | Type | Possible Values or unit | Description |
---|---|---|---|
transcript | str | UTF-8 | The raw ASR transcript |
confidence | float | 0 ≤ c ≤ 1 | The ASR transcript confidence |
start | int | unix timestamp in ms | the timestamp of the start of the transcript |
end | int | unix timestamp in ms | the timestamp of the end of the transcript |
Interpretation
Field | Type | Possible Values or unit | Description |
---|---|---|---|
type | URI | builtin grammar URI | The actual builtin grammar that matched (once all aliases are resolved, without the query part) |
value | grammar dependent | grammar dependent | the actual semantic interpretation |
confidence | float | 0 ≤ c ≤ 1 | The interpretation confidence |
All currently available grammars return a string as value except:
- the boolean grammar that returns a boolean value;
- the address grammar that returns an address object:
{
"type": "builtin:speech/address",
"value":
{
"number": "37",
"street": "rue du docteur leroy",
"zipcode": "72000",
"city": "le mans"
},
"confidence": 0.9
}
In any case, the client should check the type to know how to handle the interpretation, even if it is a plain string.
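Dispatching on the interpretation type might look like this (a sketch; the Address shape follows the example above):

interface Address { number: string; street: string; zipcode: string; city: string; }

function handleInterpretation(nlu: { type: string; value: unknown; confidence: number }) {
  switch (nlu.type) {
    case "builtin:speech/boolean":
      return nlu.value as boolean; // agreement, already decoded as a boolean
    case "builtin:speech/address":
      return nlu.value as Address; // structured address object
    default:
      return nlu.value as string;  // all other grammars return a string
  }
}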
Complete example
We assume the connection has already been established and access granted.
Session initiation
A new user is calling the bot, so the bot opens a new Samosa session and sends:
{
"command": "OPEN",
"request_id": 0,
"channel_id": "test",
"headers": {
"custom_id": "blueprint"
},
"body": ""
}
The command is successful and the server responds:
{
"event": "OPENED",
"request_id": 0,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
The client starts streaming. Audio packets are omitted here.
Prepare the grammar (the NLU) to interpret the first user answer
The bot expects a French car plate number. For clarity and re-usability, the bot developers define an alias for this, creating a custom grammar called immat out of the builtin builtin:speech/spelling/mixed grammar, configured with a custom regex telling the NLU what a French car plate looks like:
{
"command": "DEFINE-GRAMMAR",
"request_id": 1,
"channel_id": "testuie46e4ui6",
"headers": {
"content_id": "immat",
"content_type": "text/uri-list"
},
"body": "builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2})"
}
The command is successful and the server confirms:
{
"event": "GRAMMAR-DEFINED",
"request_id": 1,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
Set some default values
To keep the recognition command "light", the bot developers set the recognition language and confidence threshold once for the duration of the session:
{
"command": "SET-PARAMS",
"request_id": 2,
"channel_id": "testuie46e4ui6",
"headers": {
"speech_language": "fr",
"confidence_threshold": 0.7
},
"body": ""
}
The command is successful, the server responds
{
"event": "PARAMS-SET",
"request_id": 2,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
Start the recognition
The bot asks the user to spell their car plate number and then instructs the Samosa session to listen to the user and interpret what they are saying:
{
"command": "RECOGNIZE",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"headers": {
"recognition_mode": "normal",
"no_input_timout": 5000,
"recognition_timeout": 30000,
"speech_complete_timeout": 800,
"speech_incomplete_timeout": 1500,
"speech_nomatch_timeout": 3000,
"content_type": "text/uri-list"
},
"body": "session:immat"
}
Instead of session:immat, the bot devs could have decided not to define the alias and to directly use the builtin:speech/spelling/mixed?regex=([a-z]{2}[0-9]{3}[a-z]{2})|([0-9]{4}[a-z]{3}[0-9]{2}) expression here.
The process is successfully started and the server responds:
{
"event": "RECOGNITION-IN-PROGRESS",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"completion_cause": "Success",
"completion_reason": null,
"headers": {},
"body": ""
}
Receive results
The bot now waits for more events from the server while the user is speaking — hopefully spelling their car plate number, among other things (the NLU is quite robust).
As we are using normal mode, as soon as the user's voice is detected, the server sends the START-OF-INPUT event:
{
"event": "START-OF-INPUT",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"completion_cause": null,
"completion_reason": null,
"headers": {},
"body": ""
}
Some time later, when the user is done speaking, the server returns what it has understood (in this case, a success):
{
"event": "RECOGNITION-COMPLETE",
"request_id": 3,
"channel_id": "testuie46e4ui6",
"completion_cause": "Success",
"completion_reason": null,
"headers": {},
"body": {
"asr": {
"transcript": "attendez alors voilà baissé trois cent cinq f z",
"confidence": 0.9,
"start": 1629453934909,
"end": 1629453944833
},
"nlu": {
"type": "builtin:speech/spelling/mixed",
"value": "bc305fz",
"confidence": 0.86
},
"grammar_uri": "session:immat"
}
}