The Live API protocol (V2)
You should read the overview first to understand how the API works.
In this article, we describe the protocol spoken over websockets (WS).
We use WS text messages with JSON content for all commands and events. The only exception is the binary audio data, which is
streamed in WS binary messages for efficiency (unlike version 1 of the protocol, there is no Base64 encoding).
When you subscribed to the service, we gave you two credentials:
- an organization identifier.
- an individual authentication token issued within your organization.
The token is used to grant you access to the service when you open the websocket connection. Each service, application or user within
your organization should have its own token, but they share the same organization identifier.
The organization identifier allows you to create and/or join conversations in your organization space. Only members of the organization can do that,
which is why the organization identifier you give must match the one encoded in your token.
The websocket URL
To connect to the API, you must provide an organization token
YOUR_TOKEN. It was provided to you, along with your
organization identifier, when you subscribed to the service.
The server sets a timeout of 60 seconds on all websocket connections. If you expect little activity, you should send websocket
pings to keep the connection open (heartbeat).
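The heartbeat can be sketched as a small background task. This is only a sketch: it assumes an asyncio-based client, and `send_ping` is a hypothetical hook that you would wire to the ping call of whatever websocket library you use. The 20 second interval is an arbitrary value chosen to stay well under the 60 second timeout, not a value prescribed by the service.

```python
import asyncio
from typing import Awaitable, Callable

WS_TIMEOUT_S = 60.0  # server-side websocket timeout stated above


async def heartbeat(
    send_ping: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    interval_s: float = 20.0,  # assumption: any value comfortably below 60 s
) -> int:
    """Call send_ping every interval_s seconds until stop is set.

    send_ping is a hypothetical hook, not part of the Live API.
    Returns the number of pings sent.
    """
    if not 0 < interval_s < WS_TIMEOUT_S:
        raise ValueError("interval must be positive and below the server timeout")
    sent = 0
    while not stop.is_set():
        await send_ping()
        sent += 1
        try:
            # Sleep for one interval, but wake up early if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass
    return sent
```

Run it as a task next to your reader and writer tasks, and set `stop` when you close the connection.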
JSON format for messages and events
Messages sent by the client and events published by the server share the same format, a JSON list:
[join_ref, ref, topic, event, payload]
- ref is a unique string reference that the client changes each time it sends a new message.
- join_ref is the reference given when joining a conversation.
- topic is a string identifying the conversation (see Joining a conversation below). It must be present on all messages and events to and from the conversation.
- event is a string indicating the type of the event or message.
- payload is an object representing the specific payload of the event or message.
As the protocol is fully asynchronous, a client should not wait for server confirm events before sending its next messages.
ref is a reconciliation key that the client can use later to identify which message a confirm event responds to.
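The list format and the role of ref can be illustrated with a small helper that builds outgoing frames and matches confirm events back to the message they answer. This is a sketch under assumptions: the class name, the counter-based refs and the event names used in the usage note are illustrative, and it assumes that a confirm event carries the ref of the message it confirms while broadcast events carry a null ref.

```python
import itertools
import json
from typing import Any, Dict, Optional


class MessageTracker:
    """Builds [join_ref, ref, topic, event, payload] frames and matches replies by ref."""

    def __init__(self, join_ref: str) -> None:
        self.join_ref = join_ref
        self._counter = itertools.count(1)
        # ref -> event name of the message we sent and that is still unconfirmed
        self._pending: Dict[str, str] = {}

    def encode(self, topic: str, event: str, payload: Dict[str, Any]) -> str:
        """Return the JSON text frame for a new message, using a fresh ref."""
        ref = str(next(self._counter))
        self._pending[ref] = event
        return json.dumps([self.join_ref, ref, topic, event, payload])

    def reconcile(self, text_frame: str) -> Optional[str]:
        """Return the event name of the message this frame confirms, if any."""
        _join_ref, ref, _topic, _event, _payload = json.loads(text_frame)
        if ref is None:  # assumption: broadcast events carry no ref
            return None
        return self._pending.pop(ref, None)
```

For example, `tracker.encode(topic, "some_event", {})` can be sent immediately, without waiting for earlier confirmations; when a frame arrives, `tracker.reconcile(frame)` tells you which of your messages it answers (`"some_event"` here is a placeholder, not a real event name).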
For the sake of simplicity, all examples use a
Joining a conversation
Scenario: your organization identifier is
acme_corp and you want to join a conversation called
conference as active speaker
You need to send a
join_ref must be the same on that command. All subsequent messages you'll send must have the same
join_ref but a different
ref each time.
Please note how the
topic is formatted:
conversation:<organization identifier>@<conversation name>.
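As an illustration of the topic format, the following sketch builds and parses topic strings. The function names are illustrative, and the parser assumes that the organization identifier itself contains no `@` character, which the format above does not state.

```python
from typing import Tuple

TOPIC_PREFIX = "conversation:"


def build_topic(organization_id: str, conversation_name: str) -> str:
    """Format a topic as conversation:<organization identifier>@<conversation name>."""
    if not organization_id or not conversation_name:
        raise ValueError("both the organization identifier and the conversation name are required")
    return f"{TOPIC_PREFIX}{organization_id}@{conversation_name}"


def parse_topic(topic: str) -> Tuple[str, str]:
    """Split a topic into (organization identifier, conversation name)."""
    if not topic.startswith(TOPIC_PREFIX):
        raise ValueError(f"not a conversation topic: {topic!r}")
    # Assumption: the first "@" separates the organization from the name.
    organization_id, sep, conversation_name = topic[len(TOPIC_PREFIX):].partition("@")
    if not sep or not organization_id or not conversation_name:
        raise ValueError(f"malformed topic: {topic!r}")
    return organization_id, conversation_name
```

With the scenario above, `build_topic("acme_corp", "conference")` gives `conversation:acme_corp@conference`.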
If you wanted to join the conversation as an observer, you would set the
readonly flag to
As a conversation is private to your organization, the
<organization identifier> part is mandatory and checked against your token to grant access. Your organization identifier was given to you with your token when you subscribed.
The ASR parameters come next. They apply only if
readonly is false.
If the join was successful then you get a response like this:
If the join was not successful then you get
"error" status instead of
"ok" and a detailed error message in
Sending a binary audio chunk
Scenario: you successfully joined the conversation
conference as active speaker
Alice and you want to stream audio.
Streaming audio is as simple as sending a stream of audio chunk messages as you get them from your audio source (audio card, network, file).
A chunk is limited to 64 KB in size.
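When the audio comes from a buffer rather than a live device, it has to be cut into chunks that respect this limit. The sketch below is one way to do it. It reads "64 KB" as 65,536 bytes, which is an assumption, and it defaults to much smaller chunks of 3,200 bytes (200 ms of audio in the format this API expects), an arbitrary choice. It also enforces the even-byte-count rule that follows from 16-bit samples.

```python
from typing import Iterator

MAX_CHUNK_BYTES = 64 * 1024  # assumption: "64 KB" read as 65,536 bytes


def split_audio(pcm: bytes, chunk_bytes: int = 3200) -> Iterator[bytes]:
    """Yield consecutive chunks of raw 16-bit PCM, each at most chunk_bytes long."""
    if chunk_bytes <= 0 or chunk_bytes % 2 or chunk_bytes > MAX_CHUNK_BYTES:
        raise ValueError("chunk_bytes must be even, positive and within the size limit")
    if len(pcm) % 2:
        raise ValueError("16-bit audio must contain an even number of bytes")
    for start in range(0, len(pcm), chunk_bytes):
        # Every slice starts on a sample boundary, so each chunk stays even-sized.
        yield pcm[start:start + chunk_bytes]
```

Because this is a generator, the argument checks run when iteration starts, not when `split_audio` is called.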
You send an audio chunk as a binary packed
audio_chunk message. The binary format is:
|8 bits||8 bits||8 bits||8 bits||8 bits||utf-8 string||utf-8 string||utf-8 string||utf-8 string||bytes|
|0||11||audio chunk data|
The first byte is always
The source audio must be raw 16-bit signed little-endian samples at a sampling rate of 8 kHz, so an audio chunk must contain an even number of bytes.
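A minimal packing sketch follows. It rests on an assumption that you should check against the table before relying on it: that the five leading bytes are a message-kind byte (0) followed by the byte lengths of join_ref, ref, topic and event, in that order, and that the four UTF-8 strings then follow in the same order, before the raw audio. That order is inferred from the JSON list format and from 11 being the length of `audio_chunk`; the function name is illustrative.

```python
import struct

KIND_BYTE = 0  # assumption: value of the leading byte, taken from the "0" in the table


def pack_audio_chunk(join_ref: str, ref: str, topic: str, audio: bytes,
                     event: str = "audio_chunk") -> bytes:
    """Pack one binary audio message (field order is an assumption, see above)."""
    if len(audio) % 2:
        raise ValueError("16-bit audio must contain an even number of bytes")
    fields = [s.encode("utf-8") for s in (join_ref, ref, topic, event)]
    if any(len(f) > 255 for f in fields):
        raise ValueError("each string must fit in an 8-bit length field")
    # Five unsigned bytes: kind, then the byte length of each string.
    header = struct.pack("5B", KIND_BYTE, *(len(f) for f in fields))
    return header + b"".join(fields) + audio
```

The result is meant to be sent as a single WS binary message, one per audio chunk.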
You may receive an asynchronous error message if an error occurs during streaming, but the server won't acknowledge each packet individually.
An active speaker should stream audio continuously, starting immediately after joining. If there is no audio for 5 minutes, the speaker is forced to leave the conversation, so you are advised to stream silence when the speaker is muted. This timeout is independent of the websocket timeout: it is an application timeout that applies exclusively to audio, and it releases decoding resources that are not being used.
By streaming continuously, including silence, you can easily prevent both timeouts (websocket and audio) from firing.
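Silence in this format is simply a run of zero-valued samples. The sketch below generates a silence chunk of a given duration and computes how much audio time a chunk represents, which is useful to pace the stream in real time. The function names and the 100 ms default are illustrative choices.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz, as required by the API
BYTES_PER_SAMPLE = 2   # 16-bit samples


def silence_chunk(duration_ms: int = 100) -> bytes:
    """Return duration_ms of silence as raw 16-bit PCM (all samples are zero)."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return b"\x00\x00" * samples


def chunk_duration_ms(chunk: bytes) -> float:
    """Return the audio time covered by a chunk, in milliseconds."""
    return len(chunk) / BYTES_PER_SAMPLE * 1000 / SAMPLE_RATE_HZ
```

While a speaker is muted, sending `silence_chunk()` and then sleeping for `chunk_duration_ms(chunk)` milliseconds before the next one keeps the audio clock aligned with real time.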
This API is optimized for live audio processing. Please don't use it for batch transcription of audio files: your decoding resources will be throttled. We offer dedicated APIs for batch processing; please contact our commercial support.
Leaving a conversation
You will automatically leave the room if you disconnect your connection, but you can also explicitly leave the room without closing your connection by using the
You will then receive a server confirm, and later, a
speaker_left event when all the audio has been processed.
You are encouraged to leave a conversation explicitly and to wait for your
speaker_left event (see below) before disconnecting. That way, you are sure not to miss any of your own recognition events.
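The recommended shutdown order (leave, keep reading, then disconnect) can be sketched as a loop that keeps consuming events until your own speaker_left arrives. This sketch assumes events are already decoded into the `[join_ref, ref, topic, event, payload]` list. Since other speakers also produce speaker_left events, the caller supplies an `is_own_departure` predicate; how a payload identifies the speaker is not shown here, so the `"speaker"` key in the usage note is purely hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List

Frame = List[Any]  # [join_ref, ref, topic, event, payload]


def drain_until_left(
    frames: Iterable[Frame],
    topic: str,
    is_own_departure: Callable[[Dict[str, Any]], bool],
) -> List[Frame]:
    """Collect the events of one conversation up to and including our own speaker_left."""
    collected: List[Frame] = []
    for frame in frames:
        _join_ref, _ref, frame_topic, event, payload = frame
        if frame_topic != topic:
            continue  # event for another conversation on the same connection
        collected.append(frame)
        if event == "speaker_left" and is_own_departure(payload):
            break  # all our audio has been processed: safe to disconnect
    return collected
```

A call could look like `drain_until_left(incoming, topic, lambda p: p.get("speaker") == "Alice")`, where `incoming` is your stream of decoded events and the `"speaker"` key is a placeholder for whatever field identifies the speaker.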
As the transcription progresses, all participants receive recognition events of type
words_decoded for interim results, and
segment_decoded for definitive segment transcripts. An
utterance_id is provided to identify related recognition events.
In the same scenario as before, those events look like this (only the
event type changes between the two):
"transcript": "je vous entends très bien",
The timestamps of the segment and of the words are integers in milliseconds: Unix time computed from the beginning of the audio, according to the provided origin.
Time follows the audio clock. So if you want to keep realtime timings, you should send silence when a speaker is muted instead of suspending the audio stream.
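On the receiving side, interim and definitive results can be reconciled through utterance_id. The sketch below keeps the latest interim transcript per utterance and replaces it with the definitive one when the segment is decoded. It assumes that `utterance_id` and `transcript` are top-level keys of the event payload, which is a guess about the payload layout; the class name is illustrative.

```python
from typing import Any, Dict, List


class TranscriptAssembler:
    """Tracks interim (words_decoded) and definitive (segment_decoded) transcripts."""

    def __init__(self) -> None:
        self.interim: Dict[Any, str] = {}  # utterance_id -> latest interim text
        self.final: List[str] = []         # definitive segments, in arrival order

    def handle(self, event: str, payload: Dict[str, Any]) -> None:
        if event not in ("words_decoded", "segment_decoded"):
            return  # not a recognition event
        # Assumption: both keys sit at the top level of the payload.
        utterance_id = payload["utterance_id"]
        text = payload["transcript"]
        if event == "words_decoded":
            self.interim[utterance_id] = text  # newer interim replaces the older one
        else:
            self.interim.pop(utterance_id, None)  # the utterance is now settled
            self.final.append(text)
```

A display would typically show the `final` segments followed by whatever is still in `interim`.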
To learn about Enrich events, please check the dedicated documentation page.
When a speaker joins the conversation, a
speaker_joined event is broadcast to all other participants:
When a speaker leaves the conversation, a
speaker_left event is broadcast to all participants (including the one who is leaving):
Leaving users also receive their own
speaker_left event so that they know when all their audio has been processed and that they can safely disconnect without losing decoding events.