The Live API protocol V1
This is the legacy protocol supported by our Live API. Developers are strongly advised to use protocol V2 instead.
You should read the overview first to understand how the API works.
In this article, we describe the protocol spoken over websockets.
Authentication
When you subscribed to the service, we gave you two credentials:
- an organization identifier.
- an individual authentication token issued within your organization.
The token is used to grant you access to the service when you open the websocket connection. Each service, application or user within your organization should have its own token, but they share the same organization identifier.
The organization identifier allows you to create and/or join conversations in your organization space. Only members of the organization can do that. That's why the organization identifier you give must match the one encoded in your token.
The websocket URL
wss://api.uh.live/socket/websocket?token=YOUR_TOKEN
To connect to the API, you must provide your authentication token YOUR_TOKEN. It was provided to you, along with your organization identifier, when you subscribed to the service.
The server sets a timeout of 60s on all websocket connections. If you expect little activity, you should use websocket pings to keep the connection open (heartbeat).
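As an illustration, here is a minimal connection sketch in Python, assuming the third-party websockets package; the URL and YOUR_TOKEN placeholder are those described above.

# Minimal sketch, assuming Python with the third-party "websockets" package.
# YOUR_TOKEN is the authentication token you received when subscribing.
import asyncio
import websockets

URL = "wss://api.uh.live/socket/websocket?token=YOUR_TOKEN"

async def keep_alive():
    async with websockets.connect(URL) as ws:
        while True:
            await ws.ping()          # websocket-level heartbeat
            await asyncio.sleep(30)  # stay well under the 60s server timeout

asyncio.run(keep_alive())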
JSON format for messages and events
The format is the same for messages sent by the client, as for the events published by the server:
{
  "topic": "...",
  "event": "...",
  "payload": {},
  "ref": 0
}
In which:
- topic is a string identifying the conversation (see Joining a conversation below) and must be present on all messages and events to/from the conversation.
- event is a string indicating the type of the event or message.
- payload is an object representing the specific payload of the event/message.
- ref is an integer incremented by the client each time it sends a new message.
As the protocol is fully asynchronous, a client should not wait for server confirm events before sending its next messages. The ref is a reconciliation key the client can use later to identify which message a confirm event responds to.
For the sake of simplicity, all examples use a ref of 0.
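As an illustration of the envelope and the ref counter, a client could use a small helper like the Python sketch below (the function name build_message is ours, not part of the API):

import itertools
import json

_refs = itertools.count()  # incremented for every message the client sends

def build_message(topic: str, event: str, payload: dict) -> str:
    """Wrap a payload in the protocol's JSON envelope with a fresh ref."""
    return json.dumps({
        "topic": topic,
        "event": event,
        "payload": payload,
        "ref": next(_refs),
    })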
Joining a conversation
Scenario: your organization identifier is acme_corp and you want to join a conversation called conference as active speaker Alice.
You need to send a phx_join message:
{
  "topic": "conversation:acme_corp@conference",
  "event": "phx_join",
  "payload": {
    "speaker": "Alice",
    "readonly": false,
    "model": "fr",
    "country": "fr",
    "interim_results": true,
    "rescoring": true,
    "origin": 1614099879211
  },
  "ref": 0
}
Please note how the topic is formatted: conversation:<organization identifier>@<conversation name>.
If you wanted to join the conversation as an observer, you would set the readonly flag to true.
As a conversation is private to your organization, the <organization identifier> part is mandatory and checked against your token to grant access. Your organization identifier was given to you with your token when you subscribed.
The remaining fields are ASR parameters, which apply only if readonly is false.
If the join was successful then you get a response like this:
{
  "topic": "conversation:acme_corp@conference",
  "ref": 0,
  "payload": {
    "status": "ok",
    "response": {}
  },
  "join_ref": null,
  "event": "phx_reply"
}
If the join was not successful then you get an "error" status instead of "ok" and a detailed error message in payload.response.reason.
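Putting the above together, a join could look like the following Python sketch (the helper name join is ours; ws is an already open websocket connection, and origin is assumed here to be the current Unix time in milliseconds):

import json
import time

TOPIC = "conversation:acme_corp@conference"

async def join(ws, speaker="Alice", readonly=False):
    await ws.send(json.dumps({
        "topic": TOPIC,
        "event": "phx_join",
        "payload": {
            "speaker": speaker,
            "readonly": readonly,               # True to join as an observer
            "model": "fr",
            "country": "fr",
            "interim_results": True,
            "rescoring": True,
            "origin": int(time.time() * 1000),  # assumed: current Unix time in ms
        },
        "ref": 0,
    }))
    # The protocol is asynchronous: scan incoming events until our phx_reply arrives.
    while True:
        event = json.loads(await ws.recv())
        if event["event"] == "phx_reply" and event["ref"] == 0:
            if event["payload"]["status"] != "ok":
                raise RuntimeError(event["payload"]["response"].get("reason"))
            return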
Sending an audio chunk
Scenario: you successfully joined the conversation conference as active speaker Alice and you want to stream audio.
Streaming audio is as simple as sending a stream of audio chunk messages as you get them from your audio source (audio card, network, file).
There is a limit of 64 KB on the size of a chunk.
You send an audio chunk as an audio_chunk message:
{
  "topic": "conversation:acme_corp@conference",
  "event": "audio_chunk",
  "payload": {
    "blob": "..."
  },
  "ref": 0
}
As we use JSON, the audio chunk is base64 encoded (blob in the payload).
The source audio must be encoded as raw, 16-bit signed little-endian samples at an 8 kHz rate, so an audio chunk must contain an even number of bytes.
You may receive an asynchronous error message in case of error during the streaming, but the server won't acknowledge each packet individually.
An active speaker should continuously stream audio, immediately after joining. If there is no audio for 5 minutes, the speaker is forced to leave the conversation, so you are advised to stream silence when the speaker is muted. This timeout is independent of the websocket timeout: it is an application timeout that applies exclusively to audio, to release decoding resources when they are not used. By streaming continuously, including silence, you can easily prevent both timeouts (websocket and audio) from firing.
This API is optimized for live audio processing, please don't abuse it for batch transcription of audio files: your decoding resources will be throttled. We offer dedicated APIs for batch processing. Please contact our commercial support.
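As a sketch of the encoding and pacing described above (Python again; the chunk size and helper name are our own choices, not API requirements):

import asyncio
import base64
import json

CHUNK_SIZE = 8000  # 0.5s of 16-bit mono audio at 8 kHz (16 000 bytes/s), well below 64 KB

async def stream_audio(ws, topic, pcm_chunks):
    """Send raw 16-bit little-endian 8 kHz audio as base64-encoded audio_chunk messages."""
    ref = 1
    for chunk in pcm_chunks:  # each chunk is a bytes object of even length
        await ws.send(json.dumps({
            "topic": topic,
            "event": "audio_chunk",
            "payload": {"blob": base64.b64encode(chunk).decode("ascii")},
            "ref": ref,
        }))
        ref += 1
        await asyncio.sleep(0.5)  # pace the stream at roughly real time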
Leaving a conversation
You will automatically leave the conversation if you close your connection, but you can also explicitly leave without closing your connection by sending the phx_leave message:
{
  "topic": "conversation:acme_corp@conference",
  "event": "phx_leave",
  "payload": {},
  "ref": 0
}
You will then receive a server confirm and, later, a speaker_left event when all the audio has been processed.
You are encouraged to explicitly leave a conversation and wait for your speaker_left event (see below) before disconnecting. That way, you are sure not to miss any of your recognition events.
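A graceful shutdown could therefore look like this sketch (Python; the helper name leave_and_drain is ours):

import json

async def leave_and_drain(ws, topic, speaker="Alice"):
    """Send phx_leave, then wait for our own speaker_left before disconnecting."""
    await ws.send(json.dumps({
        "topic": topic,
        "event": "phx_leave",
        "payload": {},
        "ref": 0,
    }))
    while True:
        event = json.loads(await ws.recv())
        # Late recognition events may still arrive here; process them as usual.
        if event["event"] == "speaker_left" and event["payload"].get("speaker") == speaker:
            break  # all our audio has been processed; it is now safe to disconnect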
Recognition events
As the transcription process progresses, all participants will receive recognition events of type words_decoded for interim results, and segment_decoded for definitive segment transcripts. An utterance_id is provided to identify related recognition events.
In the same scenario as before, those events look like this (only the event type changes between the two):
{
  "event": "segment_decoded",
  "payload": {
    "lang": "fr",
    "speaker": "Alice",
    "confidence": 0.999994,
    "end": 1614174590704,
    "length": 1110,
    "start": 1614174589594,
    "transcript": "je vous entends très bien",
    "utterance_id": 294,
    "words": [
      {
        "confidence": 0.99997,
        "end": 1614174589804,
        "length": 210,
        "start": 1614174589594,
        "word": "je"
      },
      {
        "confidence": 1.0,
        "end": 1614174589954,
        "length": 150,
        "start": 1614174589804,
        "word": "vous"
      },
      {
        "confidence": 1.0,
        "end": 1614174590224,
        "length": 270,
        "start": 1614174589954,
        "word": "entends"
      },
      {
        "confidence": 1.0,
        "end": 1614174590434,
        "length": 210,
        "start": 1614174590224,
        "word": "très"
      },
      {
        "confidence": 1.0,
        "end": 1614174590704,
        "length": 270,
        "start": 1614174590434,
        "word": "bien"
      }
    ]
  },
  "ref": null,
  "topic": "conversation:acme_corp@conference"
}
The timestamps of the segment and words are integers in milliseconds, expressed as Unix time computed from the beginning of the audio according to the provided origin. Time follows the audio clock, so if you want to keep real-time timings, you should send silence when a speaker is muted instead of suspending the audio stream.
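For example, a receiver could route recognition events like this (Python sketch; handle_recognition_event is a hypothetical helper):

import json

def handle_recognition_event(raw_message: str) -> None:
    """Print interim and final transcripts from a raw websocket message."""
    event = json.loads(raw_message)
    payload = event.get("payload", {})
    if event.get("event") == "words_decoded":
        # Interim hypothesis: may still change for this utterance_id.
        print(f"[interim #{payload['utterance_id']}] {payload['transcript']}")
    elif event.get("event") == "segment_decoded":
        # Definitive transcript; start/end are Unix timestamps in milliseconds.
        duration_ms = payload["end"] - payload["start"]
        print(f"[final #{payload['utterance_id']}] {payload['transcript']} ({duration_ms} ms)")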
Other events
To learn about Enrich events, please check the dedicated documentation page.
Speaker joined
When a speaker joins the conversation, a speaker_joined event is broadcast to all other participants:
{
  "event": "speaker_joined",
  "payload": {
    "interim_results": true,
    "rescoring": true,
    "speaker": "Alice",
    "timestamp": 1614176526425
  },
  "ref": null,
  "topic": "conversation:acme_corp@conference"
}
Speaker left
When a speaker leaves the conversation, a speaker_left event is broadcast to all participants (including the one who is leaving):
{
  "topic": "conversation:acme_corp@conference",
  "event": "speaker_left",
  "payload": {
    "speaker": "Alice",
    "timestamp": 1614174592000
  },
  "ref": null
}
Leaving users also receive their own speaker_left event, so that they know when all their audio has been processed and they may safely disconnect without losing decoding events.