The Live API protocol V1

This is the legacy protocol supported by our Live API. Developers are strongly advised to use protocol V2 instead.

You should read the overview first to understand how the API works.

In this article, we describe the protocol spoken over websockets.

Authentication

When you subscribed to the service, we gave you two credentials:

  • an organization identifier.
  • an individual authentication token issued within your organization.

The token is used to grant you access to the service when you open the websocket connection. Each service, application or user within your organization should have its own token, but they share the same organization identifier.

The organization identifier allows you to create and/or join conversations in your organization space. Only members of the organization can do that. That is why the organization identifier you provide must match the one encoded in your token.

The websocket URL

wss://api.uh.live/socket/websocket?token=YOUR_TOKEN

To connect to the API, you must provide your authentication token as YOUR_TOKEN. It was issued to you, along with your organization identifier, when you subscribed to the service.

The server sets a 60 s timeout on all websocket connections. If you expect little activity, you should use websocket pings to keep the connection open (heartbeat).
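
As an illustration, here is a minimal connection sketch assuming the third-party Python websockets package (any websocket client works); ping_interval sends regular websocket pings as the heartbeat, well under the 60 s server timeout:

import asyncio
import websockets  # third-party "websockets" package

API_URL = "wss://api.uh.live/socket/websocket?token=YOUR_TOKEN"

async def connect():
    # ping_interval keeps the connection alive with periodic websocket
    # pings (heartbeat) even when no application messages are flowing.
    async with websockets.connect(API_URL, ping_interval=20) as ws:
        ...  # join a conversation, stream audio, read events

asyncio.run(connect())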

JSON format for messages and events

The format is the same for messages sent by the client and for events published by the server:

{
  "topic": "...",
  "event": "...",
  "payload": {},
  "ref": 0
}

In which:

  • topic is a string identifying the conversation (see below Joining a conversation) and must be present on all messages and events to/from the conversation.
  • event is a string indicating the type of the event or message.
  • payload is an object representing the specific payload of the event/message.
  • ref is an integer incremented by the client each time it sends a new message.

As the protocol is fully asynchronous, a client should not wait for server confirm events before sending its next messages. The ref is a reconciliation key the client can use later to identify which message a confirm event responds to.

For the sake of simplicity, all examples use a ref of 0.
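
As an illustration, a client could build this envelope with a small helper that increments ref on every message; the helper name and counter are our own choices, not part of the protocol:

import itertools

_ref = itertools.count()  # 0, 1, 2, ...

def envelope(topic, event, payload):
    # Build a protocol message; each call uses a fresh, incremented ref.
    return {
        "topic": topic,
        "event": event,
        "payload": payload,
        "ref": next(_ref),
    }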

Joining a conversation

Scenario: your organization identifier is acme_corp and you want to join a conversation called conference as active speaker Alice.

You need to send a phx_join message:

{
  "topic": "conversation:acme_corp@conference",
  "event": "phx_join",
  "payload": {
    "speaker": "Alice",
    "readonly": false,
    "model": "fr",
    "country": "fr",
    "interim_results": true,
    "rescoring": true,
    "origin": 1614099879211
  },
  "ref": 0
}

Please note how the topic is formatted: conversation:<organization identifier>@<conversation name>.

If you wanted to join the conversation as an observer, you would set the readonly flag to true.

As a conversation is private to your organization, the <organization identifier> part is mandatory and checked against your token to grant access. Your organization identifier was given to you with your token when you subscribed.

Then follow the ASR parameters, which apply only if readonly is false.

If the join is successful, you get a response like this:

{
  "topic": "conversation:acme_corp@conference",
  "ref": 0,
  "payload": {
    "status": "ok",
    "response": {}
  },
  "join_ref": null,
  "event": "phx_reply"
}

If the join fails, you get an "error" status instead of "ok", along with a detailed error message in payload.response.reason.
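
As a sketch, joining could be implemented by sending the phx_join message and waiting for the phx_reply whose ref matches, reusing the hypothetical envelope helper sketched above; using the current Unix time in milliseconds as origin is our own choice:

import json
import time

async def join(ws, topic, speaker):
    # Send phx_join and wait for the matching phx_reply.
    msg = envelope(topic, "phx_join", {
        "speaker": speaker,
        "readonly": False,
        "model": "fr",
        "country": "fr",
        "interim_results": True,
        "rescoring": True,
        "origin": int(time.time() * 1000),  # ms Unix time of the audio start
    })
    await ws.send(json.dumps(msg))
    while True:
        reply = json.loads(await ws.recv())
        if reply["event"] == "phx_reply" and reply["ref"] == msg["ref"]:
            if reply["payload"]["status"] != "ok":
                raise RuntimeError(reply["payload"]["response"].get("reason"))
            return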

Sending an audio chunk

Scenario: you successfully joined the conversation conference as active speaker Alice and you want to stream audio.

Streaming audio is as simple as sending a stream of audio chunk messages as you get them from your audio source (audio card, network, file).

There is a 64 KB limit on the size of a chunk.

You send an audio chunk as an audio_chunk message:

{
  "topic": "conversation:acme_corp@conference",
  "event": "audio_chunk",
  "payload": {
    "blob": "..."
  },
  "ref": 0
}

As the messages are JSON, the audio chunk is base64 encoded (the blob field in the payload).

The source audio must be raw, 16-bit signed little-endian samples at an 8 kHz rate, so an audio chunk must contain an even number of bytes.
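
As an illustration, raw PCM audio could be split, base64 encoded and sent like this, reusing the hypothetical envelope helper from above; the 16,000-byte chunk size (one second of audio, well under the 64 KB limit) is our own choice:

import base64
import json

CHUNK_BYTES = 16000  # 1 s of 16-bit mono audio at 8 kHz, well under 64 KB

async def stream_raw_pcm(ws, topic, pcm_bytes):
    # Split the raw audio into chunks and send each one as an audio_chunk.
    for i in range(0, len(pcm_bytes), CHUNK_BYTES):
        chunk = pcm_bytes[i:i + CHUNK_BYTES]
        blob = base64.b64encode(chunk).decode("ascii")
        await ws.send(json.dumps(envelope(topic, "audio_chunk", {"blob": blob})))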

You may receive an asynchronous error message if something goes wrong during streaming, but the server does not acknowledge each chunk individually.

An active speaker should stream audio continuously, starting immediately after joining. If there is no audio for 5 minutes, the speaker is forced to leave the conversation, so you are advised to stream silence when the speaker is muted. This timeout is independent from the websocket timeout: it is an application timeout that applies exclusively to audio, to release decoding resources when they are not used. By streaming continuously, including silence, you can easily prevent both timeouts (websocket and audio) from firing.
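
One second of silence is simply 16,000 zero bytes (8,000 samples of 2 bytes each). Here is a sketch of a keepalive loop that streams it while the speaker is muted, reusing the hypothetical stream_raw_pcm helper above; the pacing is our own choice:

import asyncio

SILENCE_1S = b"\x00" * 16000  # 1 s of silence: 8000 samples x 2 bytes

async def stream_silence_while_muted(ws, topic, is_muted):
    # Send one second of silence per second of real time while muted,
    # keeping both the audio and websocket timeouts from firing.
    while is_muted():
        await stream_raw_pcm(ws, topic, SILENCE_1S)
        await asyncio.sleep(1)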

This API is optimized for live audio processing; please don't abuse it for batch transcription of audio files, or your decoding resources will be throttled. We offer dedicated APIs for batch processing; please contact our commercial support.

Leaving a conversation

You automatically leave the conversation when you close your connection, but you can also leave it explicitly, without closing the connection, by sending a phx_leave message:

{
  "topic": "conversation:acme_corp@conference",
  "event": "phx_leave",
  "payload": {},
  "ref": 0
}

You will then receive a server confirm and, later, a speaker_left event once all your audio has been processed.

You are encouraged to leave a conversation explicitly and wait for your speaker_left event (see below) before disconnecting. That way, you are sure not to miss any of your recognition events.
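
A sketch of such a clean shutdown, reusing the hypothetical envelope helper: send phx_leave, then simply wait for your own speaker_left event before closing the connection.

import json

async def leave(ws, topic, speaker):
    # Explicitly leave, then wait for our own speaker_left before closing.
    await ws.send(json.dumps(envelope(topic, "phx_leave", {})))
    while True:
        event = json.loads(await ws.recv())
        if event["event"] == "speaker_left" and event["payload"]["speaker"] == speaker:
            return  # all our audio has been processed; safe to disconnect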

Recognition events

As the transcription process progresses, all participants receive recognition events: words_decoded for interim results and segment_decoded for definitive segment transcripts. An utterance_id is provided to identify related recognition events.

In the same scenario as before, these events look like this (only the event type differs between the two):

{
  "event": "segment_decoded",
  "payload": {
    "lang": "fr",
    "speaker": "Alice",
    "confidence": 0.999994,
    "end": 1614174590704,
    "length": 1110,
    "start": 1614174589594,
    "transcript": "je vous entends très bien",
    "utterance_id": 294,
    "words": [
      {
        "confidence": 0.99997,
        "end": 1614174589804,
        "length": 210,
        "start": 1614174589594,
        "word": "je"
      },
      {
        "confidence": 1.0,
        "end": 1614174589954,
        "length": 150,
        "start": 1614174589804,
        "word": "vous"
      },
      {
        "confidence": 1.0,
        "end": 1614174590224,
        "length": 270,
        "start": 1614174589954,
        "word": "entends"
      },
      {
        "confidence": 1.0,
        "end": 1614174590434,
        "length": 210,
        "start": 1614174590224,
        "word": "très"
      },
      {
        "confidence": 1.0,
        "end": 1614174590704,
        "length": 270,
        "start": 1614174590434,
        "word": "bien"
      }
    ]
  },
  "ref": null,
  "topic": "conversation:acme_corp@conference"
}

The timestamps of the segment and words are integers in milliseconds: Unix time computed from the provided origin and the position in the audio. Time follows the audio clock, so if you want to keep real-time timings, you should send silence when a speaker is muted instead of suspending the audio stream.
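
As an illustration, a receive loop could dispatch these events as follows, keeping interim results per utterance_id and replacing them when the definitive segment arrives; the handling strategy is our own choice:

import json

async def read_events(ws):
    interim = {}  # utterance_id -> latest interim transcript
    while True:
        event = json.loads(await ws.recv())
        payload = event.get("payload", {})
        if event["event"] == "words_decoded":
            # Interim result: the transcript may still change for this utterance.
            interim[payload["utterance_id"]] = payload["transcript"]
        elif event["event"] == "segment_decoded":
            # Definitive transcript for this utterance; discard the interim one.
            interim.pop(payload["utterance_id"], None)
            print(payload["speaker"], payload["transcript"])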

Other events

To learn about Enrich events, please check the dedicated documentation page.

Speaker joined

When a speaker joins the conversation, a speaker_joined event is broadcast to all other participants:

{
  "event": "speaker_joined",
  "payload": {
    "interim_results": true,
    "rescoring": true,
    "speaker": "Alice",
    "timestamp": 1614176526425
  },
  "ref": null,
  "topic": "conversation:acme_corp@conference"
}

Speaker left

When a speaker leaves the conversation, a speaker_left event is broadcast to all participants (including the one who is leaving):

{
  "topic": "conversation:acme_corp@conference",
  "event": "speaker_left",
  "payload": {
    "speaker": "Alice",
    "timestamp": 1614174592000
  },
  "ref": null
}

Leaving users also receive their own speaker_left event, so they know when all their audio has been processed and they can safely disconnect without missing any recognition events.