The Stream API protocol (V2)

You should read the overview first to understand how the API works.

In this article, we describe the protocol spoken over websockets (WS).

We use WS text messages with JSON content for all commands and events, except for the binary audio data, which is streamed in WS binary messages for efficiency (unlike version 1 of the protocol, there is no base64 encoding anymore).

Authentication

Authentication to the Stream H2H WebSocket API requires a client_id and client_secret, provided by your account manager.

If you're using our Python SDK, you'll find more details there on how to pass these values into it and start consuming the API.

If you're implementing the protocol yourself, you have to follow these steps:

  1. Request an access_token by sending a POST request to https://id.uh.live/realms/uhlive/protocol/openid-connect/token with the following fields (sent as application/x-www-form-urlencoded, as in the cURL example below):
{
"client_id": "{the client_id that was provided to you}",
"client_secret": "{the client_secret that was provided to you}",
"grant_type": "client_credentials"
}

You'll receive a response with an access_token, valid for 5 minutes.

You can copy this cURL command and replace CLIENT_ID and CLIENT_SECRET with your own values to get your access token:

curl -L -X POST 'https://id.uh.live/realms/uhlive/protocol/openid-connect/token' -H 'Content-Type: application/x-www-form-urlencoded' --data-urlencode 'client_id=CLIENT_ID' --data-urlencode 'grant_type=client_credentials' --data-urlencode 'client_secret=CLIENT_SECRET' --data-urlencode 'scope=openid'
  2. Pass the access_token as the query parameter jwt={access_token} when initiating the WebSocket connection.

The WebSocket connection can, of course, last longer than the 5-minute validity of the access_token: the token only needs to be valid when you initiate the connection, so you have 5 minutes after requesting it to connect.
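
If you're working in Python, the token request might look like the following sketch (it uses the third-party requests library; the helper name is ours):

import requests

TOKEN_URL = "https://id.uh.live/realms/uhlive/protocol/openid-connect/token"

def get_access_token(client_id: str, client_secret: str) -> str:
    # The fields are sent as application/x-www-form-urlencoded, like in the cURL example.
    response = requests.post(
        TOKEN_URL,
        data={
            "client_id": client_id,
            "client_secret": client_secret,
            "grant_type": "client_credentials",
            "scope": "openid",
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]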

The websocket URL

wss://api.uh.live/socket/websocket?vsn=2.0.0&jwt=YOUR_ACCESS_TOKEN

The server sets a timeout of 60 seconds on all websocket connections. If you expect little activity, you should use websocket pings to keep the connection open (heartbeat).
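
As an illustration, here is a minimal connection sketch using the third-party websockets package for Python (the 20-second ping interval is an arbitrary value below the 60-second server timeout):

import asyncio
import websockets

WS_URL = "wss://api.uh.live/socket/websocket?vsn=2.0.0&jwt={token}"

async def connect(access_token: str):
    # Regular pings keep the connection alive when little data is exchanged.
    async with websockets.connect(WS_URL.format(token=access_token), ping_interval=20) as socket:
        ...  # join a conversation, stream audio, consume events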

JSON format for messages and events

The format is the same for messages sent by the client and for events published by the server. It's a list:

[join_ref, ref, topic, event, payload]

In which:

  • ref is a unique string reference that the client changes each time it sends a new message.
  • join_ref is the reference given when joining a conversation.
  • topic is a string identifying the conversation (see below Joining a conversation) and must be present on all messages and events to/from the conversation.
  • event is a string indicating the type of the event or message.
  • payload is an object representing the specific payload of the event/message.

As the protocol is fully asynchronous, a client should not wait for server confirm events before sending its next messages. The ref is a reconciliation key the client can use later to identify which message a confirm event responds to.

For the sake of simplicity, all examples use a ref of 0 or 1 and join_ref of 0.
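
For example, a small helper to build such messages in Python could look like this (the counter scheme is just one way to produce unique refs):

import itertools
import json

_refs = itertools.count(1)

def encode_message(join_ref: str, topic: str, event: str, payload: dict) -> str:
    # [join_ref, ref, topic, event, payload], with a fresh ref for every message
    return json.dumps([join_ref, str(next(_refs)), topic, event, payload])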

Joining a conversation

Scenario: your organization identifier is acme_corp and you want to join a conversation called conference as active speaker Alice.

You need to send a phx_join message:

[
"0",
"0",
"conversation:acme_corp@conference",
"phx_join",
{
"speaker": "Alice",
"readonly": false,
"audio_codec": "linear",
"model": "fr",
"country": "fr",
"interim_results": true,
"rescoring": true,
"origin": 1614099879211
}
]

The ref and join_ref must be the same on that command. All subsequent messages you'll send must have the same join_ref but a different ref each time.

Remember that this API is asynchronous: don't wait for a response after joining before you start streaming your audio.

Please note how the topic is formatted: conversation:<organization identifier>@<conversation name>.

If you wanted to join the conversation as an observer, you would set the readonly flag to true. Otherwise set it to false and declare the audio codec you will use. If the audio codec is not specified, it defaults to linear. The possible codecs are:

  • "linear": raw PCM 16 bit signed little endian samples at 8khz rate;
  • "g711a": G711 a-law at 8khz sample rate;
  • "g711u": G711 μ-law at 8khz sample rate.

See Sending audio for more details.

As a conversation is private to your organization, the <organization identifier> part is mandatory and checked against your token to grant access. Your organization identifier was given to you with your token when you subscribed.

Then follow the ASR parameters, which apply only if readonly is false.

If the join was successful then you get a response like this:

[
"0",
"0",
"conversation:acme_corp@conference",
"phx_reply",
{
"status": "ok",
"response": {}
}
]

If the join was not successful, you get an "error" status instead of "ok" and a detailed error message in payload.response.reason.
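
As an illustration, sending that command with the websockets-based connection from the earlier sketches could look like this (the function name and parameter defaults are ours):

import json
import time

async def join(socket, topic: str, speaker: str) -> str:
    join_ref = "0"
    await socket.send(json.dumps([
        join_ref,
        join_ref,  # ref and join_ref must be equal on this command
        topic,
        "phx_join",
        {
            "speaker": speaker,
            "readonly": False,
            "audio_codec": "linear",
            "model": "fr",
            "country": "fr",
            "interim_results": True,
            "rescoring": True,
            "origin": int(time.time() * 1000),  # beginning of the audio, in Unix milliseconds
        },
    ]))
    # No need to block here: the phx_reply arrives asynchronously and can be
    # reconciled with this ref in your event consumer (check payload["status"]).
    return join_ref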

Sending a binary audio chunk

Scenario: you successfully joined the conversation conference as active speaker Alice and you want to stream audio.

Streaming audio is as simple as sending a stream of audio chunk messages as you get them from your audio source (audio card, network, file).

There is a limit of 4 seconds of audio per chunk.

You send an audio chunk as a binary packed audio_chunk message. The binary format is:

  • 8 bits: 0;
  • 8 bits: join_ref size;
  • 8 bits: ref size;
  • 8 bits: topic size;
  • 8 bits: 11 (the size of the event name "audio_chunk");
  • utf-8 string: join_ref;
  • utf-8 string: ref;
  • utf-8 string: topic;
  • utf-8 string: "audio_chunk";
  • bytes: the audio chunk data.

The first byte is always 0.
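
As an illustration, packing such a frame in Python is straightforward (the helper name is ours):

def pack_audio_chunk(join_ref: str, ref: str, topic: str, audio: bytes) -> bytes:
    event = b"audio_chunk"
    join_ref_b, ref_b, topic_b = join_ref.encode(), ref.encode(), topic.encode()
    # byte 0 is always 0, then the sizes of the string fields that follow
    header = bytes([0, len(join_ref_b), len(ref_b), len(topic_b), len(event)])  # len(event) == 11
    return header + join_ref_b + ref_b + topic_b + event + audio

The resulting bytes are then sent as a WS binary message, for example with await socket.send(pack_audio_chunk("0", "2", topic, chunk)) when using the websockets package.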

The source audio must be encoded according to the declared codec:

  • "linear": in raw, 16 bit signed little endian samples at 8khz rate. So an audio chunk must contain an even number of bytes.
  • "g711a": G711 a-law at 8khz sample rate;
  • "g711u": G711 μ-law at 8khz sample rate.

You may receive an asynchronous error message if an error occurs during streaming, but the server won't acknowledge each packet individually.

An active speaker should continuously stream audio, starting immediately after joining. If no audio is received for 5 minutes, the speaker is forced to leave the conversation, so you are advised to stream silence when the speaker is muted. This timeout is independent from the websocket timeout: it is an application timeout that applies exclusively to audio, used to release decoding resources when they are not in use. By streaming continuously, including silence, you easily prevent both timeouts (websocket and audio) from firing.
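
For instance, with the linear codec a muted speaker can keep the stream alive by sending zero-valued samples. A rough sketch, reusing pack_audio_chunk from the earlier example (the source object and its attributes are hypothetical):

import asyncio

SILENCE_20MS = b"\x00\x00" * 160  # 160 zero samples of 16 bits = 20 ms at 8 kHz

async def stream(socket, join_ref: str, topic: str, source) -> None:
    ref = 1
    while source.is_open:                       # hypothetical audio source
        ref += 1
        chunk = SILENCE_20MS if source.muted else source.read_20ms()
        await socket.send(pack_audio_chunk(join_ref, str(ref), topic, chunk))
        await asyncio.sleep(0.02)               # stay on the audio clock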

This API is optimized for real-time audio processing; please don't abuse it for batch transcription of audio files, or your decoding resources will be throttled. We offer dedicated APIs for batch processing: please contact our commercial support.

Leaving a conversation

You automatically leave the conversation when you close your connection, but you can also explicitly leave it without closing the connection by sending the phx_leave message:

[
"0",
"1",
"conversation:acme_corp@conference",
"phx_leave",
{}
]

You will then receive a server confirm, and later, a speaker_left event when all the audio has been processed.

You are encouraged to explicitly leave a conversation and wait for your speaker_left event (see below) before disconnecting. That way, you are sure not to miss any of your recognition events.
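
A possible sketch of that shutdown sequence in Python (the ref value is arbitrary):

import json

async def leave(socket, join_ref: str, topic: str, speaker: str) -> None:
    await socket.send(json.dumps([join_ref, "99", topic, "phx_leave", {}]))
    # Drain events until our own speaker_left arrives; then it is safe to disconnect.
    async for raw in socket:
        _, _, _, event, payload = json.loads(raw)
        if event == "speaker_left" and payload.get("speaker") == speaker:
            break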

Recognition events

As the transcription process progresses, all participants receive recognition events of type words_decoded for interim results, and segment_decoded for definitive segment transcripts. An utterance_id is provided to identify related recognition events.

In the same scenario as before, those events look like this (only the event type changes between the two):

[
"0",
"1",
"conversation:acme_corp@conference",
"segment_decoded",
{
"lang": "fr",
"speaker": "Alice",
"confidence": 0.999994,
"end": 1614174590704,
"length": 1110,
"start": 1614174589594,
"transcript": "je vous entends très bien",
"utterance_id": 294,
"words": [
{
"confidence": 0.99997,
"end": 1614174589804,
"length": 210,
"start": 1614174589594,
"word": "je"
},
{
"confidence": 1.0,
"end": 1614174589954,
"length": 150,
"start": 1614174589804,
"word": "vous"
},
{
"confidence": 1.0,
"end": 1614174590224,
"length": 270,
"start": 1614174589954,
"word": "entends"
},
{
"confidence": 1.0,
"end": 1614174590434,
"length": 210,
"start": 1614174590224,
"word": "très"
},
{
"confidence": 1.0,
"end": 1614174590704,
"length": 270,
"start": 1614174590434,
"word": "bien"
}
]
}
]

The timestamps of the segment and words are integers in milliseconds, expressed as Unix time and computed from the beginning of the audio according to the provided origin. Time follows the audio clock, so if you want to keep real-time timings, you should send silence when a speaker is muted instead of suspending the audio stream.
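
Since the timestamps are plain Unix milliseconds anchored on the origin you provided at join time, converting them is straightforward; for example:

from datetime import datetime, timezone

def to_datetime(timestamp_ms: int) -> datetime:
    # Wall-clock time of a word or segment boundary.
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)

def offset_in_audio_ms(timestamp_ms: int, origin: int) -> int:
    # Position in the audio stream, in milliseconds since its beginning.
    return timestamp_ms - origin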

Other events

To learn about Enrich events, please check the dedicated documentation page.

Speaker joined

When a speaker joins the conversation, a speaker_joined event is broadcast to all other participants:

[
"0"
"1",
"conversation:acme_corp@conference",
"speaker_joined",
{
"interim_results": true,
"rescoring": true,
"speaker": "Alice",
"timestamp": 1614176526425
}
]

Speaker left

When a speaker leaves the conversation, a speaker_left event is broadcast to all participants (including the one who is leaving):

[
"0",
"1",
"conversation:acme_corp@conference",
"speaker_left",
{
"speaker": "Alice",
"timestamp": 1614174592000
}
]

Leaving users also receive their own speaker_left event, so that they know when all their audio has been processed and they may safely disconnect without losing recognition events.
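
To summarize, an event consumer can simply dispatch on the event field. A minimal sketch, assuming the websockets-based connection from the earlier examples:

import json

async def consume(socket) -> None:
    async for raw in socket:
        join_ref, ref, topic, event, payload = json.loads(raw)
        if event == "words_decoded":
            print("interim:", payload["transcript"])
        elif event == "segment_decoded":
            print("final:", payload["transcript"])
        elif event == "speaker_joined":
            print(payload["speaker"], "joined")
        elif event == "speaker_left":
            print(payload["speaker"], "left")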