The Stream API protocol (V2)
You should read the overview first to understand how the API works.
In this article, we describe the protocol spoken over websockets (WS).
We use WS text messages with JSON content for all commands and events, except for the binary audio data, which is streamed in WS binary messages for efficiency (unlike version 1 of the protocol, there is no base64 encoding anymore).
Authentication
Authentication to the Stream H2H WebSocket API requires a client_id and a client_secret, provided by your account manager.
If you're using our Python SDK, you'll find more details on how to pass these values into it and start consuming the API.
If you're implementing the protocol yourself, you have to follow these steps:
- Request an access_token by sending a POST request to https://id.uh.live/realms/uhlive/protocol/openid-connect/token with the following payload:
{
"client_id": "{the client_id that was provided to you}",
"client_secret": "{the client_secret that was provided to you}",
"grant_type": "client_credentials"
}
You'll receive a response with an access_token, valid for 5 minutes.
You can copy this cURL command, and replace your client_id and client_secret to get your access token:
curl -L -X POST 'https://id.uh.live/realms/uhlive/protocol/openid-connect/token' -H 'Content-Type: application/x-www-form-urlencoded' --data-urlencode 'client_id=CLIENT_ID' --data-urlencode 'grant_type=client_credentials' --data-urlencode 'client_secret=CLIENT_SECRET' --data-urlencode 'scope=openid'
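If you are not using cURL, a minimal Python sketch of the same request (using the third-party requests library; the get_access_token helper name is ours) could look like this:

import requests  # third-party HTTP client

TOKEN_URL = "https://id.uh.live/realms/uhlive/protocol/openid-connect/token"

def get_access_token(client_id: str, client_secret: str) -> str:
    """Request a short-lived access_token (valid 5 minutes) with the client_credentials grant."""
    response = requests.post(
        TOKEN_URL,
        data={
            "client_id": client_id,
            "client_secret": client_secret,
            "grant_type": "client_credentials",
        },
    )
    response.raise_for_status()
    return response.json()["access_token"]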
- Pass the access_token as query parameter jwt={access_token} when initiating the WebSocket connection.
Of course, the WebSocket connection can last longer than the 5-minute validity of the access_token: you just have 5 minutes after requesting it to initiate the connection.
The websocket URL
wss://api.uh.live/socket/websocket?vsn=2.0.0&jwt=YOUR_ACCESS_TOKEN
The server sets a timeout of 60 seconds on all websocket connections. If you expect little activity, you should use websocket pings to keep the connection open (heartbeat).
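For illustration, here is a minimal Python sketch of opening the connection with the third-party websockets library; the client-side ping_interval acts as the heartbeat and stays well under the 60 second server timeout (the main coroutine name is ours):

import asyncio
import websockets  # third-party asyncio WebSocket client

WS_URL = "wss://api.uh.live/socket/websocket?vsn=2.0.0&jwt={token}"

async def main(access_token: str) -> None:
    # The websockets library sends a ping frame every ping_interval seconds.
    async with websockets.connect(WS_URL.format(token=access_token), ping_interval=20) as socket:
        ...  # join a conversation, stream audio, consume events (see the sketches below)

# asyncio.run(main(get_access_token(CLIENT_ID, CLIENT_SECRET)))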
JSON format for messages and events
The format is the same for messages sent by the client and for events published by the server. It's a list:
[join_ref, ref, topic, event, payload]
In which:
- ref is a unique string reference changed by the client each time it sends a new message.
- join_ref is the reference given when joining a conversation.
- topic is a string identifying the conversation (see Joining a conversation below) and must be present on all messages and events to/from the conversation.
- event is a string indicating the type of the event or message.
- payload is an object representing the specific payload of the event/message.
As the protocol is fully asynchronous, a client should not wait for server confirm events before sending its next messages. The ref is a reconciliation key the client can use later to identify which message a confirm event responds to.
For the sake of simplicity, all examples use a ref of 0 or 1 and a join_ref of 0.
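To make the framing concrete, here is a small Python sketch of building such a text message (the build_message helper and its ref counter are ours, not part of the API):

import json
from itertools import count

_refs = count(1)  # a fresh ref for every message we send

def build_message(join_ref: str, topic: str, event: str, payload: dict) -> str:
    """Serialize a [join_ref, ref, topic, event, payload] list to a JSON text frame."""
    return json.dumps([join_ref, str(next(_refs)), topic, event, payload])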
Joining a conversation
Scenario: your organization identifier is acme_corp and you want to join a conversation called conference as active speaker Alice.
You need to send a phx_join message:
[
"0",
"0",
"conversation:acme_corp@conference",
"phx_join",
{
"speaker": "Alice",
"readonly": false,
"audio_codec": "linear",
"model": "fr",
"country": "fr",
"interim_results": true,
"rescoring": true,
"origin": 1614099879211
}
]
The ref and join_ref must be the same on that command. All subsequent messages you'll send must have the same join_ref but a different ref each time.
Remember that this API is asynchronous, so don't wait for a response after joining before starting to stream your audio.
Please note how the topic is formatted: conversation:<organization identifier>@<conversation name>.
If you wanted to join the conversation as an observer, you would set the readonly flag to true. Otherwise set it to false and declare the audio codec you will use. If the audio codec is not specified, it defaults to linear. The possible codecs are:
- "linear": raw PCM, 16 bit signed little endian samples at 8 kHz sample rate;
- "g711a": G711 a-law at 8 kHz sample rate;
- "g711u": G711 μ-law at 8 kHz sample rate.
See Sending audio for more details.
As a conversation is private to your organization, the <organization identifier>
part is mandatory and checked against your token to grant access. Your organization identifier was given to you with your token when you subscribed.
Then follow the ASR parameters, which apply only if readonly is false.
If the join was successful then you get a response like this:
[
"0",
"0",
"conversation:acme_corp@conference",
"phx_reply",
{
"status": "ok",
"response": {}
}
]
If the join was not successful, you get an "error" status instead of "ok" and a detailed error message in payload.response.reason.
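Putting it together, a hedged Python sketch of sending the join on the websocket opened above (the variable names and the millisecond origin computation are our assumptions):

import json
import time

# Inside the `async with` block of the connection sketch above.
join_ref = "0"
topic = "conversation:acme_corp@conference"

await socket.send(json.dumps([
    join_ref,
    join_ref,  # ref and join_ref must be equal on the phx_join command
    topic,
    "phx_join",
    {
        "speaker": "Alice",
        "readonly": False,
        "audio_codec": "linear",
        "model": "fr",
        "country": "fr",
        "interim_results": True,
        "rescoring": True,
        "origin": int(time.time() * 1000),  # assumed: current Unix time in milliseconds
    },
]))
# Don't block waiting for the phx_reply: start streaming audio right away and
# reconcile the reply asynchronously through its ref when it arrives.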
Sending a binary audio chunk
Scenario: you successfully joined the conversation conference as active speaker Alice and you want to stream audio.
Streaming audio is as simple as sending a stream of audio chunk messages as you get them from your audio source (audio card, network, file).
There is a limit of 4 seconds of audio per chunk.
You send an audio chunk as a binary packed audio_chunk message. The binary format is:
8 bits | 8 bits | 8 bits | 8 bits | 8 bits | utf-8 string | utf-8 string | utf-8 string | utf-8 string | bytes |
---|---|---|---|---|---|---|---|---|---|
0 | join_ref size | ref size | topic size | 11 | join_ref | ref | topic | "audio_chunk" | audio chunk data |
The first byte is always 0.
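A hedged Python sketch of this packing (the pack_audio_chunk helper is ours); sending the returned bytes on the websocket produces the required WS binary message:

import struct

def pack_audio_chunk(join_ref: str, ref: str, topic: str, audio: bytes) -> bytes:
    """Pack an audio_chunk message following the binary layout above."""
    jr, r, t = join_ref.encode("utf-8"), ref.encode("utf-8"), topic.encode("utf-8")
    event = b"audio_chunk"
    header = struct.pack("5B", 0, len(jr), len(r), len(t), len(event))  # len(event) == 11
    return header + jr + r + t + event + audio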
The source audio must be encoded according to the declared codec:
- "linear": raw, 16 bit signed little endian samples at 8 kHz sample rate, so an audio chunk must contain an even number of bytes;
- "g711a": G711 a-law at 8 kHz sample rate;
- "g711u": G711 μ-law at 8 kHz sample rate.
You may receive an asynchronous error message if an error occurs during streaming, but the server won't acknowledge each packet individually.
An active speaker should continuously stream audio, starting immediately after joining. If there is no audio for 5 minutes, the speaker is forced to leave the conversation, so you are advised to stream silence when the speaker is muted. This timeout is independent of the websocket timeout: it is an application timeout that applies exclusively to audio, to release decoding resources when they are not used. By streaming continuously, including silence, you can easily prevent both timeouts (websocket and audio) from firing.
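For example, with the linear codec silence is simply zero-valued samples; a minimal sketch of keeping the stream alive while the speaker is muted (is_muted and next_ref are hypothetical helpers on your side) could be:

import asyncio

# 8000 samples/s * 2 bytes/sample * 0.1 s = 1600 bytes per 100 ms of linear silence
SILENCE_100MS = b"\x00" * 1600

async def stream_silence(socket, join_ref: str, topic: str) -> None:
    while is_muted():  # hypothetical predicate telling whether the speaker is muted
        await socket.send(pack_audio_chunk(join_ref, next_ref(), topic, SILENCE_100MS))
        await asyncio.sleep(0.1)  # pace the chunks roughly in real time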
This API is optimized for real-time audio processing, please don't abuse it for batch transcription of audio files: your decoding resources will be throttled. We offer dedicated APIs for batch processing. Please contact our commercial support.
Leaving a conversation
You will automatically leave the room if you disconnect, but you can also explicitly leave the room without closing your connection by using the phx_leave message:
[
"0",
"1",
"conversation:acme_corp@conference",
"phx_leave",
{}
]
You will then receive a server confirm, and later a speaker_left event when all the audio has been processed.
You are encouraged to explicitly leave a conversation and wait for your speaker_left event (see below) before disconnecting. That way, you are sure not to miss any of your own recognition events.
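Continuing the Python sketches above, leaving and then draining events until your own speaker_left arrives could look like this (the ref value is arbitrary):

import json

# Inside the `async with` block of the connection sketch above.
await socket.send(json.dumps([join_ref, "42", topic, "phx_leave", {}]))

# Keep reading frames until our own speaker_left event arrives: at that point all
# our audio has been processed and we can safely close the connection.
async for frame in socket:
    if isinstance(frame, str):
        _, _, _, event, payload = json.loads(frame)
        if event == "speaker_left" and payload.get("speaker") == "Alice":
            break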
Recognition events
As the transcription process progresses, all participants will receive recognition events of type words_decoded for interim results, and segment_decoded for definitive segment transcripts. An utterance_id is provided to identify related recognition events.
In the same scenario as before, those events look like this (only the event type changes between the two):
[
"0",
"1",
"conversation:acme_corp@conference",
"segment_decoded",
{
"lang": "fr",
"speaker": "Alice",
"confidence": 0.999994,
"end": 1614174590704,
"length": 1110,
"start": 1614174589594,
"transcript": "je vous entends très bien",
"utterance_id": 294,
"words": [
{
"confidence": 0.99997,
"end": 1614174589804,
"length": 210,
"start": 1614174589594,
"word": "je"
},
{
"confidence": 1.0,
"end": 1614174589954,
"length": 150,
"start": 1614174589804,
"word": "vous"
},
{
"confidence": 1.0,
"end": 1614174590224,
"length": 270,
"start": 1614174589954,
"word": "entends"
},
{
"confidence": 1.0,
"end": 1614174590434,
"length": 210,
"start": 1614174590224,
"word": "très"
},
{
"confidence": 1.0,
"end": 1614174590704,
"length": 270,
"start": 1614174590434,
"word": "bien"
}
]
}
]
The timestamps of the segment and words are integers in milliseconds, expressed as Unix time computed from the beginning of the audio according to the provided origin. Time follows the audio clock, so if you want to keep real-time timings, you should send silence when a speaker is muted instead of suspending the audio stream.
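A hedged Python sketch of dispatching these recognition events (the handler name is ours):

import json

def handle_text_frame(frame: str) -> None:
    join_ref, ref, topic, event, payload = json.loads(frame)
    if event == "words_decoded":
        print("interim:", payload["transcript"])
    elif event == "segment_decoded":
        # start/end are Unix milliseconds on the audio clock, relative to the declared origin
        print(f"[{payload['start']}..{payload['end']}] {payload['speaker']}: {payload['transcript']}")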
Other events
To learn about Enrich events, please check the dedicated documentation page.
Speaker joined
When a speaker joins the conversation, a speaker_joined event is broadcast to all other participants:
[
"0"
"1",
"conversation:acme_corp@conference",
"speaker_joined",
{
"interim_results": true,
"rescoring": true,
"speaker": "Alice",
"timestamp": 1614176526425
}
]
Speaker left
When a speaker leaves the conversation, a speaker_left event is broadcast to all participants (including the one who is leaving):
[
"0",
"1",
"conversation:acme_corp@conference",
"speaker_left",
{
"speaker": "Alice",
"timestamp": 1614174592000
}
]
Leaving users also receive their own speaker_left event, so that they know when all their audio has been processed and they may safely disconnect without losing decoding events.