Getting started

What are we building here?

This Getting Started guide walks you through building a very simple app that
takes audio as input and displays the transcription as output.

You can download the Python script we're going to build here.

Get the Python SDK

The SDK is not available on PyPI yet; you have to download it from here. It is built for Python 3.6 and higher and can be installed with pip:

$ pip install -U uhlive-0.12.0-py3-none-any.whl

Connect to the server

First things first, you need to initiate a connection to our servers. Our API URL is: wss://api.uh.live

Make sure you have valid credentials (token and identifier); otherwise you'll receive an error. We recommend passing credentials as environment variables, or persisting them in a database that is accessed at runtime. You can add them to the environment by starting your app as:

UHLIVE_API_TOKEN="some-token" UHLIVE_API_ID="your_identifier" UHLIVE_API_URL="wss://api.uh.live" python myapp.py

Here is an example of connecting to the API:

# connect with SDK
import os
import time  # used by the later snippets to pace audio and timestamp events

from uhlive import Client
from uhlive.events import *

uhlive_url = os.environ["UHLIVE_API_URL"]
uhlive_id = os.environ["UHLIVE_API_ID"]
uhlive_token = os.environ["UHLIVE_API_TOKEN"]
client = Client(url=uhlive_url, identifier=uhlive_id, token=uhlive_token)
client.connect()

Join a conversation

With a single connection, you can join several conversations. Think of a conversation as a conference call instance, or a room on a chat service. Only people who have joined the conversation can access the exchanged data. For this example, we'll use a conversation with only one speaker: Alice.

# join a conversation
client.join_conversation("my-conversation-id", speaker="Alice")

# you can also join a conversation as an observer
client.join_conversation("my-conversation-id", speaker="Bob", readonly=True)

# if you are an active speaker, you can explicitly set some ASR parameters
client.join_conversation("my-conversation-id", speaker="Bob", model="fr", interim_results=False, rescoring=True, origin=int(time.time()*1000))

You can read more about ASR parameters here.

You can name the conversation however you like.

The speaker parameter is optional for the join_conversation method. If you don't provide one, the SDK will generate a random speaker name for you.
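
For example, to let the SDK pick a speaker name, simply omit the argument (same conversation ID as above):

# Join without naming the speaker; the SDK generates a random name
client.join_conversation("my-conversation-id")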

Send audio and receive transcription

This is where the fun begins. Now that we have joined a conversation, we can send audio and receive its transcription.

When you join a conversation as a participant, you must be ready to stream audio immediately. Any gap in the stream longer than a few seconds will be interpreted as a lost connection and will terminate your session.
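
If your audio source can stall (waiting on a microphone buffer, for instance), one common workaround is to pad the gaps with PCM silence. Here is a minimal sketch, assuming zeroed 16-bit samples are acceptable as silence; send_silence is a hypothetical helper, not part of the SDK:

import time

# At 8 kHz, 16-bit mono, one second of audio is 16000 bytes
SILENCE_CHUNK = b"\x00" * 8000  # 0.5 s of PCM silence

def send_silence(client, seconds):
    # Keep the stream alive by sending zeroed PCM chunks in real time
    for _ in range(int(seconds / 0.5)):
        client.send_audio_chunk(SILENCE_CHUNK)
        time.sleep(0.5)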

Send audio

We currently only support audio with a sample rate of 8 kHz and a bit depth of 16 bits.

First, take your audio_file.wav file and convert it to a raw audio file using ffmpeg:

$ ffmpeg -v fatal -hide_banner -i audio_file.wav -y -vn -acodec pcm_s16le -ar 8000 -ac 1 audio_file.raw

Don't have an audio file at hand? Download this audio file as an example.
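
If you want to check a source file's format programmatically before converting it, the standard library's wave module can read the header. A quick sketch, independent of the SDK:

import wave

# Inspect the WAV header to see what ffmpeg will have to convert
with wave.open("audio_file.wav", "rb") as wav:
    print(f"sample rate: {wav.getframerate()} Hz")
    print(f"bit depth:   {wav.getsampwidth() * 8} bits")
    print(f"channels:    {wav.getnchannels()}")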

Now you can stream the audio for transcription (here from an audio file):

# Stream audio for transcription
with open("/path/to/audio_file.raw", "rb") as audio_file:
    while True:
        audio_chunk = audio_file.read(8000)  # 8000 bytes = 0.5 s of 8 kHz 16-bit mono audio
        if not audio_chunk:
            break
        client.send_audio_chunk(audio_chunk)
        time.sleep(0.5)  # Simulate real time audio

Receive transcription

The transcription arrives as a succession of transcription events, which are either intermediate transcriptions of words or the final transcription of a segment of audio.

# Receive an event from the API
event = client.get_event()
if isinstance(event, WordsDecoded):
    print("Intermediate transcript of a segment of audio")
    print("Happens many times per second with incremental results")

if isinstance(event, SegmentDecoded):
    print("Final transcript of a segment of audio")
    print("Happens less frequently and exposes the final result")

You can display the whole transcription with something like:

# Print each transcribed segment
while True:
    event = client.get_event()  # There is a 3-second default timeout
    if not event:
        print("There are no more events")
        break

    if isinstance(event, SegmentDecoded):
        print(f"[{event.speaker} - {event.utterance_id}] {event.transcript}")
        # You can also dig into the event payload to get more data
        transcript = event.payload["transcript"]  # also as event.transcript
        start = event.payload["start"]  # also as event.start
        end = event.payload["end"]  # also as event.end
        print(f"[{start}->{end}]: {transcript}")

When a speaker joins the conversation, a SpeakerJoined event is emitted.
When a speaker leaves the conversation, a SpeakerLeft event is emitted.
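
You can handle them like any other event; this sketch assumes both classes expose a speaker attribute, as the transcription events do:

# React to participants entering and leaving
if isinstance(event, SpeakerJoined):
    print(f"{event.speaker} joined the conversation")
elif isinstance(event, SpeakerLeft):
    print(f"{event.speaker} left the conversation")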

Receive enrich events

From time to time, when an NLU agent finds interesting information in the conversation, it will emit enrich events.

In Python, the class for each event has the same name, but in CamelCase.

# Print EntityNumberFound events
while True:
    event = client.get_event()  # There is a 3-second default timeout
    if not event:
        print("There are no more events")
        break

    if isinstance(event, EntityNumberFound):
        print(f"[{event.speaker}] {event.canonical}")
        # You can also dig into the event payload to get more data
        annotation = event.payload["annotation"]
        canonical = annotation.get("canonical", "no alternate representation")  # also as event.canonical
        original = annotation["original"]  # also as event.original
        start = event.payload["start"]  # also as event.start
        end = event.payload["end"]  # also as event.end
        print(f"[{start}->{end}]: {canonical} replaces {original}")

Leave conversation

To cleanly leave a conversation without missing any transcription events from your audio stream, you should use the .leave_conversation method and wait for your own SpeakerLeft event before disconnecting.
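
Put together, a clean shutdown could look like the sketch below; the exact signatures of leave_conversation and the final disconnect call are assumptions and may differ in your SDK version:

# Announce our departure, then drain the remaining events
client.leave_conversation()
while True:
    event = client.get_event()
    if not event:
        break
    if isinstance(event, SpeakerLeft) and event.speaker == "Alice":
        break  # our own SpeakerLeft: nothing more to expect

client.disconnect()  # assumed method; close the connection as your SDK requires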

There you go! You are now able to send audio and receive its transcription.

To dive in deeper, you can browse the API Reference documentation.

If you are stuck, want to suggest something, or just want to say hello, send us an e-mail at support@allo-media.fr.