Language Model Factory
Objectives of the service
Each client, or even each use case of the same client, needs a certain type of vocabulary, often uncommon in everyday language, to be correctly recognised in transcriptions:
- Business-specific verbiage
- Product names
- Brand names
Adapting a language model to these specifics yields better transcriptions. The set of example data used for this adaptation is commonly referred to as a "corpus".
To make it easy to create a language model adapted to a given situation, Allo-Media has developed the Language Model Factory (LM Factory, or LMF). This tool allows users with no particular technical or linguistic knowledge to produce language models that can then be used throughout our platform. Given extracts of common sentences from the use case in question, together with proper names (brands, products), as input, it produces a specialised model as output. This model is much more effective than a generalist model on the themes for which it has been boosted. It can then be used on all our APIs:
- Real-time conversational transcription
- Human-robot real-time transcription
- Batch transcription
- Phone call analytics
Details and clarifications
Overview of the processing chain
- Input: expressions, verbiage, proper names
- Processing: by the LM Factory
- Output: Adapted language model, available to APIs and all our products
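As an illustration only, the three kinds of input listed above could be gathered in a small container before being handed over for processing. The sketch below is in Python; the `Corpus` class, its field names, and the one-entry-per-line layout are assumptions made for this example, not the format the LM Factory actually requires.

```python
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Hypothetical container for LM Factory input (all names are illustrative)."""
    expressions: list[str] = field(default_factory=list)   # common sentences from the use case
    verbiage: list[str] = field(default_factory=list)      # business-specific terms
    proper_names: list[str] = field(default_factory=list)  # brand and product names

    def to_lines(self) -> list[str]:
        """Flatten the corpus into one entry per line, keeping the input order."""
        return [*self.expressions, *self.verbiage, *self.proper_names]


# Made-up example entries, one per input kind.
corpus = Corpus(
    expressions=["I would like to change my subscription"],
    verbiage=["early termination fee"],
    proper_names=["Allo-Media"],
)
lines = corpus.to_lines()
```

Keeping the three kinds of input in separate fields makes it easier to review each of them with the account manager before they are merged.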
Data input protocols
As of now, data is exchanged directly with your account manager at Allo-Media. We plan to open an interface (API and web app) to ease these exchanges. To facilitate processing, the data must be prepared according to the rules described in the technical documentation. These include spelling rules, use of abbreviations, writing conventions, etc.
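The actual preparation rules are those of the technical documentation and are not reproduced here. As a rough sketch of what a preparation step could look like, the Python function below applies a few generic clean-up rules (Unicode normalisation, whitespace collapsing, removal of empty lines and duplicates). The function name `prepare_corpus` and the rules themselves are assumptions chosen for illustration, not the official conventions.

```python
import re
import unicodedata


def prepare_corpus(lines):
    """Apply illustrative clean-up rules (assumed, not the official ones)."""
    seen = set()
    cleaned = []
    for line in lines:
        # Normalise Unicode so that visually identical characters compare equal.
        text = unicodedata.normalize("NFC", line)
        # Collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
        if not text:
            continue  # skip empty lines
        key = text.casefold()
        if key in seen:
            continue  # skip duplicates that differ only by case or spacing
        seen.add(key)
        cleaned.append(text)
    return cleaned


# Made-up input showing stray spaces, a case-only duplicate, and an empty line.
raw = ["  Hello   world ", "hello world", "", "Allo-Media"]
prepared = prepare_corpus(raw)
```

The first spelling of each entry is kept, so proper names such as brands retain the capitalisation in which they were first supplied.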
Processing carried out
After validation with your account manager at Allo-Media, the data is sent for processing to build the language model. In addition to your corpus, the build includes continuous improvements from our R&D team, our linguists, and more generally the feedback we gather. The construction itself takes only a few dozen minutes. A test session specific to the resulting model can then be carried out, provided a corpus of reference audio recordings is available.
Data output protocols
Once the model is built, it can be made available across all our products and APIs.
By default, the input corpus is kept on our platform so that it can be updated and extended easily, and your language models along with it.