Language Model Factory

Objectives of the service

Each client, or even each use case of the same client, needs a certain type of vocabulary, often uncommon in everyday language, to be correctly recognised in transcriptions:

  • Business-specific verbiage
  • Product names
  • Brand names

Adapting a language model to these specificities yields a better transcription. The data set used for this adaptation is commonly referred to as a "corpus".

Allo-Media's response

To make it easy to create a language model adapted to a given situation, Allo-Media has developed the Language Model Factory (LM Factory, or LMF). This tool allows users with no particular technical or linguistic knowledge to produce language models that can then be used throughout our platform. Given extracts of common sentences from the use case in question, together with proper names (brands, products), as input, it produces a specialised model as output. On the themes for which it has been boosted, this model is much more effective than a generalist model. The new language model can then be used on all our APIs.

Details and clarifications

Overview of the processing chain

  • Input: expressions, verbiage, proper names
  • Processing: performed by the LM Factory
  • Output: Adapted language model, available to APIs and all our products

Data input protocols

As of now, data is exchanged directly with the account manager at Allo-Media. We plan to open an interface (API and webapp) to ease these exchanges. To facilitate processing, the data must be prepared according to the rules described in the technical documentation. These include spelling rules, use of abbreviations, writing conventions, etc.
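As an illustration of what such preparation might look like, here is a minimal Python sketch of a clean-up pass over corpus sentences. The rules it applies (whitespace normalisation, abbreviation expansion, de-duplication), the function name prepare_corpus and the abbreviation table are all assumptions made for the example. The actual rules are those of the technical documentation.

```python
import re

# Hypothetical abbreviation table, for illustration only. The real writing
# conventions are defined in the technical documentation.
ABBREVIATIONS = {"appt": "appointment", "ref": "reference"}


def prepare_corpus(lines):
    """Return cleaned, de-duplicated corpus sentences in their original order."""
    seen = set()
    cleaned = []
    for line in lines:
        # Collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", line).strip()
        if not text:
            continue  # skip empty lines
        # Expand known abbreviations word by word (assumed rule);
        # other words, including brand names, keep their original casing.
        words = [ABBREVIATIONS.get(w.lower(), w) for w in text.split(" ")]
        text = " ".join(words)
        # Keep only the first occurrence of each sentence.
        if text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

A pass of this kind could be run on the corpus before it is handed to the account manager, so that both sides review the same cleaned sentences.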

Processing carried out

After validation with the account manager at Allo-Media, the data is sent for processing to build the language model. In addition to your corpus, the build includes continuous improvements from our R&D, our linguists, and more generally from the feedback we gather. The construction itself takes only a few dozen minutes. A test session specific to the newly built model can then be carried out, provided that a corpus of reference audio recordings is available.
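A test session of this kind needs a way to score transcriptions against the reference corpus. This page does not say which metric is used. One common choice in speech recognition is the word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal Python version, and the function name word_error_rate is an assumption made for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Computing this score for the same reference recordings with a generalist model and with the adapted model gives a simple way to compare the two. A lower value means fewer word errors.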

Data output protocols

Once the model is built, it can be made available across all our products and APIs.

Regulatory compliance

By default, the input corpus is kept on our platform to make it easier to update and evolve, and therefore to update and evolve your language models.