QuickLM: A Simple Language Model Compiler

This is an experimental language model compiler that produces a conventional backed-off trigram language model from a very small corpus of text data. If you're interested in generating models for use with Sphinx-II, please go to this page. QuickLM is part of the Sphinx Knowledge Tools.

To use QuickLM, do the following:

  1. Create a corpus. By convention in spoken language systems, a corpus consists of "sentences", which nominally correspond to individual utterances that a speaker might produce in the context of a particular task. For example, in the reference North American Business News dictation task, "sentences" correspond to actual sentences in published newswire stories. In a command system, these might correspond to individual commands.
  2. Type in, or cut and paste, the sentences (one sentence per line) into the text field below.
  3. Click on Compile.
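As a concrete illustration of step 1, a corpus for a small command task might look like the following (one utterance per line; the utterances here are invented for the example, not taken from any real task):

```
TURN ON THE LIGHTS
TURN OFF THE LIGHTS
SET THE TEMPERATURE TO SEVENTY
WHAT IS THE TEMPERATURE
```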

You will receive a standard-form trigram language model with backoffs. The backoffs are computed using a ratio discount of 0.5, applied to all counts. For our purposes, we have found this to give performance equal to or better than standard discounting schemes (such as absolute discounting). This holds for very small domains/corpora but may not hold for conventional corpora, which are usually substantially larger. (For current purposes, "small" means fewer than 50k words.) Smaller discounts will decrease recognition performance.
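The discount-and-back-off idea above can be sketched in a few lines of Python. This is an illustrative sketch, not QuickLM's actual code: to stay short it builds only a bigram (not trigram) model, and the sentence markers `<s>`/`</s>` and the exact renormalization of the reserved mass are my assumptions about the scheme, not details taken from this page.

```python
from collections import Counter

# Ratio discount of 0.5, as described above: keep half of each maximum-
# likelihood estimate and reserve the other half for backed-off n-grams.
DISCOUNT = 0.5

def train_bigram_backoff(sentences):
    """Toy bigram model with ratio discounting and Katz-style backoff."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    vocab = sorted(unigrams)
    total = sum(unigrams.values())

    # Discounted unigram probabilities; the reserved mass is spread uniformly.
    p_uni = {w: DISCOUNT * c / total for w, c in unigrams.items()}
    leftover = 1.0 - sum(p_uni.values())
    for w in vocab:
        p_uni[w] += leftover / len(vocab)

    # Discounted bigram probabilities plus one backoff weight per history.
    p_bi, alpha = {}, {}
    for h in vocab:
        seen = {w: c for (a, w), c in bigrams.items() if a == h}
        if not seen:
            alpha[h] = 1.0  # nothing observed after h: back off with full weight
            continue
        n = sum(seen.values())
        for w, c in seen.items():
            p_bi[(h, w)] = DISCOUNT * c / n
        # Renormalize the reserved mass over the words never seen after h.
        alpha[h] = (1.0 - sum(p_bi[(h, w)] for w in seen)) / (
            1.0 - sum(p_uni[w] for w in seen)
        )

    def prob(h, w):
        """P(w | h): the discounted bigram if seen, else alpha(h) * P(w)."""
        return p_bi.get((h, w), alpha[h] * p_uni[w])

    return vocab, prob
```

By construction, the probabilities for each history sum to one: the seen bigrams keep half their relative-frequency mass, and the backoff weight scales the unigram distribution to fill exactly the remainder.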

The language model format corresponds to that currently in use by the ARPA speech community (and by many other researchers). You can save it to a file from your browser and use it as you would any other language model.
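For reference, an ARPA-format model file has the general shape below. The counts and log10 probabilities here are invented for illustration; each line gives a probability, the n-gram, and (where applicable) a backoff weight:

```
\data\
ngram 1=4
ngram 2=5
ngram 3=3

\1-grams:
-0.6021 HELLO	-0.3010
-0.9031 WORLD	-0.2218

\2-grams:
-0.3010 HELLO WORLD	-0.1761

\3-grams:
-0.4771 <s> HELLO WORLD

\end\
```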


This procedure is provided to illustrate the relationship between a small text corpus and its corresponding language model.

Because the resulting model is based on a small corpus, it may not be suitable for use in your particular application. This compiler was produced in the course of research into the problem of creating language models from very small corpora.

(At some point I will include a detailed explanation of how the model is computed.)