This is an experimental language-model compiler that produces a conventional backed-off trigram language model from a very small corpus of text data. If you're interested in generating models for use with Sphinx-II, please go to this page. QuickLM is part of the Sphinx Knowledge Tools.
To use QuickLM, do the following:
You will receive a standard-format trigram language model with backoffs. The backoffs are computed using a ratio discount of 0.5, applied to all counts. For our purposes we have found this to give reasonable performance, that is, equal to or better than standard discounting schemes such as absolute discounting. This holds for very small domains/corpora but may not hold for conventional corpora, which are usually substantially larger (for current purposes, "small" means fewer than 50k words). Smaller discounts will decrease recognition performance.
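To make the idea concrete, here is a minimal sketch of ratio discounting and backoff-weight estimation, shown at the bigram level for brevity (the trigram case is analogous). This is an illustration of the general technique, not QuickLM's exact implementation; the function name and the choice to leave unigrams undiscounted are assumptions made for this example.

```python
from collections import Counter

def bigram_backoff_lm(tokens, discount=0.5):
    """Sketch of a backed-off bigram model with a fixed ratio discount.

    Each maximum-likelihood bigram probability is scaled by
    (1 - discount); the freed probability mass becomes the backoff
    weight for that history.  Illustrative only, not QuickLM's code.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    # Unigram probabilities (left undiscounted here for simplicity).
    p_uni = {w: c / total for w, c in unigrams.items()}

    # Discounted bigram probabilities: p(w2|w1) = (1 - d) * c(w1,w2) / c(w1)
    p_bi = {(w1, w2): (1 - discount) * c / unigrams[w1]
            for (w1, w2), c in bigrams.items()}

    # Backoff weight alpha(w1): the held-out mass for history w1,
    # normalized by the unigram mass of words never seen after w1.
    alpha = {}
    for w1 in unigrams:
        seen = {w2 for (a, w2) in p_bi if a == w1}
        held_out = 1.0 - sum(p for (a, _), p in p_bi.items() if a == w1)
        denom = 1.0 - sum(p_uni[w] for w in seen)
        alpha[w1] = held_out / denom if denom > 0 else 0.0
    return p_uni, p_bi, alpha
```

On a toy corpus such as "a b a b a", the discount holds back half of each bigram's maximum-likelihood mass, and the backoff weights redistribute it over the unseen continuations.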
The language model format corresponds to the one currently in use by the ARPA speech community (and by many other researchers). You can save the output from your browser into a file and use it as you would any other language model.
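For readers unfamiliar with the ARPA format, a model file is organized as shown below. The n-gram counts and words here are hypothetical toy values, not output from QuickLM; each entry line is a base-10 log probability, the n-gram itself, and (where a longer history exists) a base-10 log backoff weight.

```
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.5228 <s>   -0.3010
-0.5228 </s>
-0.2218 hello -0.3010

\2-grams:
-0.1761 <s> hello
-0.1761 hello </s>

\end\
```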
This procedure is provided to illustrate the relationship between a small text corpus and its corresponding language model.
Because the model is based on a small corpus, it may not be suitable for your particular application. This compiler was produced in the course of research into the problem of creating language models from very small corpora.
(At some point I will include a detailed explanation of how the model is computed.)