Frequently Asked Questions — lmtool
1.4) What is lmtool not good for?
1.4.1) What are my alternatives?
1.5) Where did the lmtool come from?
2.2) What does the "advanced" version of lmtool do?
2.3) What are some technical limitations of lmtool?
2.4) Why do I need to condition my corpus?
3.1) What is a language model?
3.2) Why does lmtool produce statistical language models?
3.3) What is the format of the language model file?
3.4) How does the lmtool language modeler work?
3.5) Can I get a copy of quicklm?
4.1) How are lmtool pronunciations generated?
4.2) What are some of the technical limitations of pronunciation generation?
4.3) Why is lmtool giving me this totally wrong pronunciation?
lmtool is a web-based tool that allows users to quickly compile two text-based
components needed for using an ASR decoder. These are the language model
and the lexical model (referred to variously as the 'pronouncing
dictionary' or simply the 'dictionary'). The remaining component is the
acoustic model. Together these are sometimes referred to as the decoder knowledge base.
Both language modeling and pronunciation modeling are multi-step procedures and are typically embodied in a set of (shell) scripts. The purpose of lmtool is to hide this complexity and to simplify knowledge base generation as much as possible.
· Developers who are building systems and need to quickly configure their system for testing or initial deployment.
· Researchers concentrating on other aspects of speech recognition or spoken language systems and just need something that works.
lmtool is best for small domains, application prototyping and small-scale experimentation with ASR systems. It has been developed primarily for use with the Sphinx recognition system and takes into account its characteristics. Nevertheless, the language model and pronunciation formats are the same as those used by many, if not most, other research recognition systems, so you should be able to use lmtool to generate knowledge bases for those systems as well.
· Building models from large corpora, either in vocabulary size or in the amount of training data available. “Large” is a relative term; you should think of it as some function of vocabulary size and the number of distinct n-grams in your corpus. Lmtool will tell you if it believes your submission is too large.
· When you need finer control over model generation, such as using different discounting schemes, or need access to intermediate products, such as count tables.
Currently the following two toolkits are in widespread use: the CMU-CU slmtk (open source) and the SRILM (free, but under license). There are several other language modeling toolkits available that are specialized for working with very large corpora. However, most of these are not used in speech recognition but in other applications that use language models, for example statistical machine translation.
lmtool was developed in the Speech Group at Carnegie Mellon University in the early 1990s as a convenient way to quickly generate language models for experiments using the Sphinx decoder. It was used to generate language models for very small domains, particularly ones for spoken language interfaces. Since then it has been extended a number of times and has generally been updated to track the needs of its users. Alexander Rudnicky built the initial version and has been maintaining it since then.
You need a corpus, which in this case means a set of sentences (or more precisely,
utterances) that you expect your recognition system to be able to handle (i.e.,
what people might reasonably want to say to your system).
The corpus needs to be in the form of an ASCII text file, with one sentence to a line. Upload this file, click the compile button, and your language model and dictionary will appear. You do need to prepare the text according to certain conventions. See Conditioning, below.
As you gain experience with your application you may notice that some of the pronunciations generated by lmtool are not quite right or are different from how you or your users pronounce the words. The advanced tool allows you to also upload your own custom dictionary, whose pronunciations will override the standard ones. Earlier versions of the advanced tool gave the user control over additional aspects of compilation. This dates from a time when there was less standardization in Sphinx and, generally, in ASR. Many of these options are no longer of use and will be removed in newer versions.
It is most appropriate for generating English-language models. There is a current limit of 6000 words for the corpus. If you need non-English models you may still be able to use the tool, either by downloading quicklm or by using the handdict option (see below). Otherwise we recommend that you use one of the standard tools described above.
To produce satisfactory models you should observe the following guidelines:
· Avoid the use of mixed-case text, specifically different versions of the same word. "Black" and "black" will be treated as different words. On the other hand, if "Black" is indeed a different word, say a proper noun, you may want to retain the distinction.
· Remove punctuation, since it can introduce irrelevant word variants, e.g., "therefore," and "therefore" will be treated as different words.
· Enter all numbers as the words you expect to be spoken, e.g., "23" should be "twenty three". Otherwise your number will be rendered as a digit sequence, e.g., "two three".
· Text conditioning is unfortunately a bit of a black art, and correct interpretation often depends on context. For example, "id" could be an abbreviation for "identification"; then again it could be a technical term in a discussion of psychoanalysis. There are at least two problems: the two words will have different pronunciations, and they might have very different distributional characteristics if both occur in the same corpus. You want to avoid such ambiguities.
Please see the page on Conditioning for additional discussion and suggestions. Note that lmtool will not do any conditioning for you, as it doesn’t want to make any assumptions about what you are trying to do.
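Since lmtool does no conditioning itself, the guidelines above have to be applied before upload. The following is a minimal Python sketch of how that might be done; the function name condition_line and its behavior are assumptions made for illustration and are not part of lmtool. It lowercases text (optional, in case you want to keep a proper noun such as "Black" distinct), replaces punctuation with spaces while keeping apostrophes, and only flags tokens containing digits, because the right spelled-out form depends on context.

```python
import re
import string

# Hypothetical helper, not part of lmtool.
# Map every punctuation character except the apostrophe to a space,
# so "twenty-three" becomes "twenty three" and "don't" is preserved.
_PUNCT_TO_SPACE = {ord(c): " " for c in string.punctuation if c != "'"}


def condition_line(line, lowercase=True):
    """Return (cleaned_line, tokens_needing_review) for one corpus sentence."""
    text = line.strip()
    if lowercase:
        text = text.lower()
    text = text.translate(_PUNCT_TO_SPACE)
    tokens = text.split()
    # Digits are flagged rather than rewritten: "23" might be
    # "twenty three" or "two three" depending on how it is spoken.
    needs_review = [t for t in tokens if re.search(r"\d", t)]
    return " ".join(tokens), needs_review


if __name__ == "__main__":
    cleaned, review = condition_line("Therefore, call Black at 23.")
    print(cleaned)
    print(review)
```

Flagged tokens should be replaced by hand with the words you expect users to say.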
The language model used in a decoder captures the constraints inherent in the word sequences found in a corpus. This information is used to constrain search and, as a result, significantly improves recognition accuracy. It follows that using a language model that does not correspond to your target domain will result in very poor recognition performance (not only because sequence statistics are incorrect but also because the vocabulary will be incomplete). Note that the lmtool model is statistical.
It's the customary way to configure the Sphinx decoder. Many commercial decoders use finite-state grammars (FSG) to perform the same function; these are also referred to as language models, but they are very different in conception and in operation. Sphinx can use FSGs; currently you need to figure out the details on your own.
The language model file is plain text. The format is the commonly used "arpa" format, which is standard in speech recognition research. It lists 1-, 2- and 3-grams. Each entry gives a log probability (the first field), then the n-gram itself, and then, for n-grams that can be extended to a longer one, a back-off weight (the last field).
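To make the file layout concrete, here is a small Python sketch that reads ARPA-style text into a dictionary. The function parse_arpa and the SAMPLE model are assumptions written for illustration (the numbers are made up, and this is not code from lmtool or Sphinx); the sketch only relies on the general ARPA layout of a header, one section per n-gram order, and an optional trailing back-off weight on each entry.

```python
# Made-up miniature model in ARPA layout, for illustration only.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.7782 <s> -0.3010
-0.4771 hello -0.3010
-0.4771 </s> 0.0000

\\2-grams:
-0.3010 <s> hello
-0.3010 hello </s>

\\end\\
"""


def parse_arpa(text):
    """Parse ARPA-format text into {order: {ngram_tuple: (logprob, backoff)}}."""
    models = {}
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the header that lists n-gram counts.
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        # Section markers look like "\1-grams:", "\2-grams:", ...
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            models[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        logprob = float(fields[0])            # first field: log probability
        words = tuple(fields[1:1 + order])    # then the n-gram itself
        # The back-off weight is optional; highest-order entries omit it.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        models[order][words] = (logprob, backoff)
    return models


if __name__ == "__main__":
    lm = parse_arpa(SAMPLE)
    print(lm[1][("hello",)])
    print(lm[2][("<s>", "hello")])
```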
lmtool uses the quicklm statistical language modeler. quicklm was designed to create suitable language models from very little data, say corpora smaller than about 50,000 words. If you have a larger corpus available you should be using one of the standard tools mentioned above. quicklm assumes that n-gram counts computed from small corpora are inherently unreliable and consequently uses a discounting scheme not tied to counts and which applies a uniform "ratio" discount to all n-grams. The ratio can be varied to control the "looseness" of the model. Experiments indicate that the resulting models produce better recognition performance than standard approaches (but, again, only for very small corpus sizes).
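The idea of a uniform ratio discount can be illustrated with a short Python sketch. This is not quicklm's code: the function names, the bigram-only scope and the default ratio of 0.5 are all assumptions chosen for illustration. Every seen bigram keeps the same fraction (1 - discount) of its relative frequency regardless of its raw count, and the held-back mass is redistributed to unseen successors through the unigram model.

```python
from collections import Counter


def ratio_discount_bigrams(sentences, discount=0.5):
    """Bigram back-off model with a uniform ratio discount (illustrative only)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    total = sum(unigrams.values())
    p_uni = {w: c / total for w, c in unigrams.items()}

    # How often each word occurs as the first word of a bigram.
    history = Counter()
    for (h, _), c in bigrams.items():
        history[h] += c

    # The same ratio is removed from every seen bigram, whatever its count.
    p_bi = {(h, w): (1 - discount) * c / history[h]
            for (h, w), c in bigrams.items()}

    # Back-off weight: spread the held-back mass over unseen successors.
    backoff = {}
    for h in history:
        seen_mass = sum(p_uni[w] for (h2, w) in bigrams if h2 == h)
        backoff[h] = discount / (1 - seen_mass)
    return p_uni, p_bi, backoff


def prob(word, hist, p_uni, p_bi, backoff):
    """P(word | hist): use the bigram if seen, else back off to the unigram."""
    if (hist, word) in p_bi:
        return p_bi[(hist, word)]
    return backoff.get(hist, 1.0) * p_uni.get(word, 0.0)


if __name__ == "__main__":
    p_uni, p_bi, bo = ratio_discount_bigrams(["go home", "go away"])
    print(prob("home", "go", p_uni, p_bi, bo))
    print(prob("go", "go", p_uni, p_bi, bo))
```

Raising the discount moves more probability to unseen word sequences, which is one way to read the "looseness" mentioned above.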
Yes. You can download it; it is open source. Once you get it and look at the code you will realize that while it is quite a terse implementation of a language modeler, it is also quite inefficient. This is another reason to use a standard toolkit if your corpus starts to get big.
lmtool will try to give you a pronunciation for every word-like string in your corpus (by word we mean any sequence of characters delimited by whitespace). It does this in two ways, by querying a large pronunciation dictionary (cmudict) and by applying letter-to-sound (LtoS) rules to words not found in the dictionary. Actually, if a word is not found in the dictionary, lmtool first tries to apply a few simple affix rules to the unknown word to see if there's a baseform available that can be deterministically modified (for example, many plurals or possessives are predictable). If no dictionary-based pronunciation is available, lmtool applies LtoS rules. These rules are based on systematic patterns in the English language (such as they are).
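The lookup order described above (dictionary, then affix rules on a known baseform, then letter-to-sound rules) can be sketched in Python as follows. Everything here is a toy stand-in written for illustration: TOY_DICT is a three-word substitute for cmudict, the affix rule covers only regular plurals and possessives, and TOY_LTOS is a crude one-letter-to-one-phone table rather than lmtool's actual rule set.

```python
# Toy stand-ins; the real tool consults cmudict and a full LtoS rule set.
TOY_DICT = {"cat": "K AE T", "dog": "D AO G", "bus": "B AH S"}
TOY_LTOS = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "AE B K D EH F G HH IH JH K L M N AA P K R S T AH V W K Y Z".split(),
))

VOICELESS = {"P", "T", "K", "F", "TH"}
SIBILANT = {"S", "Z", "SH", "ZH", "CH", "JH"}


def plural_suffix(base_pron):
    """Pick the regular plural/possessive ending from the baseform's last phone."""
    last = base_pron.split()[-1]
    if last in SIBILANT:
        return "IH Z"
    return "S" if last in VOICELESS else "Z"


def pronounce(word, dictionary=TOY_DICT):
    """Return (pronunciation, source) using dictionary, affix rule, then LtoS."""
    w = word.lower()
    if w in dictionary:
        return dictionary[w], "dictionary"
    # Affix rule: possessive or plural of a baseform that is in the dictionary.
    for suffix in ("'s", "es", "s"):
        base = w[: -len(suffix)]
        if w.endswith(suffix) and base in dictionary:
            pron = dictionary[base]
            return f"{pron} {plural_suffix(pron)}", "affix"
    # Last resort: letter-to-sound guess, one phone per known letter.
    phones = [TOY_LTOS[ch] for ch in w if ch in TOY_LTOS]
    return " ".join(phones), "ltos"


if __name__ == "__main__":
    for token in ("cat", "dogs", "buses", "zab"):
        print(token, pronounce(token))
```

The "ltos" results are the ones most worth checking by hand, since a letter-by-letter guess is the least reliable of the three sources.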
The two main ones to bear in mind are: 1) word tokens are expected to contain fewer than 35 characters. 2) a token may contain only ASCII characters. These limitations are present because lmtool is currently Anglo-centric. Non-English words may often result in bad pronunciation guesses. This includes loan words (“entrepreneur”), proper names (“Megher”) and technical terms (such as found in biology or medicine). Although this does not address the pronunciation issue, we are working to extend the system to be able to deal with UTF-8 text and to include other languages.
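Both limits are easy to check before uploading a corpus. The Python sketch below is a hypothetical pre-flight check, not something lmtool provides; the name problem_tokens is an assumption, and the threshold simply restates the "fewer than 35 characters" limit given above.

```python
# Tokens are expected to contain fewer than this many characters.
MAX_TOKEN_CHARS = 35


def problem_tokens(corpus_text):
    """Return (line_number, token, reason) for tokens likely to be rejected."""
    problems = []
    for line_no, line in enumerate(corpus_text.splitlines(), start=1):
        # A token is any whitespace-delimited string.
        for token in line.split():
            if len(token) >= MAX_TOKEN_CHARS:
                problems.append((line_no, token, "too long"))
            elif not token.isascii():
                problems.append((line_no, token, "non-ASCII"))
    return problems


if __name__ == "__main__":
    sample = "hello world\ncafé " + "x" * 35
    for issue in problem_tokens(sample):
        print(issue)
```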
First, note that in English the relationship between orthography and pronunciation is rather complicated. There are a variety of interesting reasons for this, but unfortunately a full discussion is beyond the scope of the current document.
If you encounter this problem and if you know the correct pronunciation, you should add it to a custom dictionary and lmtool will use it (you can do this on the lmtool-advanced page). Of course the simplest reason for a bad pronunciation is just that the dictionary has an error in it (it was created by humans); in this case we would appreciate it if you could let us know about it so that we can fix the entry.
The handdict is a custom pronunciation dictionary that users can optionally upload from the lmtool-advanced page; its entries override the pronunciations generated by the system. You may have several different reasons for doing this. For example, your corpus may contain foreign loan words that are absent from the dictionary and that are not rendered correctly by the LtoS rules. Or you may have terms that exceed the character-count limit for word tokens (35).
For more information please contact the maintainer, Alex Rudnicky (air at cs cmu edu).