
LOGIOS Lexicon Tool
This tool generates a pronunciation dictionary suitable for
configuring speech recognition systems that conform to ARPA-derived
file formats. In particular it creates lexicons suitable for use with
the Sphinx system. It is a component of the Logios package, which
allows you to input a Phoenix grammar and receive a compiled grammar,
an n-gram language model and a pronouncing dictionary.
The tool currently uses cmudict.0.7b and produces pronunciations using
the (currently standard) 40-item phone inventory. Please note that the
dictionary may be updated from time to time, so your results may vary,
we hope in the direction of greater accuracy :-).
If you notice any errors in the output (such as a seemingly incorrect
pronunciation), please report them and we will look into them.
You can send reports to air:cs'cmu,|edu|.
An example

If your input file looks something like the left-hand column, your
output file will look something like the right-hand column:

Input          | Output
Hello          | HELLO HH EH L OW
               | HELLO(1) HH AH L OW
world          | WORLD W ER L D
compound_word  | COMPOUND_WORD K AA M P AW N D W ER D
hyphen-ated    | HYPHEN-ATED HH AY F AH N EY T IH D
ONE23          | ONE23 OW EH N IY T UW TH R IY
2008           | 2008 T UW Z IY R OW Z IY R OW EY T
boom!          | BOOM! B UW M
kweezlebotter  | KWEEZLEBOTTER K W IY Z L AH B AA T AH R

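The output shown in the example above (a headword followed by space-separated phones, with "(1)"-style markers for alternate pronunciations) can be read back into a program with a few lines of Python. The following is a minimal sketch, not part of the tool itself; the function name `parse_lexicon` and the regular expression are illustrative assumptions based only on the sample output.

```python
import re
from collections import defaultdict

# Matches a headword with an optional variant marker such as "(1)" at the end.
VARIANT = re.compile(r"^(?P<word>.+?)(?:\((?P<n>\d+)\))?$")

def parse_lexicon(lines):
    """Group pronunciations by base word, dropping any "(n)" variant suffix.

    Hypothetical helper: assumes one entry per line, with the headword first
    and the phones separated by whitespace, as in the sample output.
    """
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        head, phones = fields[0], fields[1:]
        base = VARIANT.match(head).group("word")
        lexicon[base].append(phones)
    return dict(lexicon)

sample = [
    "HELLO HH EH L OW",
    "HELLO(1) HH AH L OW",
    "WORLD W ER L D",
]
lexicon = parse_lexicon(sample)
for word, prons in lexicon.items():
    print(word, len(prons), "pronunciation(s)")
```

Grouping by base word makes it easy to see which entries have alternates before you decide whether to keep all of them.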
Please note the following:
- Some words may have multiple pronunciations; these will appear on
separate lines and will be differentiated by an instance id such as
"(1)". The current implementation of the Sphinx decoder expects each
dictionary entry to be unique. Note, however, that this tool does not
check for uniqueness, so if you include multiple instances of an input
word it will appear multiple times. As a rule, sort your input file and
remove duplicate words before you submit it.
- Words with internal separators such as "_" and "-" will be
rendered as a single word; the internal characters will be kept as part
of the orthographic element.
- Alphanumeric items, as well as numbers, will be rendered
character-by-character. This is because such items are ambiguous and
can be rendered several ways (e.g., "one two three", "one
twenty-three", etc.). It is your responsibility to determine how such
items will be spoken. Typically this will vary by domain.
- Punctuation marks will be ignored.
- Words that do not exist in the tool's dictionary will have their
pronunciations generated by letter-to-sound rules. There is no guarantee
that such a pronunciation will be correct. You are advised to check these before use.
- If you choose to alter pronunciations manually, be sure to follow the
output formatting and to use only phones from the legal set.
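Two of the checks described in the notes above, removing duplicate input words and confirming that hand-edited pronunciations use only legal phones, are easy to script. The sketch below is illustrative rather than part of the tool. It assumes that duplicates should be compared case-insensitively (the sample output upper-cases headwords), and it assumes a legal set consisting of the 39 ARPAbet symbols used by cmudict without stress digits; the tool's 40-item inventory may differ, so substitute the set your acoustic model expects. The helper names are hypothetical.

```python
# Assumption: the 39 ARPAbet symbols used by cmudict, without stress digits.
# Replace this with the phone inventory your acoustic model actually uses.
LEGAL_PHONES = frozenset("""
AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG
OW OY P R S SH T TH UH UW V W Y Z ZH
""".split())

def unique_sorted_words(words):
    """Return upper-cased input words, sorted, without blanks or duplicates."""
    return sorted({w.strip().upper() for w in words if w.strip()})

def illegal_phones(entry, legal=LEGAL_PHONES):
    """Return the phones in one dictionary line that are not in the legal set."""
    fields = entry.split()
    return [p for p in fields[1:] if p not in legal]

# Duplicates and blank lines are dropped before submission.
words = unique_sorted_words(["world", "Hello", "world", ""])
print(words)

# A generated entry from the example passes; a hand-edited one with a typo does not.
bad = illegal_phones("KWEEZLEBOTTER K W IY Z L AH B AA T AH R")
edited = illegal_phones("BOOM! B OO M")
print(bad, edited)
```

Running the phone check over every line you edit by hand catches typos such as "OO" before the dictionary reaches the decoder.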