Model Packaging | Mod9 ASR Engine

Mod9 ASR Engine: Model Packaging

The Mod9 ASR Engine uses models for speech recognition (converting audio to text), pronunciation generation, and natural language processing (punctuation, capitalization, number formatting, and disfluency removal).

Automatic speech recognition (ASR) models are compatible with the Kaldi open-source toolkit.
More than 50 ASR models (20 languages and dialects) have been packaged at /opt/mod9-asr/models/asr.

Grapheme-to-phoneme (G2P) and natural language processing (NLP) models are a combination of TensorFlow, OpenFST, and other custom models; these are generally compatible only with the Mod9 ASR Engine.

Below we describe the formats and layout required to use models with the Engine.

Directory layout

Models must be stored in a directory accessible to the Engine at run time. The default directory is /opt/mod9-asr/models/, but a different location can be specified with the --models.path option when you start the Engine. The models directory should contain subdirectories named asr/, g2p/, and nlp/: automatic speech recognition models are stored under asr/, grapheme-to-phoneme conversion models under g2p/, and natural language processing models under nlp/. Each model is a further subdirectory under one of these. For example, the en-US_phone ASR model, which performs speech recognition on US English telephone data, consists of files and directories under /opt/mod9-asr/models/asr/en-US_phone/ (which is more specifically an alias to /opt/mod9-asr/models/asr/mod9/en-US_phone/).
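As a rough illustration of the layout described above, the following Python sketch builds the directory path for a given model. The helper name model_dir and its parameters are hypothetical and are not part of the Engine; only the default root and the asr/, g2p/, and nlp/ subdirectory names come from the text.

```python
from pathlib import PurePosixPath

# Default models directory named in the text; --models.path can change it.
DEFAULT_MODELS_ROOT = PurePosixPath("/opt/mod9-asr/models")

# The three subdirectories expected below the models directory.
MODEL_KINDS = ("asr", "g2p", "nlp")


def model_dir(kind, name, models_root=DEFAULT_MODELS_ROOT):
    """Return the directory that would hold model `name` of type `kind`.

    Hypothetical helper for illustration; it only joins path components.
    """
    if kind not in MODEL_KINDS:
        raise ValueError(f"unknown model kind: {kind!r}")
    return PurePosixPath(models_root) / kind / name


print(model_dir("asr", "en-US_phone"))
# prints /opt/mod9-asr/models/asr/en-US_phone
```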

Automatic Speech Recognition (ASR) models

The Mod9 ASR Engine can use automatic speech recognition models that are compatible with models produced and used by the Kaldi toolkit. By default, the Engine looks for files in very specific locations and sets various configuration variables to sensible default values for most Kaldi models. You can, however, override the defaults (typically by creating or modifying conf/model.conf, described below).

In the sections below, files and directories are listed relative to the model directory described above (e.g. /opt/mod9-asr/models/asr/en-US_phone/).

Required ASR model files

Only 5 files are absolutely required to run an ASR model on the Engine. The locations listed below are the defaults where the Engine looks for these files. See the section on conf/model.conf below for overriding the default locations.

  • am/final.mdl - The acoustic model. In Kaldi recipes, this will typically be stored at a path named something like exp/chain/tdnn7q/final.mdl. It consists of the architecture and parameters describing the neural network used for acoustic modeling. Note that only nnet3 and chain models are supported.

  • graph/HCLG.fst - The decode graph. In Kaldi recipes, this is usually stored in the graph directory, which is typically named something like exp/chain/tdnn7q/graph. It consists of an FST that represents the language model, lexicon, and state structure of the model.

  • graph/words.txt - A mapping between word IDs and word symbols. It will typically be in the same directory as HCLG.fst.

  • graph/phones/word_boundary.int - A mapping between phone IDs and where in a word the phone can occur. This can usually be found in phones/word_boundary.int under the same directory where HCLG.fst is stored.

  • conf/mfcc.conf - Feature configuration file. Although the Engine can theoretically operate without this file, the defaults will seldom be correct. We strongly recommend using conf/mfcc.conf. In Kaldi recipes, it can generally be found in conf/mfcc_hires.conf.
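Before starting the Engine with a newly packaged model, it can help to confirm that the files listed above are in place. The sketch below is a minimal check written for this document, not a tool shipped with the Engine; the function name missing_required_files is hypothetical, and it assumes the default locations (no overrides in conf/model.conf).

```python
from pathlib import Path

# Default locations of the required files, relative to the model directory.
REQUIRED_ASR_FILES = (
    "am/final.mdl",
    "graph/HCLG.fst",
    "graph/words.txt",
    "graph/phones/word_boundary.int",
    "conf/mfcc.conf",
)


def missing_required_files(model_dir):
    """Return the required files (relative paths) not found under model_dir."""
    root = Path(model_dir)
    return [rel for rel in REQUIRED_ASR_FILES if not (root / rel).is_file()]
```

For example, missing_required_files("/opt/mod9-asr/models/asr/en-US_phone") returns an empty list when all five files are present at their default locations.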

Optional ASR model files

There are several optional files. If these files are not provided, some functionality in the Engine may be disabled.

  • am/tree - A description of the low-level phonetic units used in the system. In Kaldi recipes, this would typically be found in the same directory as final.mdl. If tree is not provided, custom words and custom grammars cannot be used.

  • graph/phones.txt - A mapping between phone IDs and phone symbols. This can usually be found in the same directory as HCLG.fst. If phones.txt isn't provided, custom words and custom grammars cannot be used.

  • conf/model.conf - A Kaldi-style configuration file. There is no directly equivalent file in typical Kaldi recipes; rather, the same information is usually encoded in scripts. Full documentation on the file can be found in comments in the models that ship with the Engine (e.g. asr/en/conf/model.conf).

Of particular note, you can override where the Engine looks for files. For example, if the model de_small had a file /opt/mod9-asr/models/asr/de_small/conf/model.conf that contained the line --am=exp1/expanded.mdl, the Engine would look for the acoustic model in /opt/mod9-asr/models/asr/de_small/exp1/expanded.mdl rather than the default /opt/mod9-asr/models/asr/de_small/am/final.mdl.

  • metadata.json - A JSON file with various metadata used by the Engine. Most of the fields in this file are informational only; however, the "language" field is used to match automatic speech recognition models with natural language processing models. See models that ship with the Engine for examples (e.g. asr/en/metadata.json).
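To make the conf/model.conf override from the de_small example concrete, the sketch below shows one way the acoustic model location could be resolved. It illustrates the documented behavior and is not the Engine's actual parser: the function names are hypothetical, only the --am option comes from the text, and the handling of "#" comments is an assumption based on the usual Kaldi config-file convention.

```python
from pathlib import PurePosixPath


def read_overrides(conf_text):
    """Parse Kaldi-style '--name=value' lines into a dict (simplified)."""
    overrides = {}
    for raw in conf_text.splitlines():
        # Assumption: '#' starts a comment, as in Kaldi config files.
        line = raw.split("#", 1)[0].strip()
        if not line.startswith("--") or "=" not in line:
            continue
        name, value = line[2:].split("=", 1)
        overrides[name.strip()] = value.strip()
    return overrides


def acoustic_model_path(model_dir, conf_text=""):
    """Resolve the acoustic model path, honoring an --am override if present."""
    rel = read_overrides(conf_text).get("am", "am/final.mdl")
    return PurePosixPath(model_dir) / rel


print(acoustic_model_path("/opt/mod9-asr/models/asr/de_small",
                          "--am=exp1/expanded.mdl\n"))
# prints /opt/mod9-asr/models/asr/de_small/exp1/expanded.mdl
```

With an empty or absent conf/model.conf, the same function falls back to am/final.mdl under the model directory.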

Optional ivector subdirectory

ivectors are features related to speakers, and are very commonly used in research systems. Although they provide a statistically significant performance boost in many benchmarks, we find they tend to be brittle: when the audio isn't a very good match to the training data, accuracy can suffer. Also, if speakers aren't on their own channel, ivectors can fail spectacularly.

The Engine supports ivectors if the directory ivector exists (e.g. /opt/mod9-asr/models/asr/mod9/en-US_phone-ivector/ivector/) and all required files exist under that directory. In Kaldi recipes, the files can typically be found under a directory named something like exp/chain/extractor. The required files are final.dubm, final.ie, final.mat, and global_cmvn.stats.
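The condition above (the ivector directory exists and contains all four files) can be sketched as a simple check. This is an illustration only; the function name ivectors_available is hypothetical, and the Engine's own logic may differ in detail.

```python
from pathlib import Path

# Files the text lists as required under the ivector/ subdirectory.
IVECTOR_FILES = ("final.dubm", "final.ie", "final.mat", "global_cmvn.stats")


def ivectors_available(model_dir):
    """Return True if model_dir/ivector exists and holds every required file."""
    ivector_dir = Path(model_dir) / "ivector"
    return ivector_dir.is_dir() and all(
        (ivector_dir / name).is_file() for name in IVECTOR_FILES
    )
```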

Compression and encryption of model files

The Engine transparently supports gzip compression of most Kaldi model files. This is most useful for HCLG.fst, which can shrink by a factor of 2 with compression. Note that the gzip file format is automatically detected, so the default name for the graph is still HCLG.fst and not HCLG.fst.gz.
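As a sketch of how a file might be compressed while keeping its original name, the Python snippet below gzips a file in place. It relies on the fact that gzip data begins with the two bytes 0x1f 0x8b; how the Engine itself performs detection is not documented here, and the helper names is_gzip and gzip_in_place are hypothetical.

```python
import gzip
import os
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gzip_in_place(path):
    """Compress a file with gzip but keep its name (e.g. HCLG.fst)."""
    path = Path(path)
    if is_gzip(path):
        return  # already compressed; nothing to do
    tmp = path.with_name(path.name + ".tmp")
    with open(path, "rb") as src, gzip.open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.replace(tmp, path)  # swap the compressed copy into place
```

Writing to a temporary file and then renaming avoids leaving a half-written graph behind if compression is interrupted.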

Some files in models shipped by Mod9 have been encrypted. The Engine will transparently decrypt these files. As with compression, the file names do not need to be modified when they are encrypted; the encryption format is automatically detected. These files may not be used other than in accordance with the licensing terms.


©2019-2022 Mod9 Technologies (Version 1.9.5)