
Mod9 Automatic Speech Recognition (ASR) Engine

The Mod9 ASR Engine consists of software and models for automatic speech recognition (ASR, a.k.a. speech-to-text).
It is built using open-source libraries including Kaldi, Boost, and TensorFlow.

Examples

The core software operates as a TCP service, implementing an application-level protocol designed for convenient ad-hoc usage with standard command-line tools. For example, to transcribe a short test file (hi.wav):

(echo '{"transcript-formatted":true}'; curl https://mod9.io/hi.wav) | nc mod9.io 9900

The basic protocol is to send one line of JSON encoding the request options, followed by a stream of audio data. In these examples, nc makes a TCP connection to a publicly accessible server at mod9.io, listening on port 9900.
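The same protocol can be implemented in a few lines of any language with standard network sockets. Below is a minimal Python sketch; it assumes that the Engine's replies are newline-delimited JSON and that half-closing the write side of the socket signals the end of the audio stream.

import json
import socket

HOST, PORT = "mod9.io", 9900  # the publicly accessible demonstration server

with socket.create_connection((HOST, PORT)) as conn:
    # Send one line of JSON encoding the request options ...
    conn.sendall(b'{"transcript-formatted":true}\n')

    # ... followed by a stream of audio data (e.g. hi.wav, downloaded from https://mod9.io/hi.wav).
    with open("hi.wav", "rb") as audio:
        while chunk := audio.read(8192):
            conn.sendall(chunk)
    conn.shutdown(socket.SHUT_WR)  # no more audio will be sent

    # Read the Engine's replies until it closes the connection.
    with conn.makefile("r", encoding="utf-8") as replies:
        for line in replies:
            print(json.loads(line))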

The Engine achieves state-of-the-art ASR accuracy at highly cost-effective speed and scale. By batch-processing across multiple threads, it can transcribe a 5-minute phone call (SW_4824 in the Switchboard benchmark) in seconds:

(echo '{"batch-threads":10}'; curl https://mod9.io/SW_4824_B.wav) | nc mod9.io 9900

The default model is ideal for US English telephony; a more robust 16k model recognizes audio sampled at 16,000 Hz. For a live demonstration on a Mac (using brew to install SoX), stream audio from the microphone:

brew install sox
(echo '{"model":"16k","partial":true}'; sox -qV1 -d -twav -r16000 -c1 -b16 - ) | nc mod9.io 9900

Software interfaces

The shell-oriented examples above are further described in the TCP reference, which allows an application developer to interact with the Engine using standard network sockets. Several high-level wrappers are also provided, including a Python package, a REST API, and a WebSocket interface.

In contrast to a client, who interacts with the Engine, an operator is responsible for deploying and maintaining it as a server. We provide a server operator manual with further documentation written from that role's perspective.

Deployment and customization

Software and models are packaged in a Docker image; the statically compiled binary can run in any Linux environment. Intended for on-premises installation in a private network, the Engine requires no network connectivity for license management.

The latest version can be deployed with advanced options enabled. For example, it could be run as follows:

# Download a Docker image including software and models.
# Please contact sales@mod9.com to request access to this private Docker repository.
docker pull mod9/asr

# Run it as a Docker container, named "engine", listening on the host's forwarded port 9900.
# Load two models: "8k" will be the default; "8k-smaller" will be faster but less accurate.
# The --models.mutable option will enable clients to request dynamic changes to the models.
docker run -d --name=engine -p 9900:9900 mod9/asr engine --models.mutable 8k 8k-smaller

# Copy a test file that is packaged with the Engine: it's someone talking about cars.
docker cp engine:/opt/mod9-asr/htdocs/SW_4824_B.wav .

# The default model correctly recognizes "honda" three times, and is fast when batched over 10 threads.
(echo '{"batch-threads":10}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# The 8k-smaller model is faster, even when limited to 1 thread; but it misses one instance of "honda".
(echo '{"batch-threads":1,"model":"8k-smaller"}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# However, for a known topic of conversation, the model should be biased to recognize relevant words.
nc localhost 9900 <<< '{"command":"bias-words","model":"8k-smaller","words":[{"word":"honda","bias":5}]}'

# Try again: now the biased 8k-smaller model can correctly recognize all three instances of the word!
(echo '{"batch-threads":1,"model":"8k-smaller"}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# Send a SIGTERM signal, handled by the Engine as a graceful shutdown, before removing the container.
docker stop engine && docker rm engine

As shown in this example, models can be biased to favor (or disfavor) certain words. Another way to customize a model is to add new words with specific pronunciations. These customizations happen on-the-fly, and can even be used to improve results while a real-time audio stream is being recognized.
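As a sketch of how such customization might be requested programmatically, the bias-words command shown above could also be sent over a raw socket, here with a negative value that is presumed to disfavor the word (assuming an Engine running on localhost and newline-delimited JSON replies):

import json
import socket

request = {
    "command": "bias-words",
    "model": "8k-smaller",
    "words": [{"word": "honda", "bias": -5}],  # a negative bias to disfavor the word
}

with socket.create_connection(("localhost", 9900)) as conn:
    # Send the command as one line of JSON, then read the reply.
    conn.sendall((json.dumps(request) + "\n").encode())
    with conn.makefile("r", encoding="utf-8") as replies:
        for line in replies:
            print(json.loads(line))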

When the possible word sequences are highly constrained, as in directed-dialog or command-and-control use cases, a custom grammar can also be specified with each request. This may also be suitable for embedded devices.

To learn more about the Mod9 ASR Engine and its breadth of capabilities, contact sales@mod9.com for a demonstration.


©2019-2021 Mod9 Technologies (Version 0.9.0)