
Mod9 Automatic Speech Recognition (ASR) Engine

The Mod9 ASR Engine consists of software and models for automatic speech recognition (ASR, a.k.a. speech-to-text).
It is built using open-source libraries including Kaldi, Boost, and TensorFlow.

Examples

The core software operates as a TCP service, implementing an application-level protocol designed for convenient ad-hoc usage with standard command-line tools. For example, to transcribe a short test file (hi.wav):

(echo '{"transcript-formatted":true}'; curl https://mod9.io/hi.wav) | nc mod9.io 9900

The basic protocol is to send one line of JSON encoding the request options, followed by a stream of audio data. In these examples, nc makes a TCP connection to a publicly accessible server at mod9.io, listening on port 9900.
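The same protocol can be implemented in a few lines of any language with standard network sockets. Below is a minimal Python sketch; it assumes that the Engine's replies are newline-delimited JSON and that half-closing the write side of the socket signals the end of the audio stream.

import json
import socket

HOST, PORT = "mod9.io", 9900  # the publicly accessible demonstration server

with socket.create_connection((HOST, PORT)) as conn:
    # Send one line of JSON encoding the request options ...
    conn.sendall(b'{"transcript-formatted":true}\n')

    # ... followed by a stream of audio data (e.g. hi.wav, downloaded from https://mod9.io/hi.wav).
    with open("hi.wav", "rb") as audio:
        while chunk := audio.read(8192):
            conn.sendall(chunk)
    conn.shutdown(socket.SHUT_WR)  # no more audio will be sent

    # Read the Engine's replies until it closes the connection.
    with conn.makefile("r", encoding="utf-8") as replies:
        for line in replies:
            print(json.loads(line))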

The Engine achieves state-of-the-art ASR accuracy at highly cost-effective speed and scale. By batch-processing across multiple threads, it can transcribe a 5-minute phone call (SW_4824 in the Switchboard benchmark) in seconds:

(echo '{"batch-threads":10}'; curl https://mod9.io/SW_4824_B.wav) | nc mod9.io 9900

The default model is ideal for US English telephony; a more robust 16k model recognizes audio sampled at 16,000 Hz. For a live demonstration on a Mac (using brew to install SoX), stream audio from the microphone:

brew install sox
(echo '{"model":"16k","partial":true}'; sox -qV1 -d -twav -r16000 -c1 -b16 - ) | nc mod9.io 9900

Software interfaces

The shell-oriented examples above are further described in the TCP reference, which allows an application developer to interact with the Engine using standard network sockets. Several high-level wrappers are also provided, including a Python package, a REST API, and a WebSocket interface.

In contrast to a client, who interacts with the Engine, an operator is responsible for deploying and maintaining it as a server. We provide a server operator manual with further documentation written from that role's perspective.

Deployment and customization

Software and models are packaged in a Docker image; the statically compiled binary can run in any Linux environment. Intended for on-premises installation in a private network, the Engine requires no network connectivity for license management.

The latest version can be deployed with advanced options enabled. For example, it could be run as follows:

# Download a Docker image including software and models.
# Please contact sales@mod9.com to request access to this private Docker repository.
docker pull mod9/asr

# Run it as a Docker container, named "engine", listening on the host's forwarded port 9900.
# Load two models: "8k" will be the default; "8k-smaller" will be faster but less accurate.
# The --models.mutable option will enable clients to request dynamic changes to the models.
docker run -d --name=engine -p 9900:9900 mod9/asr engine --models.mutable 8k 8k-smaller

# Copy a test file that is packaged with the Engine: it's someone talking about cars.
docker cp engine:/opt/mod9-asr/htdocs/SW_4824_B.wav .

# The default model correctly recognizes "honda" three times, and is fast when batched over 10 threads.
(echo '{"batch-threads":10}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# The 8k-smaller model is faster, even when limited to 1 thread; but it misses one instance of "honda".
(echo '{"batch-threads":1,"model":"8k-smaller"}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# However, for a known topic of conversation, the model should be biased to recognize relevant words.
nc localhost 9900 <<< '{"command":"bias-words","model":"8k-smaller","words":[{"word":"honda","bias":5}]}'

# Try again: now the biased 8k-smaller model can correctly recognize all three instances of the word!
(echo '{"batch-threads":1,"model":"8k-smaller"}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# Send a SIGTERM signal, handled by the Engine as a graceful shutdown, before removing the container.
docker stop engine && docker rm engine

As shown in this example, models can be biased to favor (or disfavor) certain words. Another way to customize a model is to add new words with specific pronunciations. These customizations happen on-the-fly, and can even be used to improve results while a real-time audio stream is being recognized.
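As a sketch of how such customization might be requested programmatically, the bias-words command shown above could also be sent over a raw socket, here with a negative value that is presumed to disfavor the word (assuming an Engine running on localhost and newline-delimited JSON replies):

import json
import socket

request = {
    "command": "bias-words",
    "model": "8k-smaller",
    "words": [{"word": "honda", "bias": -5}],  # a negative bias to disfavor the word
}

with socket.create_connection(("localhost", 9900)) as conn:
    # Send the command as one line of JSON, then read the reply.
    conn.sendall((json.dumps(request) + "\n").encode())
    with conn.makefile("r", encoding="utf-8") as replies:
        for line in replies:
            print(json.loads(line))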

When the possible word sequences are highly constrained, as in directed-dialog or command-and-control use cases, a custom grammar can also be specified with each request. This may also be suitable for embedded devices.

To learn more about the Mod9 ASR Engine and its breadth of capabilities, contact sales@mod9.com for a demonstration.


©2019-2021 Mod9 Technologies (Version 0.9.0)