Mod9 Automatic Speech Recognition (ASR) Engine

The Mod9 ASR Engine consists of software and models for automatic speech recognition (ASR, a.k.a. speech-to-text).

  • Software:
    • Client: application developers can use a TCP reference, a Python SDK, a REST API, or a WebSocket interface.
    • Server: the Engine is a single binary file, statically compiled from C++ using Kaldi, Boost, and TensorFlow.
  • Models:
    • ASR: English and Spanish are included; or bring your own models trained with Kaldi (e.g. Vosk, Zamia).
    • NLP: improve the readability of transcripts with punctuation, capitalization, and number conversion.

The Engine software enables unparalleled customization to improve upon the benchmark accuracy of the ASR models.
It also offers advanced features, such as uniquely detailed results that can help mitigate transcription errors.

This is on-premise software (not a cloud-hosted service), intended for deployment by operators who need to ensure data privacy, maximize batch processing throughput, minimize real-time streaming latency, or simply reduce costs.

The Engine is neither free nor open-source, but its licensing does enable anonymous evaluation of a public Docker image.

Quick start

# Download the Docker image which packages the Engine software with several models (~3GB).
docker pull mod9/asr

# Run the Engine as a container, mapping its default TCP port (9900) from the host.
docker run -p 9900:9900 mod9/asr

The Engine operates as a TCP service and is designed for convenient ad-hoc usage with command-line tools:

# Basic protocol: send JSON-encoded request options, followed by a stream of audio data.
# (Set $AUDIO_URL to the address of an audio file to transcribe.)
(echo '{"transcript-formatted":true}'; curl -s "$AUDIO_URL") | nc localhost 9900
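The same wire protocol can be sketched with Python's standard socket module. This is a minimal illustration assuming only what the shell examples show: one JSON line of request options, then raw audio bytes, then newline-delimited JSON replies from the Engine. The function name and the reply fields are not part of any official SDK.

```python
import json
import socket

def transcribe(audio, options, host="localhost", port=9900):
    """Send JSON-encoded request options followed by audio; return the parsed reply lines."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(json.dumps(options).encode() + b"\n")  # options first, as one JSON line
        sock.sendall(audio)                                 # then the raw audio stream
        sock.shutdown(socket.SHUT_WR)                       # half-close: no more audio to send
        raw = b"".join(iter(lambda: sock.recv(4096), b""))  # read replies until the server closes
    return [json.loads(line) for line in raw.splitlines() if line.strip()]
```

For example, `transcribe(open("SW_4824_B.wav", "rb").read(), {"transcript-formatted": True})` would mirror the nc one-liner above.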

Real-time streaming and batch processing

For live demonstration on a Mac (using brew to install SoX), stream audio from the microphone:

# Install SoX, a tool that can read from an audio device (-d) and convert to 1-channel 16-bit WAV (-c1 -b16 -twav).
brew install sox

# The "partial" request option will output each word as it is recognized.
(echo '{"partial":true}'; sox -qV1 -d -c1 -b16 -twav - ) | nc localhost 9900

In this online mode, the Engine is suitable for real-time streaming; the example above should only use about 10-20% CPU. In batch mode, processing is distributed across CPU threads to improve speed and throughput with pre-recorded files. For example, a 5-minute phone call (SW_4824 in the Switchboard benchmark) can be transcribed in seconds:
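In the online mode, a client must read partial results while it is still sending audio. A full-duplex Python sketch of that pattern, under the same assumptions as before (newline-delimited JSON replies, as the raw nc output suggests; the function name is illustrative only):

```python
import json
import socket
import threading

def stream(audio_chunks, options, host="localhost", port=9900):
    """Feed audio chunks from a background thread while reading each JSON reply as it arrives."""
    sock = socket.create_connection((host, port))
    sock.sendall(json.dumps(options).encode() + b"\n")  # request options, as one JSON line

    def feed():
        for chunk in audio_chunks:
            sock.sendall(chunk)
        sock.shutdown(socket.SHUT_WR)  # signal end of audio

    threading.Thread(target=feed, daemon=True).start()
    replies = [json.loads(line) for line in sock.makefile("rb")]  # lines arrive incrementally
    sock.close()
    return replies
```

With `{"partial": True}` in the options and a generator yielding 16-bit WAV chunks (e.g. from a microphone), each recognized word would be available as soon as the Engine emits its reply line.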

# Download an audio file; then send it to the public evaluation server, requesting an optimized ASR model.
# (Set $AUDIO_URL to the file's address, and $EVAL_HOST to the evaluation server's hostname.)
curl -O "$AUDIO_URL"
(echo '{"batch-threads":10,"asr-model":"en-US_phone-benchmark"}'; cat SW_4824_B.wav) | nc "$EVAL_HOST" 9900

Note that the example above uses the evaluation server rather than localhost. This public server has high compute capacity, and is suitable for handling non-sensitive test data sent over unencrypted TCP.

DevOps and model packaging

The shell-oriented examples above are further described in the TCP reference, which allows an application developer to interact directly with the Engine using standard network sockets. Several high-level wrappers are also provided: a Python SDK, a REST API, and a WebSocket interface.

In contrast to a client who interacts with the Engine, an operator is responsible for deploying and maintaining it as a server. We provide a deployment guide with further documentation addressed to that role.

A third role is that of the model packager. Documentation of the expected model structure is provided so that the Engine may load any ASR models that are trained or built using Kaldi software. This generally requires specialized ASR expertise.

Example: model customization

# Use 2 ASR models: both are US telephony models; the second is faster but less accurate.
# (Set $ASR1 and $ASR2 to the models' names, e.g. ASR1=en-US_phone-benchmark as used above.)

# Run a Docker container, named "engine", listening on the host's forwarded port 9900.
# The --models.asr option will load the first listed model as the default ASR model.
# The --models.mutable option will enable clients to request dynamic changes to the models.
docker run -d --name=engine -p 9900:9900 mod9/asr engine --models.asr=$ASR1,$ASR2 --models.mutable

# Copy a test file that is packaged in the Engine's Docker container: it's someone talking about cars.
docker cp engine:/var/www/html/SW_4824_B.wav .

# The default model correctly recognizes "honda" three times, and is fast when batched over 10 threads.
(echo '{"batch-threads":10}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# The second model is faster, even when limited to 1 thread; but it misses recognizing "honda" once.
(echo '{"batch-threads":1,"asr-model":"'$ASR2'"}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# However, for a known topic of conversation, this model can be biased to recognize relevant words.
nc localhost 9900 <<< '{"command":"bias-words","asr-model":"'$ASR2'","words":[{"word":"honda","bias":5}]}'

# Try again: now the biased second model can correctly recognize all three instances of the word!
(echo '{"batch-threads":1,"asr-model":"'$ASR2'"}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# Send a SIGTERM signal, handled by the Engine as a graceful shutdown, before removing the container.
docker stop engine && docker rm engine

As shown in this example, models can be biased to favor (or disfavor) certain words. Another way to customize a model is to add new words with specific pronunciations. These customizations happen on the fly, and can even be used to improve results while a real-time audio stream is being recognized.
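The bias-words command from the example above can also be issued programmatically. A minimal Python sketch, in which the request JSON is exactly the one shown in the shell example; since the shape of the Engine's reply is not documented here, the raw reply line is returned as-is:

```python
import json
import socket

def bias_words(words, asr_model, host="localhost", port=9900):
    """Send a bias-words command, e.g. bias_words([{"word": "honda", "bias": 5}], model_name)."""
    request = {"command": "bias-words", "asr-model": asr_model, "words": words}
    with socket.create_connection((host, port)) as sock:
        sock.sendall(json.dumps(request).encode() + b"\n")
        return sock.makefile("rb").readline()  # the Engine's reply, one raw line
```

Because the bias takes effect immediately on a mutable model, this could be called between (or even during) recognition requests.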

When the possible word sequences are highly constrained, as in directed-dialog or command-and-control use cases, a custom grammar can also be specified with each request. This may also be suitable for embedded devices.

More information

This documentation also describes some additional endpoints that might be convenient.

To learn more about the Mod9 ASR Engine and its breadth of capabilities, contact Mod9 Technologies for a demonstration.

©2019-2021 Mod9 Technologies (Version 1.0.1)