Mod9 Automatic Speech Recognition (ASR) Engine

The Mod9 ASR Engine consists of software and models for automatic speech recognition (ASR, a.k.a. speech-to-text).

  • Software:
    • Server: the Engine is a single binary file, statically compiled from C++ using Kaldi, Boost, and TensorFlow.
    • Client: application developers can use a TCP reference, a Python SDK, a REST API, or a WebSocket interface.
  • Models:
    • ASR: English and Spanish are included; or bring your own models trained with Kaldi (e.g. Vosk, Zamia).
    • NLP: improve the readability of transcripts with punctuation, capitalization, and number conversion.

The Engine software enables dynamic customization to improve upon the benchmark accuracy of the ASR model.
It offers many advanced options to produce uniquely detailed results that can help mitigate transcription errors.

This is pure Software (not as a Service) intended for self-hosted deployment by operators who need to:

  • ensure data privacy (no cloud) and guarantee service availability (no downtime),
  • maximize processing speed (>100x real-time) and minimize streaming latency (<100ms),
  • or simply get started (no signup) with flexible licensing (free evaluation).

Installation

The software can be downloaded onto any Linux system, requiring no runtime dependencies:

# Download a tarball of the Engine binary and models, installed in /opt as root.
curl -sL mod9.io/mod9-asr.tar | sudo tar xv -C /opt

# Run the Engine executable.
/opt/mod9-asr/bin/engine

Alternatively, it can be obtained from a public Docker repository:

# This Docker image has several layers, totaling about 3GB.
docker pull mod9/asr

# Run the Engine as a container, mapping its default TCP port (9900) from the host.
docker run -p 9900:9900 mod9/asr

An advantage of the Docker image is that it facilitates deployment of the Engine as secure web services:

# Host-mounted SSL certificates issued by letsencrypt.org will be automatically created or renewed.
# The https-engine command starts an Apache web server listening on port 443 and will wrap the Engine
# as a REST API (https://example.com/rest/api) and WebSocket interface (wss://example.com).
docker run -v /etc/letsencrypt:/etc/letsencrypt -p 443:443 mod9/asr https-engine

Note: it's best practice to use a specifically versioned archive or image, e.g. mod9-asr-1.1.0.tar or mod9/asr:1.1.0.

Example: command-line usage

The Engine server was designed for convenient ad-hoc usage with command-line tools. For example, the nc TCP client can connect to a local Engine at its default port 9900, with the echo and cat commands used to form the request:

# Download a test audio file.
curl -O https://mod9.io/hi.wav

# Basic protocol: send JSON-encoded request options, followed by a stream of audio data.
(echo '{"transcript-formatted":true}'; cat hi.wav) | nc localhost 9900
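The same two-part protocol (a line of JSON-encoded options, followed by raw audio) can also be sketched with Python's standard library. The `build_request` and `recognize` helpers below are illustrative, not part of the SDK, and assume an Engine listening locally on port 9900:

```python
import json
import socket

def build_request(options: dict, audio: bytes) -> bytes:
    """Frame a request: one line of JSON-encoded options, then raw audio data."""
    return json.dumps(options).encode() + b"\n" + audio

def recognize(audio: bytes, options: dict, host="localhost", port=9900) -> list:
    """Send one request to the Engine; return its newline-delimited JSON replies."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(options, audio))
        sock.shutdown(socket.SHUT_WR)  # close the write side to signal end of audio
        data = b"".join(iter(lambda: sock.recv(4096), b""))
    return [json.loads(line) for line in data.splitlines() if line]
```

This mirrors what the shell pipeline does: nc likewise closes its write side at end of input, which tells the Engine the audio stream is complete.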

Example: real-time streaming (command-line, local)

For live demonstration on a Mac (using brew to install SoX), stream audio from the microphone:

# This tool can read from an audio device (-d) and convert to 1-channel 16-bit WAV (-c1 -b16 -twav).
brew install sox

# The "partial" request option will output each word as it is recognized.
(echo '{"partial":true}'; sox -qV1 -d -c1 -b16 -twav - ) | nc localhost 9900

In this online mode, the Engine is suitable for real-time streaming; the example above should only use about 10-20% CPU.
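Replies from the Engine arrive as newline-delimited JSON, so a streaming client must parse them as they arrive, across arbitrary chunk boundaries. The sketch below is illustrative only; the "transcript" field name is a placeholder here, and the actual reply schema is documented in the TCP reference:

```python
import json

def iter_replies(chunks):
    """Yield JSON replies from a stream of byte chunks (newline-delimited JSON)."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line:
                yield json.loads(line)

# Example: two replies arriving split across an arbitrary chunk boundary.
chunks = [b'{"transcript": "hi"}\n{"transcr', b'ipt": "hi there"}\n']
replies = list(iter_replies(chunks))
```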

Example: batch processing (command-line, insecure)

In batch mode, processing is distributed across CPU threads to improve speed and throughput with pre-recorded files. For example, a 5-minute phone call (SW_4824 in the Switchboard benchmark) can be transcribed in seconds:

# Download an audio file; then send it to the mod9.io server, requesting a telephone-optimized model.
curl -O https://mod9.io/SW_4824_B.wav
(echo '{"batch-threads":10,"asr-model":"en-US_phone"}'; cat SW_4824_B.wav) | nc mod9.io 9900

This example uses a publicly accessible mod9.io server, which has high compute capacity and is intended for evaluation. When communicating over unencrypted TCP like this, privacy-sensitive test data should not be sent to the server.

Example: real-time streaming (WebSocket, secure)

When operated as a Docker container with a mounted SSL certificate, the Engine's functionality can be wrapped as a WebSocket server to securely communicate with web browser clients. See websocket-demo.html for a webpage with JavaScript code that can stream microphone audio to wss://mod9.io, i.e. publicly accessible but secured.

As a secure alternative to an earlier example, the Mod9 ASR Python SDK can install a command-line WebSocket client:

pip3 install mod9-asr
sox -qV1 -d -c1 -b16 -twav - | mod9-asr-websocket-client wss://mod9.io '{"partial":true}'

Example: batch processing (REST API, secure)

Operating a Docker container can also wrap the Engine with a REST API served over a secure communication layer:

curl https://mod9.io/rest/api/speech:recognize -X POST -H 'Content-Type: application/json' \
 -d '{"audio":{"uri":"https://mod9.io/SW_4824_B.wav"},"config":{"languageCode":"en-us"}}'

Note that this endpoint is designed for compatibility with Google's STT REST API, but further extends their functionality: the audio may be specified with an https:// (or http://) URI instead of gs://, and its duration is not limited to 60s.
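The same request can also be issued from Python's standard library. This is a sketch of the curl command above, not an SDK example:

```python
import json
from urllib import request

# Build the same request body as the curl example above.
body = {
    "audio": {"uri": "https://mod9.io/SW_4824_B.wav"},
    "config": {"languageCode": "en-us"},
}
req = request.Request(
    "https://mod9.io/rest/api/speech:recognize",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = json.load(request.urlopen(req))  # uncomment to actually send it
```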

Example: model customization

# Compare two ASR models: both are US telephony models; the second is faster but less accurate.
ASR1=mod9/en-US_phone-benchmark
ASR2=mod9/en-US_phone-smaller

# Run a Docker container, named "engine", listening on the host's forwarded port 9900.
# The --models.asr option will load the first listed model as the default ASR model.
# The --models.mutable option will enable clients to request dynamic changes to the models.
docker run -d --name=engine -p 9900:9900 mod9/asr engine --models.asr=$ASR1,$ASR2 --models.mutable

# Copy a test file that is packaged in the Engine's Docker container: it's someone talking about cars.
docker cp engine:/var/www/html/SW_4824_B.wav .

# The default model correctly recognizes "honda" three times, and is fast when batched over 10 threads.
(echo '{"batch-threads":10}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# The second model is faster, even when limited to 1 thread; but it missed recognizing "honda" once.
(echo '{"batch-threads":1,"asr-model":"'$ASR2'"}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# However, for a known topic of conversation, this model can be biased to recognize relevant words.
nc localhost 9900 <<< '{"command":"bias-words","asr-model":"'$ASR2'","words":[{"word":"honda","bias":5}]}'

# Try again: now the biased second model can correctly recognize all three instances of the word!
(echo '{"batch-threads":1,"asr-model":"'$ASR2'"}'; cat SW_4824_B.wav) | nc localhost 9900 | grep honda

# Send a SIGTERM signal, handled by the Engine as a graceful shutdown, before removing the container.
docker stop engine && docker rm engine

As shown in this example, models can be biased to favor (or disfavor) certain words. Another way to customize words would be to add new words with specific pronunciations. Note that these model customizations happen on the fly, and can even improve results while a real-time audio stream is being recognized.
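As a sketch of how such a bias-words command might be constructed programmatically: the `bias_words_command` helper below is illustrative, not part of the SDK, and a negative bias value is assumed to disfavor a word.

```python
import json

def bias_words_command(model: str, biases: dict) -> bytes:
    """Build a bias-words command for the given ASR model.

    `biases` maps each word to a bias value: positive favors, negative disfavors.
    """
    command = {
        "command": "bias-words",
        "asr-model": model,
        "words": [{"word": w, "bias": b} for w, b in biases.items()],
    }
    return json.dumps(command).encode() + b"\n"

msg = bias_words_command("mod9/en-US_phone-smaller", {"honda": 5})
# Send `msg` to the Engine over TCP, e.g. with socket.create_connection(("localhost", 9900)).
```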

When possible word sequences are highly constrained, such as directed dialog or command & control use cases, a custom grammar can also be specified with each request. This may also be suitable for embedded devices.

DevOps and model packaging

The shell-oriented examples above are further described in the TCP reference, which allows an application developer to directly interact with the Engine using standard network sockets. Several higher-level wrappers are also provided: the Python SDK, the REST API, and the WebSocket interface.

In contrast to a client who interacts with the Engine, an operator is responsible for deploying and maintaining it as a server. We provide a deployment guide with further documentation addressed to that role's perspective.

A third role is that of the model packager. Documentation of the expected model structure is provided so that the Engine may load any ASR models that are trained or built using Kaldi software. This generally requires specialized ASR expertise.

More information

This documentation also describes some additional endpoints that might be convenient.

To learn more about the Mod9 ASR Engine and its breadth of capabilities, contact help@mod9.com for a demonstration.


©2019-2021 Mod9 Technologies (Version 1.1.0)