The Mod9 ASR Engine consists of software and models for automatic speech recognition (ASR, a.k.a. speech-to-text).
Models:
- ASR: the default models work well for English and Spanish in a wide range of accents and conditions;
  or browse ASR models for more than 25 languages, packaged from the open-source community;
  or bring your own models trained on domain-specific speech data (Kaldi expertise required)!
- G2P: automatically generate pronunciations for new words when performing ASR model customization.
- NLP: improve the readability of transcripts with punctuation, capitalization, and number conversion.
Software: the Engine enables dynamic customization to improve upon the benchmark accuracy of these ASR models.
It offers many advanced options to produce uniquely detailed results that can help mitigate transcription errors.
This is pure Software (not as a Service) intended for self-hosted deployment by operators who need to:
- ensure data privacy (no cloud) and guarantee service availability (no downtime),
- maximize processing speed (>100x real-time) and minimize streaming latency (<100ms),
- or simply get started (no signup) with flexible licensing (free evaluation).
The software can be downloaded onto any Linux system, requiring no runtime dependencies:
curl -sL mod9.io/mod9-asr.tar | sudo tar xv -C /opt
This tarball containing the Engine binary and models will be installed under /opt as root.
Alternatively, it can be pulled from a public Docker repository:
docker pull mod9/asr
Note that the above examples install the latest version of the software and models.
For a stable deployment environment, it's best practice to specify a version in the file name or image tag,
e.g. mod9-asr-1.9.5.tar or mod9/asr:1.9.5.
To complete the Linux installation, optionally install ASR models packaged from the open-source community:
curl -sL mod9.io/mod9-asr-models-1.9.5-omnibus.tar | sudo tar xv -C /opt/mod9-asr
With Docker, this full set of models (over 25GB) can be pulled as a single image:
docker pull mod9/asr:1.9.5-omnibus
To only install models from a specific source, replace omnibus with kaldi_asr_org, tuda, vosk, or zamia.
If installed on a Linux host under /opt as indicated above, the Engine can be run in a standard way:
export PATH=$PATH:/opt/mod9-asr/bin
engine
This runs a TCP server that listens for connections on port 9900, after loading default English ASR, G2P, and NLP models.
This engine command called with no arguments is equivalent to explicitly specifying these default option values:
engine --models.asr=en --models.g2p=en --models.nlp=en --models.path=/opt/mod9-asr/models --port=9900
Note that other models are installed under /opt/mod9-asr/models and can be specified to load when the Engine starts.
Explore supported ASR models at mod9.io/opt/mod9-asr/models/asr; see also mod9.io/help (output of engine --help).
The Docker image specifies a default command of engine, so a simple docker run will work the same as above:
docker run mod9/asr
However, to illustrate Docker's utility, it can be useful to specify options to docker run as well as the engine command:
docker run -p 9900:9900 --memory=8G mod9/asr:1.9.5 engine --limit.memory-gib=6 --accept-license=yes
The -p 9900:9900 option maps the host port, enabling direct TCP access to the Engine from outside the container.
Note that this Docker container is limited to 8 GiB of memory, above which the system's OOM killer should be invoked.
The engine applies its own "soft" limit at 6 GiB, refusing new processing requests if its usage exceeds this threshold.
The --accept-license option can be used with an evaluation-licensed Engine to remove some processing restrictions.
See mod9.io/licensing or mod9.io/help-full (i.e. engine --license or engine --help-full) for more information.
An advantage of using the Docker image is that the Engine can be easily deployed as a secure web service:
docker run -v /etc/letsencrypt:/etc/letsencrypt -p 443:443 mod9/asr:1.9.5-omnibus https-engine
The https-engine entrypoint command runs an Apache web server on port 443 (HTTPS),
proxying requests through a REST API or WebSocket server instead of exposing the Engine over unsecured TCP on port 9900.
The container will also run certbot to register and/or renew SSL certificates issued by letsencrypt.org;
other certificates could also be loaded.
Suppose your server were hosting the mod9.io domain ... then the command above would operate a securely accessible
REST API (see https://mod9.io/rest/api/operations/) as well as a WebSocket server (try https://mod9.io/websocket-demo).
It would also serve some HTML documentation at https://mod9.io ... in fact, that's exactly how this site is deployed!
The next examples illustrate clients that communicate with an Engine at localhost or at mod9.io.
It is recommended that you install your own Engine to enable localhost usage, which will guarantee optimal performance and privacy.
The examples can also specify a publicly accessible Engine server hosted at mod9.io,
intended for convenient ad-hoc evaluation and to demonstrate high compute capacity.
However, performance may be affected by load from other users.
The REST API and WebSocket server use SSL encryption on port 443. Also, data is not retained on the mod9.io server.
However: do not send sensitive data to mod9.io on port 9900, due to its insecure use of unencrypted transport.
The Engine server was designed for convenient ad-hoc usage with command-line tools.
For example, the nc TCP client can connect to a local Engine at its default port 9900,
also using echo and cat commands to form the request:
# Download a 1-second test audio file and request a formatted transcript from a local Engine server.
curl -O https://mod9.io/hi.wav
(echo '{"transcript-formatted":true}'; cat hi.wav) | nc localhost 9900
This example illustrates the basic TCP usage: send JSON-encoded request options, followed by a stream of audio data.
NOTE: it'd be fine to replace localhost with mod9.io for examples like this where the test file is not privacy-sensitive.
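The JSON-then-audio framing above can also be produced programmatically. The following Python sketch builds the same request bytes as the (echo ...; cat ...) shell pipeline, assuming only that the Engine expects a newline-terminated JSON options object followed by raw audio data; the build_request helper and the placeholder audio bytes are illustrative, not part of any official SDK.

```python
import json

def build_request(options: dict, audio: bytes) -> bytes:
    """Mirror the (echo '{...}'; cat file.wav) pipeline: one JSON line, then audio."""
    return json.dumps(options).encode("utf-8") + b"\n" + audio

# Placeholder bytes; in practice: audio = open("hi.wav", "rb").read()
audio = b"RIFF....WAVE"
payload = build_request({"transcript-formatted": True}, audio)
```

The resulting payload could then be written to a TCP connection opened with socket.create_connection(("localhost", 9900)), reading the Engine's reply until the server closes the connection.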
In batch mode, processing is distributed across CPU threads to improve speed and throughput with pre-recorded files. For example, a 5-minute phone call (SW_4824 in the Switchboard benchmark) can be transcribed in seconds:
# Download a non-sensitive file and send to the mod9.io server, requesting a telephone-optimized model.
curl -O https://mod9.io/SW_4824_B.wav
(echo '{"batch-threads":10,"asr-model":"en-US_phone"}'; cat SW_4824_B.wav) | nc mod9.io 9900
If using localhost instead, try setting "batch-threads":-1 to use the total number of CPUs available on your system.
NOTE: this example also demonstrates how to specify a non-default ASR model using the "asr-model" request option.
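As a client-side illustration of these options, the sketch below assembles the batch request in Python, resolving the CPU count explicitly with os.cpu_count() instead of relying on the Engine's -1 convention; the option names are taken from the example above, while everything else is illustrative.

```python
import json
import os

# Equivalent in spirit to "batch-threads":-1, but resolved on the client side.
options = {
    "batch-threads": os.cpu_count() or 1,  # one worker thread per available CPU
    "asr-model": "en-US_phone",            # telephone-optimized model, as above
}
request_line = json.dumps(options).encode("utf-8") + b"\n"
```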
Run as a Docker container, the Engine can be wrapped with a REST API served over a secure communication layer:
curl https://mod9.io/rest/api/speech:recognize -X POST -H 'Content-Type: application/json'\
-d '{"audio":{"uri":"https://mod9.io/SW_4824_B.wav"},"config":{"languageCode":"en-us"}}'
NOTE: this endpoint is designed for compatibility with the Google STT REST API, but further extends its functionality:
the audio URI may be specified as https:// (or http://) instead of only gs://, and the audio duration is not limited to 60s.
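The same POST request can be assembled with Python's standard library alone. This sketch reuses the payload from the curl example above; the actual network call is left commented out, since it requires network access and a reachable server (substitute your own domain as appropriate).

```python
import json
from urllib import request

# Assemble the same Google STT-compatible request body as the curl example above.
body = json.dumps({
    "audio": {"uri": "https://mod9.io/SW_4824_B.wav"},
    "config": {"languageCode": "en-us"},
}).encode("utf-8")

req = request.Request(
    "https://mod9.io/rest/api/speech:recognize",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Uncomment to actually send the request (requires network access):
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```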
For live demonstration on a Mac (using brew to install SoX), stream audio from the microphone:
# This tool can read from an audio device (-d) and convert to 1-channel 16-bit WAV (-c1 -b16 -twav).
brew install sox
# The `"partial"` request option will output each word as it is recognized.
(echo '{"partial":true}'; sox -qV1 -d -c1 -b16 -twav - ) | nc localhost 9900
In this online mode, the Engine is suitable for real-time streaming; the example above should only use about 10-20% CPU.
NOTE: Docker Desktop for Mac may be needed for localhost; or use mod9.io and beware of privacy implications.
As a more secure alternative to the previous example, install the Python SDK to run a command-line WebSocket client:
pip3 install mod9-asr
sox -qV1 -d -c1 -b16 -twav - | mod9-asr-websocket-client wss://mod9.io '{"partial":true}'
See also the websocket-demo.html webpage; its embedded Javascript streams microphone audio to wss://mod9.io.
NOTE: run a Docker container with your own mounted SSL certificate, then replace mod9.io with your server's domain.
For ws://localhost or otherwise unsecured WebSocket use, replace the https-engine entrypoint with http-engine.
The Engine can improve accuracy with client-specified phrase biasing,
a.k.a. "boosting" in Google STT parlance.
To demonstrate, try to recognize hi.wav with a very small and inaccurate ASR model intended for testing purposes:
(echo '{"asr-model":"mod9/en-US_phone-smaller"}'; cat hi.wav) | nc mod9.io 9900
The Engine will not correctly transcribe "hi can you hear me".
However, the "phrase-biases" option could slightly boost "can you hear"
(a plausible adaptation for contact center applications) and recover the correct transcription:
(echo '{"asr-model":"mod9/en-US_phone-smaller","phrase-biases":{"can you hear":5}}'; cat hi.wav) | nc mod9.io 9900
NOTE: while Google STT also supports positive boosting, only the Engine allows negative biases for undesired phrases.
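Such biasing options are plain JSON and easy to generate programmatically. The sketch below builds a request combining a positive boost with a negative bias; the weights are illustrative, and the negatively biased phrase is a hypothetical homophone sequence, not taken from the examples above.

```python
import json

options = {
    "asr-model": "mod9/en-US_phone-smaller",
    "phrase-biases": {
        "can you hear": 5,      # positive weight: boost an expected phrase
        "hi can you here": -5,  # negative weight: penalize an undesired phrase (hypothetical)
    },
}
request_line = json.dumps(options).encode("utf-8") + b"\n"
```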
When possible word sequences are highly constrained, such as for directed dialog or voice commands, a custom grammar can also be specified with each request. This may also be suitable for embedded devices.
# Download a small grammar specifying potential commands for a voice-controlled text editor.
curl -O https://mod9.io/voice-editor-commands.json
# Download an example utterance of someone saying "copy previous word".
curl -O https://mod9.io/copy_previous_word.wav
(cat voice-editor-commands.json; cat copy_previous_word.wav) | nc mod9.io 9900
In examples above, the Engine can recognize user-specified phrases or grammars; this is independent for each request.
Another approach is to define custom words with the add-words command to recognize specific pronunciations.
There is also an add-grammar command to recognize a task-specific grammar alongside a large-vocabulary language model,
improving accuracy for number sequences with a structured pattern, for example; this feature is currently in "beta".
These model customizations happen on-the-fly, affecting any requests that are currently being processed.
This therefore requires cooperative clients, and the Engine operator must also enable the --models.mutable option.
The shell-oriented examples above are further described in the TCP reference, which allows an application developer to directly interact with the Engine using standard network sockets. Several high-level wrappers are also provided:
- A Python SDK that is fully compatible as a drop-in replacement for the google-cloud-speech library.
- A REST API that can serve as a local alternative to the Google Cloud Speech-to-Text API (non-streaming JSON).
- A WebSocket interface including a server and example client code to enable streaming web applications.
- A C++ library implementing a network client to communicate with an Engine server.
In contrast to a client who interacts with the Engine, an operator is responsible for deploying and maintaining it as a server. We provide a deployment guide with further documentation addressed to that role's perspective.
A third role is that of the model packager. Documentation of the expected model structure is provided so that the Engine may load any ASR models that are trained or built using Kaldi software. This generally requires specialized ASR expertise.
This documentation also includes some additional endpoints that might be convenient.
To learn more about the Mod9 ASR Engine and its breadth of capabilities, ask help@mod9.com for a demonstration.
©2019-2023 Mod9 Technologies (Version 1.9.5)