Python SDK | Mod9 ASR Engine

Mod9 ASR Python SDK

The Mod9 ASR Python SDK is a higher-level interface than the protocol described in the TCP reference documentation. Designed as a compatible drop-in replacement for the Google Cloud STT Python Client Library, Mod9's software enables privacy-protecting on-premise deployment, while also extending functionality of the Google Cloud service.

Quick start

Install the Mod9 ASR Python SDK:

pip3 install mod9-asr

To transcribe a sample audio file accessible at https://mod9.io/hello_world.wav:

from mod9.asr.speech import SpeechClient

client = SpeechClient(host='mod9.io', port=9900)

response = client.recognize(config={'language_code': 'en-US'},
                            audio={'uri': 'https://mod9.io/hello_world.wav'})

print(response)

This Python code instantiates the SpeechClient class with arguments that specify how to connect with a server running the ASR Engine. For convenience, such a server is deployed at mod9.io, listening on port 9900.

NOTE: Sensitive data should not be sent to this evaluation server, because the TCP connection is unencrypted.

For comparison, here's how the above code might look using Google's client library (and supplied credentials):

import json
from urllib.request import urlopen
from google.oauth2 import service_account

# The mod9.asr.speech module is a drop-in replacement for this module.
from google.cloud.speech import SpeechClient

# For demonstration purposes, use Mod9's GCP credentials (subject to a daily billing quota).
client = SpeechClient(credentials=service_account.Credentials.from_service_account_info(
    json.load(urlopen('https://mod9.io/gstt-demo-credentials.json'))))

# Note that Google only supports audio URIs at Google Cloud Storage (gs:// scheme).
response = client.recognize(config={'language_code': 'en-US'},
                            audio={'uri': 'gs://gstt-demo-audio/hello_world.wav'})

print(response)

Mod9's implementation of SpeechClient replicates Google's recognize() method, which is synchronous: it processes an entire request before returning a single response. This is suitable for transcribing pre-recorded audio files.

The config argument can be either a Python dict or RecognitionConfig object that contains metadata about the audio input, as well as supported configuration options that affect the output.

The audio argument can be either a Python dict or RecognitionAudio object that contains either content or uri. While content represents audio bytes directly, uri specifies a location where audio may be accessed.
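
As a minimal sketch of the content form, the helper below reads a local file and passes its bytes directly, using plain dicts for both arguments. The name recognize_file is hypothetical and not part of the SDK; the client is assumed to be a SpeechClient like the one in the quick start.

```python
from pathlib import Path


def recognize_file(client, path, language_code='en-US'):
    """Send the bytes of a local audio file to recognize() (hypothetical helper)."""
    # 'content' carries the audio bytes themselves, rather than a 'uri' reference.
    audio = {'content': Path(path).read_bytes()}
    config = {'language_code': language_code}
    return client.recognize(config=config, audio=audio)


# Example usage, assuming an ASR Engine is reachable as in the quick start:
#   from mod9.asr.speech import SpeechClient
#   client = SpeechClient(host='mod9.io', port=9900)
#   print(recognize_file(client, 'greetings.wav'))
```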

The output from recognize() is a RecognizeResponse that may contain SpeechRecognitionResult objects.
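
To illustrate how such a response might be consumed, the sketch below collects the top-ranked transcript of each result. It assumes the response follows the field names of Google's RecognizeResponse (results, alternatives, transcript, confidence); the helper name best_transcripts is hypothetical.

```python
def best_transcripts(response):
    """Return (transcript, confidence) for the first alternative of each result."""
    pairs = []
    for result in response.results:
        # A result may have no alternatives; skip it rather than fail.
        if not result.alternatives:
            continue
        top = result.alternatives[0]
        pairs.append((top.transcript, top.confidence))
    return pairs


# Example usage with the response from the quick start:
#   for transcript, confidence in best_transcripts(response):
#       print(transcript, confidence)
```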

Alternatively, as demonstrated in further example usage below, the streaming_recognize() method is used to return a generator that yields StreamingRecognitionResult objects while a real-time audio stream is being sent and processed.
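
A rough sketch of the streaming pattern is given here, modeled on the structure of Google's transcribe_streaming_mic.py sample but reading from a file instead of a microphone. The helper names audio_chunks and stream_file are hypothetical, and the file is assumed to hold raw 16-bit PCM audio at the stated sample rate.

```python
def audio_chunks(stream, chunk_size=4096):
    """Yield successive chunks of bytes from a binary file-like object."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def stream_file(path, host='mod9.io', port=9900, sample_rate_hertz=16000):
    """Print final transcripts for a raw LINEAR16 audio file (sketch only)."""
    # Imported here so that audio_chunks() is usable without the SDK installed.
    from mod9.asr import speech

    client = speech.SpeechClient(host=host, port=port)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate_hertz,
        language_code='en-US',
    )
    streaming_config = speech.StreamingRecognitionConfig(config=config)
    with open(path, 'rb') as f:
        # Each request carries one chunk of audio, as in Google's sample script.
        requests = (speech.StreamingRecognizeRequest(audio_content=chunk)
                    for chunk in audio_chunks(f))
        for response in client.streaming_recognize(streaming_config, requests):
            for result in response.results:
                if result.is_final and result.alternatives:
                    print(result.alternatives[0].transcript)
```

The chunk size of 4096 bytes is an arbitrary choice for illustration; a live source such as a microphone would replace the file object.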

Differences from Google Cloud STT

There are some notable differences:

  1. Google's RecognitionAudio.uri only allows files to be retrieved from Google Cloud Storage.
    The Mod9 ASR Python SDK accepts audio from more diverse sources:

    URI scheme            Access files stored ...
    gs://                 in Google Cloud Storage,
    s3://                 as AWS S3 objects,
    http:// or https://   via arbitrary HTTP services,
    file://               or on a local filesystem.

  2. Google's RecognitionAudio.content and SpeechClient.recognize() restrict audio to be less than 60 seconds.
    The Mod9 ASR Python SDK does not limit the duration of audio.

  3. Google's SpeechClient.streaming_recognize() restricts audio to be less than 5 minutes.
    The Mod9 ASR Python SDK does not limit the duration of streaming audio.

  4. Google's SpeechClient.long_running_recognize() can asynchronously process longer audio files.
    The Mod9 ASR Python SDK has not replicated this; it's better served with a Google-compatible Mod9 ASR REST API.

  5. Google Cloud STT supports a large number of languages for a variety of acoustic conditions.
    Mod9 ASR packages over 50 models for about 20 languages and dialects -- or bring your own models.
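
The URI schemes listed in the first difference can be checked before a request is sent. The sketch below is one way to do that; audio_from_uri and local_file_uri are hypothetical helpers, and whether the SDK expects exactly the file:// form produced by pathlib is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse

# Schemes listed above as accepted by the Mod9 ASR Python SDK.
SUPPORTED_SCHEMES = {'gs', 's3', 'http', 'https', 'file'}


def audio_from_uri(uri):
    """Build the `audio` argument for a URI, rejecting unlisted schemes."""
    scheme = urlparse(uri).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f'Unsupported URI scheme: {scheme!r}')
    return {'uri': uri}


def local_file_uri(path):
    """Convert a local path to an absolute file:// URI."""
    return Path(path).resolve().as_uri()


# Example usage:
#   audio = audio_from_uri(local_file_uri('greetings.wav'))
#   response = client.recognize(config={'language_code': 'en-US'}, audio=audio)
```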

Supported configuration options

The Mod9 ASR Python SDK provides two modules:

  • mod9.asr.speech implements a strict subset of Google's functionality.
  • mod9.asr.speech_mod9 extends this with additional functionality that Google does not support.

Option in config                             Accepted values in mod9.asr.speech  Extended support in mod9.asr.speech_mod9
asr_model                                    N/A                                 Select from loaded models
audio_channel_count [1]                      N/A                                 Integer
enable_automatic_punctuation [2]             False, True
enable_separate_recognition_per_channel [3]  N/A                                 True
enable_word_confidence                       False, True
enable_word_time_offsets                     False, True
encoding                                     "LINEAR16", "MULAW"                 "ALAW", "LINEAR24", "LINEAR32", "FLOAT32"
language_code                                (~20 languages/dialects)
latency [4]                                  N/A                                 0.01, ..., 3.0
max_alternatives [5]                         0, ..., 1000
max_phrase_alternatives [6]                  N/A                                 1, ..., 10000
max_word_alternatives [7]                    N/A                                 1, ..., 10000
model                                        "video", "phone_call", "default"
intervals_json [8]                           N/A                                 "[[Number, Number], …]"
options_json [9]                             N/A                                 "{…}"
sample_rate_hertz                            8000, ..., 48000
speed [10]                                   N/A                                 1, ..., 9

[1] Mod9 ASR: this is optional for non-raw audio. Internally, the Engine has a restriction on the number of channels.
[2] Mod9 ASR: enabling punctuation also applies capitalization and number formatting.
[3] Mod9 ASR: default is True; Mod9 does not support a value of False, wherein only the first channel is recognized.
[4] Mod9 ASR: lower values may improve responsiveness, higher values may decrease CPU usage; default is 0.24 seconds.
[5] Google STT: only allows up to 30 transcript-level alternatives (i.e. N-best) to be requested, but often results in fewer.
[6] Mod9 ASR: a more useful representation of ambiguity in speech, as short sequences of many-to-many word mappings.
[7] Mod9 ASR: a more compact representation, but restricted to one-to-one word mappings. (cf. IBM Watson STT API)
[8] Mod9 ASR: provide a speech segmentation, useful for ensuring that results are aligned with speaker turns.
[9] Mod9 ASR: arbitrary request options to the Mod9 ASR Engine may be specified to override or extend functionality.
[10] Mod9 ASR: lower values may improve recognition alternatives, higher values may decrease CPU usage; default is 5.
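
As an illustration of the extended options, the sketch below builds a config dict and checks its numeric values against the ranges in the table. The check_config helper is hypothetical and is not something the SDK provides. The intervals are assumed to be start and end times in seconds, and mod9.asr.speech_mod9 is assumed to expose a SpeechClient with the same constructor as mod9.asr.speech.

```python
import json

# Numeric ranges copied from the table above: option -> (minimum, maximum).
NUMERIC_RANGES = {
    'latency': (0.01, 3.0),
    'max_alternatives': (0, 1000),
    'max_phrase_alternatives': (1, 10000),
    'max_word_alternatives': (1, 10000),
    'sample_rate_hertz': (8000, 48000),
    'speed': (1, 9),
}


def check_config(config):
    """Raise ValueError if a numeric option is outside its documented range."""
    for option, (low, high) in NUMERIC_RANGES.items():
        if option in config and not low <= config[option] <= high:
            raise ValueError(f'{option}={config[option]} is outside [{low}, {high}]')
    return config


config = check_config({
    'language_code': 'en-US',
    'enable_word_time_offsets': True,
    'max_phrase_alternatives': 3,
    'latency': 0.1,
    # intervals_json is a JSON string of [start, end] pairs (assumed seconds).
    'intervals_json': json.dumps([[0.0, 2.5], [2.5, 6.0]]),
})

# Example usage (assumption: speech_mod9 provides a compatible SpeechClient):
#   from mod9.asr.speech_mod9 import SpeechClient
#   client = SpeechClient(host='mod9.io', port=9900)
#   response = client.recognize(config=config, audio={'uri': 'https://mod9.io/hello_world.wav'})
```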

Installation and setup

Install the Mod9 ASR Python SDK from PyPI:

pip3 install mod9-asr

Connect to the Mod9 ASR Engine

The Python SDK must connect to an ASR Engine server to transcribe audio.

It may be most expedient to use the evaluation server running at mod9.io:

export ASR_ENGINE_HOST=mod9.io

However, because this TCP transport is unencrypted and traverses the public Internet, customers are strongly advised that sensitive data should not be sent to this evaluation server. No data privacy is implied, nor service level promised.

The ASR Engine can also be run locally on bare-metal Linux, or in a Docker container. See installation instructions.
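
Before sending audio, it can help to confirm that the Engine is reachable. The sketch below reads the host from ASR_ENGINE_HOST and opens a plain TCP connection. The ASR_ENGINE_PORT variable and the localhost fallback are assumptions not stated on this page, the default port 9900 is taken from the quick start, and both helper names are hypothetical.

```python
import os
import socket


def engine_address(environ=os.environ):
    """Return (host, port) for the ASR Engine from environment variables."""
    host = environ.get('ASR_ENGINE_HOST', 'localhost')  # Fallback is an assumption.
    port = int(environ.get('ASR_ENGINE_PORT', '9900'))  # Variable name is an assumption.
    return host, port


def engine_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the Engine can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage:
#   host, port = engine_address()
#   print(engine_is_reachable(host, port))
```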

Compare with Google Cloud STT (optional)

This Mod9 ASR Python SDK is designed to emulate the Google Cloud STT Python Client Library, and we encourage developers to compare our respective software and services side-by-side to ensure compatibility.

Google Cloud credentials are required for such comparisons, so we share gstt-demo-credentials.json to facilitate testing. Download and enable these demo credentials by setting an environment variable in your current shell:

curl -O https://mod9.io/gstt-demo-credentials.json
export GOOGLE_APPLICATION_CREDENTIALS=gstt-demo-credentials.json

Sensitive data should not be used with these shared demo credentials, as it could be seen by other users who are testing. A daily quota is set to prevent abuse of these limited-use credentials, so Google's service may at times be unavailable.

Example usage

The Mod9 ASR Python SDK is a drop-in replacement for the Google Cloud STT Python Client Library.
To demonstrate this compatibility, consider the sample scripts published by Google, which can be downloaded with a command-line tool:

curl -LO github.com/googleapis/python-speech/raw/main/samples/snippets/transcribe.py
curl -LO github.com/googleapis/python-speech/raw/main/samples/snippets/transcribe_auto_punctuation.py
curl -LO github.com/googleapis/python-speech/raw/main/samples/microphone/transcribe_streaming_mic.py

Modify the lines that import from google.cloud so that they import from mod9.asr instead, for example with a stream editor:

sed s/google.cloud/mod9.asr/ transcribe.py > transcribe_mod9.py
sed s/google.cloud/mod9.asr/ transcribe_auto_punctuation.py > transcribe_auto_punctuation_mod9.py
sed s/google.cloud/mod9.asr/ transcribe_streaming_mic.py > transcribe_streaming_mic_mod9.py

The mod9ified sample scripts are named as *_mod9.py and differ only in the import lines. To verify this:

diff transcribe.py transcribe_mod9.py
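
The same check can be sketched in Python with the standard difflib module. The text below is a small stand-in for a sample script rather than the actual contents of transcribe.py, and changed_lines is a hypothetical helper.

```python
import difflib


def changed_lines(original_text, modified_text):
    """Return the lines removed from ('- ') or added to ('+ ') the original."""
    diff = difflib.ndiff(original_text.splitlines(), modified_text.splitlines())
    return [line for line in diff if line.startswith(('- ', '+ '))]


# Stand-in for a sample script; the real files are downloaded with curl above.
original = (
    'import sys\n'
    'from google.cloud import speech\n'
    'client = speech.SpeechClient()\n'
)
modified = original.replace('google.cloud', 'mod9.asr')

for line in changed_lines(original, modified):
    print(line)
```

Only the import line should appear in the output, once as removed and once as added.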

The modified scripts do not communicate with Google Cloud; the following example usage can even be demonstrated on a laptop with no Internet connection — e.g. if the Mod9 ASR Engine is deployed on localhost.

Transcribe audio files with recognize()

Download sample audio files, greetings.wav (2s @ 16kHz) and SW_4824_B.wav (5m @ 8kHz):

curl -L -O mod9.io/greetings.wav -O mod9.io/SW_4824_B.wav

Run the modified sample script:

python3 transcribe_mod9.py greetings.wav

If it can connect to the Mod9 ASR Engine, the script should print Transcript: greetings world.

Google's recognize() method only allows audio duration up to 60 seconds. To demonstrate that Mod9 ASR extends support for arbitrarily long durations, run another script (which is configured for 8kHz audio and transcript formatting):

python3 transcribe_auto_punctuation_mod9.py SW_4824_B.wav

To compare with Google Cloud STT (optional), run the original unmodified scripts:

python3 transcribe.py greetings.wav
python3 transcribe_auto_punctuation.py SW_4824_B.wav

The first script produces the same result as Mod9 ASR; meanwhile, Google STT will fail to process the longer audio file.

Transcribe live audio with streaming_recognize()

The streaming scripts will require PortAudio and PyAudio for OS-dependent microphone access. To install on a Mac:

brew install portaudio && pip3 install pyaudio

Running this sample script will record audio from your microphone and print results in real-time:

python3 transcribe_streaming_mic_mod9.py

To compare with Google Cloud STT (optional), run the unmodified script:

python3 transcribe_streaming_mic.py

It can be especially helpful to run both of these scripts at the same time, comparing them side-by-side in different windows. Note that the unmodified script using Google STT will eventually disconnect after reaching Google's 5-minute streaming limit.

Next steps

See also the Mod9 ASR REST API, which can run a Google-compatible service that is accessible to HTTP clients.
This is especially recommended for asynchronous batch-processing workloads, with a POST followed by GET.

The TCP reference documentation describes the lower-level protocol that is abstracted by the Python SDK and REST API. The TCP interface can enable more extensive functionality, including user-defined words and domain-specific grammar.

Advanced configuration of the Mod9 ASR Engine is described in the deployment guide.
Contact support@mod9.com for additional assistance.


©2019-2022 Mod9 Technologies (Engine 1.9.3 : Python SDK 1.11.2)