REST API | Mod9 ASR Engine


Mod9 ASR REST API

The Mod9 ASR REST API is a higher-level interface than the protocol described in the TCP reference documentation. Designed as a compatible drop-in replacement for the Google Cloud STT REST API, it also extends functionality beyond that offered by Google. It is a lightweight wrapper built using open-source libraries such as Flask-RESTful.

Quick start

To use the Mod9 ASR REST API deployed on a publicly accessible evaluation server:

curl https://mod9.io/rest/api/speech:recognize -H 'Content-Type: application/json' \
     -d '{"audio":{"uri":"https://mod9.io/hi.wav"},"config":{"enableAutomaticPunctuation":true}}'

Note that this does not require authentication, and the audio URI can start with https:// (not limited to gs://).

The API also implements functionality that Google Cloud STT does not support, such as phrase-level alternatives:

curl https://mod9.io/rest/api/speech:recognize -H 'Content-Type: application/json' \
     -d '{"audio":{"uri":"https://mod9.io/hi.wav"},"config":{"maxPhraseAlternatives":3}}'

Audio files of any duration can be processed as a synchronous request, with exceptional speed:

time curl https://mod9.io/rest/api/speech:recognize -H 'Content-Type: application/json' \
     -d '{"audio":{"uri":"https://mod9.io/SW_4824_B.wav"},"config":{"languageCode":"en-us"}}'

Easily run the server locally on your own machine:

docker run -d --name=mod9-asr -p 8080:80 mod9/asr http-engine
curl localhost:8080/rest/api/operations/
docker rm -f mod9-asr

Or on a remote server secured over HTTPS (free SSL certificate registration included):

docker run -it --rm -p 80:80 -p 443:443 mod9/asr https-engine

Differences from Google Cloud STT

As with the Mod9 ASR Python SDK, there are some notable differences with respect to Google's service:

  1. Google's audio.uri only allows files to be retrieved from Google Cloud Storage.
    The Mod9 ASR REST API accepts audio from more diverse sources:

    URI Scheme             Access files stored ...
    gs://                  in Google Cloud Storage,
    s3://                  as AWS S3 objects,
    http:// or https://    via arbitrary HTTP services,
    file://                or on a local filesystem.
  2. Google only accepts audio files of 60 seconds or less using the audio.content input field.
    The Mod9 ASR REST API does not limit the duration of audio inputs.

  3. Google's synchronous /speech:recognize endpoint only accepts audio files of 60 seconds or less.
    The Mod9 ASR REST API does not limit the duration of audio accepted at the synchronous endpoint.

  4. Google supports a large number of languages for a variety of acoustic conditions.
    Mod9 ASR packages over 50 models for about 20 languages and dialects -- or bring your own models.

Supported configuration options

The Mod9 ASR REST API supports a subset of Google's functionality, while also adding some unique extensions.
The configuration options supported are tabulated below:

Option in config                               Accepted Google-compatible values    Extended Mod9 support
asr_model                                      N/A                                  Select from loaded models
audio_channel_count [1]                        N/A                                  Integer
enable_automatic_punctuation [2]               False, True
enable_separate_recognition_per_channel [3]    N/A                                  True
enable_word_confidence                         False, True
enable_word_time_offsets                       False, True
encoding                                       "LINEAR16", "MULAW"                  "ALAW", "LINEAR24", "LINEAR32", "FLOAT32"
language_code                                  (~20 languages/dialects)
latency [4]                                    N/A                                  0.01, ... , 3.0
max_alternatives [5]                           0, ... , 1000
max_phrase_alternatives [6]                    N/A                                  1, ... , 10000
max_word_alternatives [7]                      N/A                                  1, ... , 10000
model                                          "video", "phone_call", "default"
intervals_json [8]                             N/A                                  "[[Number, Number], …]"
options_json [9]                               N/A                                  "{…}"
sample_rate_hertz                              8000, ..., 48000
speed [10]                                     N/A                                  1, ... , 9

[1] Mod9 ASR: this is optional for non-raw audio. Internally, the Engine has a restriction on the number of channels.
[2] Mod9 ASR: enabling punctuation also applies capitalization and number formatting.
[3] Mod9 ASR: default is True and Mod9 does not support a value of False wherein only the first channel is recognized.
[4] Mod9 ASR: lower values may improve responsiveness, higher values may decrease CPU usage; default is 0.24 seconds.
[5] Google STT: only allows up to 30 transcript-level alternatives (i.e. N-best) to be requested, but often results in fewer.
[6] Mod9 ASR: more useful representation of ambiguity in speech, as short sequences of many-to-many word mappings.
[7] Mod9 ASR: a more compact representation, but restricted as one-to-one word mappings. (cf. IBM Watson STT API)
[8] Mod9 ASR: provide a speech segmentation, useful for ensuring that results are aligned with speaker turns.
[9] Mod9 ASR: arbitrary request options to the Mod9 ASR Engine may be specified to override or extend functionality.
[10] Mod9 ASR: lower values may improve recognition alternatives, higher values may decrease CPU usage; default is 5.
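
The table lists options in snake_case, while the curl examples in this document pass them in lowerCamelCase (e.g. sampleRateHertz, enableWordConfidence). The following is a minimal Python sketch of a client-side helper that converts option names and checks the numeric ranges tabulated above. It assumes every option follows the same camelCase convention as the examples; the helper names and the sample segmentation values are hypothetical, and the server remains the authority on what it accepts.

```python
import json

# Numeric ranges copied from the table above; other options pass through unchecked.
NUMERIC_RANGES = {
    "latency": (0.01, 3.0),
    "speed": (1, 9),
    "max_phrase_alternatives": (1, 10000),
    "max_word_alternatives": (1, 10000),
}


def to_camel_case(option):
    """Convert e.g. 'enable_word_confidence' to 'enableWordConfidence'."""
    head, *rest = option.split("_")
    return head + "".join(word.capitalize() for word in rest)


def build_config(**options):
    """Return a config dict with camelCase keys, checking the tabulated ranges."""
    config = {}
    for name, value in options.items():
        if name in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        config[to_camel_case(name)] = value
    return config


config = build_config(
    language_code="en-us",
    enable_word_confidence=True,
    speed=1,
    # intervals_json is a string that itself contains JSON (hypothetical segmentation).
    intervals_json=json.dumps([[0.0, 2.5], [2.5, 5.0]]),
)
print(json.dumps(config, indent=2))
```

Validating on the client only catches obvious mistakes early; it does not replace the checks performed by the REST API itself.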

Server deployment

The REST API can be deployed using the Mod9 ASR Docker image (recommended) or as a standalone Python Flask app.

Development: run REST API server in a local Docker container

For local development or when encrypted transport is not required, the REST API can be served on localhost, with port 8080 forwarded to the Docker container's port 80, for example:

docker run -it --rm -p 8080:80 mod9/asr http-engine

For the example usage section that follows, define an environment variable (in another terminal window):

REST_API=http://localhost:8080/rest/api

Deployment: run REST API server on a remote Docker container

To deploy on a remotely accessible server, the REST API should generally be deployed over encrypted HTTPS:

docker run -it --rm -p 80:80 -p 443:443 mod9/asr https-engine

This will interactively register a free SSL certificate issued by Let's Encrypt. Assuming you control the global DNS record of example.com, define an environment variable (in another terminal window) for later use in the example usage:

REST_API=https://example.com/rest/api

Beware of rate limits if certificates are interactively re-registered frequently, such as during active development.

For non-interactive deployment, the SSL credentials could also be cached and mounted from the host filesystem instead:

docker run -d -p 443:443 -v /etc/letsencrypt:/etc/letsencrypt mod9/asr https-engine

Note that the https-engine entrypoint also runs a certbot daemon that will automatically renew the certificate.

For a more scalable production deployment, it is generally better to use a load balancer with SSL termination.

Alternative: run a standalone Python Flask app

Context

If deployed in a Docker container, the Mod9 ASR REST API is managed via an Apache webserver and WSGI middleware. This should robustly load balance across request-handling threads, and (re)start processes for graceful error recovery. Running the REST API more directly as a standalone application is not generally recommended for deployment, though it might be convenient in certain situations: for example, when developing extensions to this open-source Python Flask app to add domain-specific functionality such as user authentication or a persistent datastore for the operations endpoint.

Install the Mod9 ASR Python SDK

Install the Mod9 ASR Python SDK, which includes the REST API as a Flask app:

pip3 install mod9-asr

Provide an ASR Engine server

The REST API must connect to an ASR Engine server to transcribe audio.

It may be most expedient to use the evaluation server running at mod9.io:

export ASR_ENGINE_HOST=mod9.io

However, because this TCP transport is unencrypted and traverses the public Internet, customers are strongly advised that sensitive data should not be sent to this evaluation server. No data privacy is implied, nor service level promised.

The ASR Engine can also be run locally on bare-metal Linux, or in a Docker container. See installation instructions.

Run the REST API as a script

The REST API server can be launched using a script that is installed during the pip3 installation. The server connects to a host and port that are controlled either by command-line options, as demonstrated below, or by the environment variables ASR_ENGINE_HOST and ASR_ENGINE_PORT, respectively (with defaults of localhost and 9900). For example, if the ASR Engine is running locally, exposed at port 9900, the standalone Flask app can be launched as:

mod9-asr-rest-api

Or to connect to the mod9.io evaluation server:

mod9-asr-rest-api --host=mod9.io

This REST API will listen by default at 127.0.0.1:8080.
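
As an illustration of the environment-variable fallback described above, here is a minimal Python sketch. It is not the script's actual implementation: the function name is hypothetical, and only the variable names and defaults (localhost and 9900) come from this document.

```python
import os


def resolve_engine_address(environ=os.environ):
    """Return (host, port), falling back to the documented defaults."""
    host = environ.get("ASR_ENGINE_HOST", "localhost")
    port = int(environ.get("ASR_ENGINE_PORT", "9900"))
    return host, port


# A plain dict stands in for the process environment in these examples.
print(resolve_engine_address({}))
print(resolve_engine_address({"ASR_ENGINE_HOST": "mod9.io"}))
```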

Sensitive data should not be posted to the mod9.io Engine evaluation server as no attempt is made to provide data privacy or transport encryption. Additionally, as this is a service provided for convenience of evaluation to Mod9 customers, there is no SLA.

The REST API can be configured to retrieve audio from a subset of the available URI schemes. By default, none of the schemes (file://, gs://, http://, https://, and s3://) are allowed. URI schemes are individually allowed by passing corresponding flags at runtime. E.g., to only allow http:// and https://, use:

mod9-asr-rest-api --allow-http-uri --allow-https-uri

Schemes can also be allowed by passing a comma-delimited string via the environment variable ASR_REST_API_ALLOWED_URI_SCHEMES. For example, to enable the file://, gs://, and s3:// schemes:

ASR_REST_API_ALLOWED_URI_SCHEMES=file,gs,s3 mod9-asr-rest-api

The behaviors of the file://, gs://, and s3:// schemes are worth clarifying because they may be unexpected. These schemes are for URIs in reference to the REST API server, not client. For example, passing a URI of file:///path/to/audio.wav is referring to a file on the REST API server host, not the client's machine. Similarly, gs:// and s3:// require the REST API server host to have Google Cloud or AWS authorization configured to access the given files -- any authorization that the client has will not be used, which differs from the behavior of the Google Cloud STT service. These URIs can be particularly useful, then, when running a REST API on the same machine that client requests are sent from -- or when the client and remote server have the same level of cloud service authorizations.
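
The allowlist idea can be sketched in a few lines of Python. This is an illustration of the concept only, under the assumption that a URI is accepted when its scheme appears in the comma-delimited list; the function names are hypothetical and the real server may apply additional checks.

```python
from urllib.parse import urlparse

# The five schemes named in this document.
KNOWN_SCHEMES = {"file", "gs", "http", "https", "s3"}


def parse_allowed_schemes(value):
    """Parse a comma-delimited string such as 'file,gs,s3' into a set of schemes."""
    schemes = {item.strip().lower() for item in value.split(",") if item.strip()}
    unknown = schemes - KNOWN_SCHEMES
    if unknown:
        raise ValueError(f"unknown URI scheme(s): {sorted(unknown)}")
    return schemes


def is_uri_allowed(uri, allowed_schemes):
    """Return True if the URI's scheme is in the allowlist."""
    return urlparse(uri).scheme.lower() in allowed_schemes


allowed = parse_allowed_schemes("file,gs,s3")
print(is_uri_allowed("file:///path/to/audio.wav", allowed))        # True
print(is_uri_allowed("https://mod9.io/hello_world.wav", allowed))  # False
```

An empty string parses to an empty set, which mirrors the documented default of allowing no schemes.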

To prepare for the example usage section below, define an environment variable for this local REST API:

export REST_API=http://localhost:8080

where 8080 is the default port of the standalone Flask app intended for development purposes.

Optional: activate Google Cloud STT REST API

Google Cloud credentials are needed for comparing between the Google Cloud STT REST API and the Mod9 ASR REST API. Load your own (with STT permissions), or use this: gstt-demo-credentials.json.

These demo credentials are provided by Mod9 for convenience of testing and are shared with multiple customers. Sensitive data should not be sent with the demo credentials. If usage exceeds a relatively low daily quota, the credentials might be temporarily disabled for the rest of the day. The commands below download and activate the demo credentials.

curl -sLO https://mod9.io/gstt-demo-credentials.json
gcloud auth activate-service-account gstt-demo@mod9-demo.iam.gserviceaccount.com --key-file=gstt-demo-credentials.json

Example usage

The examples below outline usage of the transcription endpoints provided by the REST API. The server deployment instructions above set the environment variable REST_API according to the chosen setup method.

To instead use the publicly accessible evaluation server, set the environment variable as:

export REST_API=https://mod9.io/rest/api

Sending synchronous requests to the /speech:recognize endpoint

Each request to the Google Cloud STT REST API (as well as the compatible Mod9 ASR REST API) consists of a JSON object with two top-level fields: config and audio.

Field     Description
config    Metadata about the audio, such as the sample rate, as well as processing options such as whether to output word-level confidence estimates or the time intervals over which each word is spoken.
audio     Contains either a content field which must be Base64-encoded bytes, or a uri field specifying an audio file's location. (Google only allows gs://. Mod9 can allow file://, gs://, http://, https://, s3://).

Unfortunately, Google restricts the duration of audio accepted at this endpoint to 60 seconds or less. The Mod9 ASR REST API does not restrict the duration of accepted audio at the synchronous endpoint, allowing for simpler access to ASR transcription for use cases that have audio longer than 60 seconds and do not require an asynchronous response.

The following creates a JSON request encoding hello_world.wav, an audio file that should be transcribed as "hello world".

curl -sLO mod9.io/hello_world.wav
echo '{"audio": {"content": "'$(base64 < hello_world.wav | tr -d '\n')'"},
       "config": {"sampleRateHertz": 8000, "languageCode": "en-us"}}' > audio-content-request.json

Send this request to the REST API using the following curl command:

curl ${REST_API}/speech:recognize -H 'Content-Type: application/json' -d @audio-content-request.json
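
The same request body can be built in Python. This is a minimal sketch that mirrors the shell commands above; the function name is hypothetical, and placeholder bytes stand in for the real file so that the example runs without downloading anything.

```python
import base64
import json


def build_content_request(audio_bytes, sample_rate_hertz=8000, language_code="en-us"):
    """Return the request body with the audio embedded as Base64 text."""
    return {
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
        "config": {"sampleRateHertz": sample_rate_hertz, "languageCode": language_code},
    }


# Placeholder bytes stand in for the contents of hello_world.wav, e.g.
# audio_bytes = open("hello_world.wav", "rb").read()
audio_bytes = b"RIFF....WAVEfmt "
request_body = json.dumps(build_content_request(audio_bytes))
print(request_body)
```

Unlike the base64 command-line tool, base64.b64encode does not insert line breaks, so no equivalent of the tr step is needed.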

To send a request that makes use of a web-hosted file and the audio.uri field:

echo '{"audio": {"uri": "https://mod9.io/hello_world.wav"},
       "config": {"sampleRateHertz": 8000, "languageCode": "en-us"}}' > audio-uri-request.json
curl ${REST_API}/speech:recognize -H 'Content-Type: application/json' -d @audio-uri-request.json
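
For clients written in Python, the curl invocation above corresponds roughly to the following sketch, which uses only the standard library. The helper name is hypothetical. The request is built but not sent, since sending requires network access to a running server.

```python
import json
import urllib.request


def make_recognize_request(rest_api, body):
    """Build (but do not send) a POST request for the synchronous endpoint."""
    return urllib.request.Request(
        url=f"{rest_api.rstrip('/')}/speech:recognize",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = {
    "audio": {"uri": "https://mod9.io/hello_world.wav"},
    "config": {"sampleRateHertz": 8000, "languageCode": "en-us"},
}
request = make_recognize_request("https://mod9.io/rest/api", body)
print(request.get_method(), request.full_url)

# Sending requires network access to the server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```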

Audio longer than 60 seconds can also be processed. For example, try this 5-minute audio file:

echo '{"audio": {"uri": "https://mod9.io/SW_4824_B.wav"},
       "config": {"sampleRateHertz": 8000, "languageCode": "en-us"}}' > long-audio-uri-request.json
curl ${REST_API}/speech:recognize -H 'Content-Type: application/json' -d @long-audio-uri-request.json

Depending on the ASR Engine configured for ${REST_API}, a response could be received in less than 4 seconds, nearly 100x faster than real-time. Because of this speed, the synchronous /speech:recognize endpoint can likely be used for the majority of real-world use cases (e.g. audio shorter than one hour) without triggering HTTP or TCP timeouts, which are typically on the order of 30-90 seconds and applied at various levels of the application or network stack.
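
The arithmetic behind these figures is easy to check. A 4-second response for the 5-minute file works out to 75x real-time, which is a lower bound because the response may arrive sooner (a 3-second response would be 100x). The one-hour case below is a hypothetical extrapolation at that same rate, not a measured result.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """How many times faster than real time the audio was processed."""
    return audio_seconds / processing_seconds


# The 5-minute example file, with the 4-second response time quoted above.
print(real_time_factor(5 * 60, 4))    # 75.0

# Hypothetical: one hour of audio processed at that same 75x rate.
print(60 * 60 / 75)                   # 48.0 seconds
```

At 48 seconds, a one-hour file would fall inside the 30-90 second timeout range mentioned above, which is why audio of roughly that length is the suggested practical limit for the synchronous endpoint.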

Google Cloud STT REST API (optional, for comparison purposes)

With Google Cloud credentials loaded, compare the response of Google to that of the Mod9 ASR REST API above:

curl https://speech.googleapis.com/v1p1beta1/speech:recognize \
     -H 'Authorization: Bearer '$(gcloud auth print-access-token) \
     -H 'Content-Type: application/json' -d @audio-content-request.json

Since Google does not allow audio URIs other than on Google Cloud Storage (gs://), and because their synchronous endpoint limits audio to 60 seconds or less, the following request with the 5-minute audio file will fail:

curl https://speech.googleapis.com/v1p1beta1/speech:recognize \
     -H 'Authorization: Bearer '$(gcloud auth print-access-token) \
     -H 'Content-Type: application/json' -d @long-audio-uri-request.json

Sending asynchronous requests to the /speech:longrunningrecognize endpoint

The Mod9 ASR REST API wrapper also supports an asynchronous endpoint. Note that with the Google Cloud STT REST API, audio longer than 60 seconds must be submitted using the audio.uri field (and must be stored on Google Cloud Storage); the Mod9 ASR REST API wrapper offers the flexibility to use audio.content for audio longer than 60 seconds, or to use audio.uri with its broader support for additional URI schemes.

The /speech:longrunningrecognize endpoint has a different response, as well as a different way to retrieve the transcription results. The response to a properly formatted request is an Operation JSON object that contains a name field. The names of asynchronous requests can be listed at the /operations/ endpoint; the status and completed results can be viewed by appending the request name to the /operations/ endpoint as demonstrated below.

The following command modifies the JSON request for the longer 5-minute audio, but now further enables word-level confidence scores (which are also supported by the synchronous endpoint above), while requesting a lower speed setting. These options intentionally slow down processing in order to demonstrate the asynchronous polling:

echo '{"audio": {"uri": "https://mod9.io/SW_4824_B.wav"},
       "config": {"sampleRateHertz": 8000, "languageCode": "en-us",
                  "enableWordConfidence": true, "speed": 1}}' > slow-request.json

Submit a request to the asynchronous endpoint and parse the name field in the response JSON:

name=$(curl -s ${REST_API}/speech:longrunningrecognize \
            -H 'Content-Type: application/json' \
            -d @slow-request.json | grep '"name"' | sed -E 's,.*"name": "(.*)",\1,')

(This is much nicer using the jq tool, e.g. name=$(curl ... | jq -r .name))
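
In a Python client, the name can be read by parsing the response as JSON instead of matching text. This is a minimal sketch; the function name and the sample response body are hypothetical, and a real name is assigned by the server.

```python
import json


def extract_operation_name(response_text):
    """Return the 'name' field of an Operation JSON object."""
    operation = json.loads(response_text)
    return operation["name"]


# Hypothetical response body; a real name is assigned by the server.
sample_response = '{"name": "1234567890"}'
print(extract_operation_name(sample_response))  # 1234567890
```

Parsing the JSON does not depend on how the response is formatted, unlike the grep and sed pipeline, which expects the name field on its own line.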

The name can then be polled with the /operations/ endpoint to check on the status of the request:

curl ${REST_API}/operations/$name

This polling request can be repeated periodically until the processing is completed, which may take about 30 seconds for this example, with results finally contained in the subsequent JSON responses.
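
The polling loop can be sketched as follows. This assumes that the Operation object follows Google's convention of a boolean done field that becomes true on completion, which this document does not state explicitly. The function and parameter names are hypothetical, and the fetch step is passed in as a callable so that the example runs without a server; in real use it would issue a GET to ${REST_API}/operations/ followed by the name.

```python
import time


def poll_operation(fetch_operation, name, interval_seconds=5.0, max_attempts=60,
                   sleep=time.sleep):
    """Call fetch_operation(name) until the result reports done, then return it."""
    for attempt in range(max_attempts):
        operation = fetch_operation(name)
        if operation.get("done"):
            return operation
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"operation {name} not done after {max_attempts} polls")


# Simulated server: reports completion on the third poll (hypothetical payloads).
responses = iter([{"name": "42"}, {"name": "42"}, {"name": "42", "done": True}])
result = poll_operation(lambda name: next(responses), "42", sleep=lambda s: None)
print(result)
```

Bounding the number of attempts keeps a client from polling forever if the server is restarted and the in-memory operation is lost.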

All the name values of requests submitted since the server was started can be viewed at the /operations/ endpoint:

curl ${REST_API}/operations/

Note: asynchronous results are stored in memory and will not persist if the Mod9 ASR REST API server is restarted.

Google Cloud STT REST API (optional, for comparison purposes)

With Google Cloud authentication properly loaded, compare the response of Google to that of the Mod9 ASR REST API above, capturing the name response:

name=$(curl -s https://speech.googleapis.com/v1p1beta1/speech:longrunningrecognize \
            -H 'Authorization: Bearer '$(gcloud auth print-access-token) \
            -H 'Content-Type: application/json' \
            -d @audio-content-request.json | grep '"name"' | sed -E 's,.*"name": "(.*)",\1,')

The status and results (when finished) can be checked:

curl https://speech.googleapis.com/v1p1beta1/operations/$name -H 'Authorization: Bearer '$(gcloud auth print-access-token)

Next steps

The Mod9 ASR REST API is intended as a starting point for convenient development. For deployment in a production system, it may be necessary to extend the open-source Python Flask app to add domain-specific functionality such as user authentication or a persistent datastore for the operations endpoint.

Our team would be glad to provide assistance in any way: please contact help@mod9.com.

Lastly, note that running this REST API via the Docker image has the side effect of also running a Mod9 ASR WebSocket server. This addresses a notable deficiency in the Google Cloud STT offering, implementing an industry-standard protocol that can enable full-duplex streaming in browser-based applications, over encrypted HTTPS/WSS transport.


©2019-2022 Mod9 Technologies (Engine 1.9.5 : Python SDK 1.11.6)