The Mod9 ASR REST API is a higher-level interface than the protocol described in the TCP reference documentation. Designed as a compatible drop-in replacement for the Google Cloud STT REST API, it also extends functionality beyond that offered by Google. It is a lightweight wrapper built using open-source libraries such as Flask-RESTful.
To use the Mod9 ASR REST API deployed on a publicly accessible evaluation server:
curl https://mod9.io/rest/api/speech:recognize -H 'Content-Type: application/json' \
-d '{"audio":{"uri":"https://mod9.io/hi.wav"},"config":{"enableAutomaticPunctuation":true}}'
Note that this does not require authentication, and the audio URI can start with https:// (not limited to gs://).
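The same request can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the evaluation server URL and JSON fields shown in the curl example above. The helper names build_recognize_request and recognize are hypothetical and are not part of the Mod9 ASR Python SDK.

```python
import json
import urllib.request

# Endpoint of the public evaluation server, as used in the curl example above.
REST_API = "https://mod9.io/rest/api"


def build_recognize_request(audio_uri, config=None, rest_api=REST_API):
    """Build (but do not send) a POST request for the synchronous endpoint.

    Hypothetical helper for illustration; not part of the Mod9 ASR Python SDK.
    """
    body = {"audio": {"uri": audio_uri}, "config": config or {}}
    return urllib.request.Request(
        rest_api + "/speech:recognize",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recognize(audio_uri, config=None, rest_api=REST_API):
    """Send the request and return the decoded JSON response (needs network)."""
    request = build_recognize_request(audio_uri, config, rest_api)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, calling recognize("https://mod9.io/hi.wav", {"enableAutomaticPunctuation": True}) should correspond to the curl command above, provided the evaluation server is reachable.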
This also implements unique functionality that is not supported by Google Cloud STT:
curl https://mod9.io/rest/api/speech:recognize -H 'Content-Type: application/json' \
-d '{"audio":{"uri":"https://mod9.io/hi.wav"},"config":{"maxPhraseAlternatives":3}}'
Audio files of any duration can be processed as a synchronous request, with exceptional speed:
time curl https://mod9.io/rest/api/speech:recognize -H 'Content-Type: application/json' \
-d '{"audio":{"uri":"https://mod9.io/SW_4824_B.wav"},"config":{"languageCode":"en-us"}}'
Easily run the server locally on your own machine:
docker run -d --name=mod9-asr -p 8080:80 mod9/asr http-engine
curl localhost:8080/rest/api/operations/
docker rm -f mod9-asr
Or on a remote server secured over HTTPS (free SSL certificate registration included):
docker run -it --rm -p 80:80 -p 443:443 mod9/asr https-engine
As with the Mod9 ASR Python SDK, there are some notable differences with respect to Google's service:
- Google's audio.uri only allows files to be retrieved from Google Cloud Storage.
  The Mod9 ASR REST API accepts audio from more diverse sources:

  URI Scheme | Access files stored ... |
  ---|---|
  gs:// | in Google Cloud Storage, |
  s3:// | as AWS S3 objects, |
  http:// or https:// | via arbitrary HTTP services, |
  file:// | or on a local filesystem. |

- Google only accepts audio files of 60 seconds or less using the audio.content input field.
  The Mod9 ASR REST API does not limit the duration of audio inputs.
- Google's synchronous /speech:recognize endpoint only accepts audio files of 60 seconds or less.
  The Mod9 ASR REST API does not limit the duration of audio accepted at the synchronous endpoint.
- Google supports a large number of languages for a variety of acoustic conditions.
  Mod9 ASR packages over 50 models for about 20 languages and dialects -- or bring your own models.
The Mod9 ASR REST API supports a subset of Google's functionality, while also extending some unique features.
The configuration options supported are tabulated below:
Option in config | Accepted Google-compatible values | Extended Mod9 support |
---|---|---|
asr_model | N/A | Select from loaded models |
audio_channel_count [1] | N/A | Integer |
enable_automatic_punctuation [2] | False, True | |
enable_separate_recognition_per_channel [3] | N/A | True |
enable_word_confidence | False, True | |
enable_word_time_offsets | False, True | |
encoding | "LINEAR16", "MULAW" | "ALAW", "LINEAR24", "LINEAR32", "FLOAT32" |
language_code | (~20 languages/dialects) | |
latency [4] | N/A | 0.01, ..., 3.0 |
max_alternatives [5] | 0, ..., 1000 | |
max_phrase_alternatives [6] | N/A | 1, ..., 10000 |
max_word_alternatives [7] | N/A | 1, ..., 10000 |
model | "video", "phone_call", "default" | |
intervals_json [8] | N/A | "[[Number, Number], …]" |
options_json [9] | N/A | "{…}" |
sample_rate_hertz | 8000, ..., 48000 | |
speed [10] | N/A | 1, ..., 9 |
[1] Mod9 ASR: this is optional for non-raw audio. Internally, the Engine has a restriction on the number of channels.
[2] Mod9 ASR: enabling punctuation also applies capitalization and number formatting.
[3] Mod9 ASR: default is True, and Mod9 does not support a value of False, wherein only the first channel is recognized.
[4] Mod9 ASR: lower values may improve responsiveness, higher values may decrease CPU usage; default is 0.24 seconds.
[5] Google STT: only allows up to 30 transcript-level alternatives (i.e. N-best) to be requested, but often results in fewer.
[6] Mod9 ASR: more useful representation of ambiguity in speech, as short sequences of many-to-many word mappings.
[7] Mod9 ASR: a more compact representation, but restricted as one-to-one word mappings. (cf. IBM Watson STT API)
[8] Mod9 ASR: provide a speech segmentation, useful for ensuring that results are aligned with speaker turns.
[9] Mod9 ASR: arbitrary request options to the Mod9 ASR Engine may be specified to override or extend functionality.
[10] Mod9 ASR: lower values may improve recognition alternatives, higher values may decrease CPU usage; default is 5.
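The ranges tabulated above can be checked on the client before a request is sent. The following is a minimal sketch of such a pre-check, not the server's own validation logic: the function name check_config is hypothetical, the option names are the ones listed in the table, and treating each range as inclusive at both ends is an assumption.

```python
# Numeric ranges copied from the table above as (minimum, maximum).
# Assumption: both ends of each range are inclusive.
NUMERIC_RANGES = {
    "latency": (0.01, 3.0),
    "max_alternatives": (0, 1000),
    "max_phrase_alternatives": (1, 10000),
    "max_word_alternatives": (1, 10000),
    "sample_rate_hertz": (8000, 48000),
    "speed": (1, 9),
}

# Encodings listed in the table (Google-compatible plus Mod9 extensions).
ENCODINGS = {"LINEAR16", "MULAW", "ALAW", "LINEAR24", "LINEAR32", "FLOAT32"}


def check_config(config):
    """Return a list of human-readable problems found in a config dict.

    Hypothetical client-side helper; an empty list means no problem was found.
    """
    problems = []
    for option, (low, high) in NUMERIC_RANGES.items():
        if option in config and not low <= config[option] <= high:
            problems.append(f"{option}={config[option]} is outside [{low}, {high}]")
    if "encoding" in config and config["encoding"] not in ENCODINGS:
        problems.append(f"encoding={config['encoding']!r} is not listed")
    # Footnote [3]: a value of False is not supported by Mod9.
    if config.get("enable_separate_recognition_per_channel") is False:
        problems.append("enable_separate_recognition_per_channel=False is unsupported")
    return problems
```

A check like this only catches values outside the documented ranges; the server remains the authority on what it accepts.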
The REST API can be deployed using the Mod9 ASR Docker image (recommended) or as a standalone Python Flask app.
For local development or when encrypted transport is not required, the REST API can be served on localhost, with port 8080 forwarded to the Docker container's port 80, for example:
docker run -it --rm -p 8080:80 mod9/asr http-engine
For the example usage section that follows, define an environment variable (in another terminal window):
REST_API=http://localhost:8080/rest/api
To deploy on a remotely accessible server, the REST API should generally be deployed over encrypted HTTPS:
docker run -it --rm -p 80:80 -p 443:443 mod9/asr https-engine
This will interactively register a free SSL certificate issued by Let's Encrypt.
Assuming you control the global DNS record of example.com, define an environment variable (in another terminal window) for later use in the example usage:
REST_API=https://example.com/rest/api
Beware of rate limits if certificates are interactively re-registered frequently, such as during active development.
For non-interactive deployment, the SSL credentials could also be cached and mounted from the host filesystem instead:
docker run -d -p 443:443 -v /etc/letsencrypt:/etc/letsencrypt mod9/asr https-engine
Note that the https-engine entrypoint also runs a certbot daemon that will automatically renew the certificate.
For a more scalable production deployment, a better recommendation is to use a load balancer with SSL termination.
If deployed in a Docker container, the Mod9 ASR REST API is managed via an Apache webserver and WSGI middleware.
This should robustly load balance across request-handling threads, and (re)start processes for graceful error recovery.
Running the REST API more directly as a standalone application is not generally recommended for deployment, though it might be convenient in certain situations: for example, when developing extensions to this open-source Python Flask app to add domain-specific functionality such as user authentication or a persistent datastore for the operations endpoint.
Install the Mod9 ASR Python SDK, which includes the REST API as a Flask app:
pip3 install mod9-asr
The REST API must connect to an ASR Engine server to transcribe audio.
It may be most expedient to use the evaluation server running at mod9.io:
export ASR_ENGINE_HOST=mod9.io
However, because this TCP transport is unencrypted and traverses the public Internet, customers are strongly advised that sensitive data should not be sent to this evaluation server. No data privacy is implied, nor service level promised.
The ASR Engine can also be run locally on bare-metal Linux, or in a Docker container. See installation instructions.
The REST API server can be launched using a script that is installed during the pip3 installation. The server connects to a host and port that are controlled either by command-line options, as demonstrated below, or by the environment variables ASR_ENGINE_HOST and ASR_ENGINE_PORT, respectively (with defaults of localhost and 9900).
For example, if the ASR Engine is running locally, exposed at port 9900, the standalone Flask app can be launched as:
mod9-asr-rest-api
Or to connect to the mod9.io evaluation server:
mod9-asr-rest-api --host=mod9.io
This REST API will listen by default at 127.0.0.1:8080.
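Before launching the REST API, it can be helpful to confirm that the configured ASR Engine is reachable. The following is a minimal sketch that reads the two environment variables named above, with the documented defaults, and attempts a plain TCP connection. The function names are hypothetical, and this is not how the mod9-asr-rest-api script itself resolves its settings.

```python
import os
import socket


def engine_address(environ=os.environ):
    """Resolve the ASR Engine host and port from environment variables.

    Falls back to the documented defaults of localhost and 9900.
    """
    host = environ.get("ASR_ENGINE_HOST", "localhost")
    port = int(environ.get("ASR_ENGINE_PORT", "9900"))
    return host, port


def engine_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the Engine can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, engine_is_reachable(*engine_address()) returns False if nothing is listening at the resolved address, which suggests starting the Engine or adjusting the environment variables first.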
Sensitive data should not be posted to the mod9.io Engine evaluation server as no attempt is made to provide data privacy or transport encryption. Additionally, as this is a service provided for convenience of evaluation to Mod9 customers, there is no SLA.
The REST API can be configured to retrieve audio from a subset of the available URI schemes. By default, none of the schemes (file://, gs://, http://, https://, and s3://) are allowed. URI schemes are individually allowed by passing corresponding flags at runtime. E.g., to only allow http:// and https://, use:
mod9-asr-rest-api --allow-http-uri --allow-https-uri
Schemes can also be allowed by passing a comma-delimited string via the environment variable ASR_REST_API_ALLOWED_URI_SCHEMES, e.g. to enable the file://, gs://, and s3:// schemes:
ASR_REST_API_ALLOWED_URI_SCHEMES=file,gs,s3 mod9-asr-rest-api
The behaviors of the file://, gs://, and s3:// schemes are worth clarifying because they may be unexpected. These schemes are resolved from the perspective of the REST API server, not the client. For example, passing a URI of file:///path/to/audio.wav refers to a file on the REST API server host, not the client's machine. Similarly, gs:// and s3:// require the REST API server host to have Google Cloud or AWS authorization configured to access the given files; any authorization that the client has will not be used, which differs from the behavior of the Google Cloud STT service. These URIs can be particularly useful, then, when running a REST API on the same machine that client requests are sent from, or when the client and remote server have the same level of cloud service authorizations.
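The allow-list behavior described above can be sketched as follows. This is an illustration of the idea rather than the REST API's actual implementation: the function names are hypothetical, and only the environment variable name and its comma-delimited format come from the text above.

```python
import os
from urllib.parse import urlparse


def allowed_schemes(environ=os.environ):
    """Parse the comma-delimited allow-list; an unset variable allows nothing."""
    raw = environ.get("ASR_REST_API_ALLOWED_URI_SCHEMES", "")
    return {s.strip().lower() for s in raw.split(",") if s.strip()}


def is_uri_allowed(uri, schemes):
    """Return True if the URI's scheme appears in the allow-list."""
    return urlparse(uri).scheme.lower() in schemes
```

With ASR_REST_API_ALLOWED_URI_SCHEMES=file,gs,s3, a check like this would accept gs://bucket/audio.wav and reject https://mod9.io/hi.wav, mirroring the default-deny behavior of the server.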
To prepare for the example usage section below, define an environment variable for this local REST API:
export REST_API=http://localhost:8080
where 8080 is the default port of the standalone Flask app intended for development purposes.
Google Cloud credentials are needed for comparing between the Google Cloud STT REST API and the Mod9 ASR REST API. Load your own (with STT permissions), or use this: gstt-demo-credentials.json.
These demo credentials are provided by Mod9 for convenience of testing and are shared with multiple customers. Sensitive data should not be sent with the demo credentials. If usage exceeds a relatively low daily quota, the credentials might be temporarily disabled for the rest of the day. The line below will download and activate the demo credentials.
curl -sLO https://mod9.io/gstt-demo-credentials.json
gcloud auth activate-service-account gstt-demo@mod9-demo.iam.gserviceaccount.com --key-file=gstt-demo-credentials.json
The examples below outline usage of the transcription endpoints provided by the REST API.
In the server deployment instructions above, the environment variable REST_API is set according to the chosen setup method.
To instead use the publicly accessible evaluation server, set the environment variable as:
export REST_API=https://mod9.io/rest/api
Each request to the Google Cloud STT REST API (as well as the compatible Mod9 ASR REST API) consists of a JSON object with two top-level fields: config and audio.
Field | Description |
---|---|
config | Metadata about the audio, such as the sample rate, as well as processing options such as whether to output word-level confidence estimates or the time intervals over which each word is spoken. |
audio | Contains either a content field, which must be Base64-encoded bytes, or a uri field specifying an audio file's location. (Google only allows gs://. Mod9 can allow file://, gs://, http://, https://, s3://.) |
Unfortunately, Google restricts the duration of audio accepted at this endpoint to 60 seconds or less. The Mod9 ASR REST API does not restrict the duration of accepted audio at the synchronous endpoint, allowing for simpler access to ASR transcription for use cases that have audio longer than 60 seconds and do not require an asynchronous response.
The following creates a JSON request encoding hello_world.wav, an audio file that should be transcribed as "hello world".
curl -sLO mod9.io/hello_world.wav
echo '{"audio": {"content": "'$(base64 < hello_world.wav | tr -d '\n')'"},
"config": {"sampleRateHertz": 8000, "languageCode": "en-us"}}' > audio-content-request.json
Send this request to the REST API using the following curl command:
curl ${REST_API}/speech:recognize -H 'Content-Type: application/json' -d @audio-content-request.json
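The request body built by the shell commands above can also be assembled in Python. The following is a minimal sketch using the standard library; the function name build_content_request is hypothetical, and the config values default to those used for hello_world.wav above.

```python
import base64
import json


def build_content_request(audio_bytes, sample_rate_hertz=8000, language_code="en-us"):
    """Return a JSON request string with the audio embedded as Base64 text.

    Hypothetical helper; the field names follow the request shown above.
    """
    body = {
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
        "config": {
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
        },
    }
    return json.dumps(body)
```

Reading hello_world.wav in binary mode, passing its bytes to build_content_request, and writing the result to audio-content-request.json should produce a file that can be posted with the same curl command.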
To send a request that makes use of a web-hosted file and the audio.uri field:
echo '{"audio": {"uri": "https://mod9.io/hello_world.wav"},
"config": {"sampleRateHertz": 8000, "languageCode": "en-us"}}' > audio-uri-request.json
curl ${REST_API}/speech:recognize -H 'Content-Type: application/json' -d @audio-uri-request.json
Audio longer than 60 seconds can also be processed. For example, try this 5-minute audio file:
echo '{"audio": {"uri": "https://mod9.io/SW_4824_B.wav"},
"config": {"sampleRateHertz": 8000, "languageCode": "en-us"}}' > long-audio-uri-request.json
curl ${REST_API}/speech:recognize -H 'Content-Type: application/json' -d @long-audio-uri-request.json
Depending on the ASR Engine configured for ${REST_API}, a response could be received in less than 4 seconds: nearly 100x faster than real-time!
Because of this extraordinary speed, the synchronous /speech:recognize endpoint can likely be used for the majority of real-world use cases (e.g. audio less than 1-hour duration), without triggering HTTP or TCP timeouts (typically these are on the order of 30-90s, and applied at various levels of the application or network stack).
With Google Cloud credentials loaded, compare the response of Google to that of the Mod9 ASR REST API above:
curl https://speech.googleapis.com/v1p1beta1/speech:recognize \
-H 'Authorization: Bearer '$(gcloud auth print-access-token) \
-H 'Content-Type: application/json' -d @audio-content-request.json
Since Google does not allow audio URIs other than on Google Storage (gs://), and because their synchronous endpoint limits audio to less than 60 seconds duration, the following request with the 5-minute audio file will fail:
curl https://speech.googleapis.com/v1p1beta1/speech:recognize \
-H 'Authorization: Bearer '$(gcloud auth print-access-token) \
-H 'Content-Type: application/json' -d @long-audio-uri-request.json
The Mod9 ASR REST API wrapper also supports an asynchronous endpoint.
Note that requests to the Google Cloud STT REST API with audio longer than 60 seconds must use the audio.uri format (and the audio must specifically be stored on Google Cloud Storage); the Mod9 ASR REST API wrapper offers the flexibility to use audio.content for audio longer than 60 seconds, or to use the audio.uri interface with broader support for additional URI schemes.
The /speech:longrunningrecognize endpoint has a different response, as well as a different way to retrieve the transcription results. The response to a properly formatted request is an Operation JSON object that contains a name field. The names of asynchronous requests can be listed at the /operations/ endpoint; the status and completed results can be viewed by appending the request name to the /operations/ endpoint as demonstrated below.
The following command modifies the JSON request for the longer 5-minute audio, but now further enables word-level confidence scores (which are also supported by the synchronous endpoint above), while requesting a lower speed setting. These options intentionally slow down processing in order to demonstrate the asynchronous polling:
echo '{"audio": {"uri": "https://mod9.io/SW_4824_B.wav"},
"config": {"sampleRateHertz": 8000, "languageCode": "en-us",
"enableWordConfidence": true, "speed": 1}}' > slow-request.json
Submit a request to the asynchronous endpoint and parse the name field in the response JSON:
name=$(curl -s ${REST_API}/speech:longrunningrecognize \
-H 'Content-Type: application/json' \
-d @slow-request.json | grep '"name"' | sed -E 's,.*"name": "(.*)",\1,')
(This is much nicer using the jq tool, e.g. name=$(curl ... | jq -r .name))
The name can then be polled with the /operations/ endpoint to check on the status of the request:
curl ${REST_API}/operations/$name
This polling request can be repeated periodically until the processing is completed, which may take about 30 seconds for this example, with results finally contained in the subsequent JSON responses.
All the name values of requests submitted since the server was started can be viewed at the /operations/ endpoint.
curl ${REST_API}/operations/
Note: asynchronous results are stored in memory and will not persist if the Mod9 ASR REST API server is restarted.
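The periodic polling described above can be automated. The following is a minimal sketch, assuming that the Operation JSON object carries a boolean done field once processing has finished, as in Google's Operation resource; that field is not confirmed by this document. The function names fetch_operation and wait_for_operation are hypothetical.

```python
import json
import time
import urllib.request


def fetch_operation(rest_api, name):
    """GET the Operation JSON for a request name (needs network access)."""
    with urllib.request.urlopen(f"{rest_api}/operations/{name}") as response:
        return json.load(response)


def wait_for_operation(name, fetch, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch(name) until the Operation reports done, else raise TimeoutError.

    Assumption: a finished Operation has a truthy "done" field.
    """
    waited = 0.0
    while True:
        operation = fetch(name)
        if operation.get("done"):
            return operation
        if waited >= timeout:
            raise TimeoutError(f"operation {name} not done after {timeout} seconds")
        sleep(interval)
        waited += interval
```

A call such as wait_for_operation(name, lambda n: fetch_operation("https://mod9.io/rest/api", n)) would poll every 2 seconds. Passing fetch and sleep as parameters keeps the loop easy to exercise without a server.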
With Google Cloud authentication properly loaded, compare the response of Google to that of the Mod9 ASR REST API above, capturing the name response:
name=$(curl -s https://speech.googleapis.com/v1p1beta1/speech:longrunningrecognize \
-H 'Authorization: Bearer '$(gcloud auth print-access-token) \
-H 'Content-Type: application/json' \
-d @audio-content-request.json | grep '"name"' | sed -E 's,.*"name": "(.*)",\1,')
The status and results (when finished) can be checked:
curl https://speech.googleapis.com/v1p1beta1/operations/$name -H 'Authorization: Bearer '$(gcloud auth print-access-token)
The Mod9 ASR REST API is intended as a starting point for convenient development. For deployment in a production system, it may be necessary to extend the open-source Python Flask app to add domain-specific functionality such as user authentication or a persistent datastore for the operations endpoint.
Our team would be glad to provide assistance in any way: please contact help@mod9.com.
Lastly, note that running this REST API via the Docker image has a side effect of running a Mod9 ASR WebSocket server as well. This addresses a notable deficiency in the Google Cloud STT offering, implementing an industry-standard protocol that can enable full-duplex streaming in browser-based applications, over encrypted HTTPS/WSS transport.
©2019-2022 Mod9 Technologies (Engine 1.9.5 : Python SDK 1.11.6)