This directory contains library code for use in the Engine, internally named mrpboost.
Boost is used extensively in this library. There are some instances where
functionality is available in both the boost namespace and the std namespace. In these cases, we
generally prefer std except when functionality or compatibility depends on the boost version.
As usual with Kaldi library directories, the source files in this directory are compiled into a
library (kaldi-mrpboost.a or libkaldi-mrpboost.so), which is linked into executables that use the
library. The code for the executables that use this library can be found under src/mrpboostbin.
One difference from a standard Kaldi library directory is that we use some additional Makefile rules
stored in extra.mk. These rules ensure that executables that use the library produced by the code
in this directory compile and link correctly. This mostly involves pointing to Boost libraries and
to other third-party libraries that the executable or the library depend on (JSON parsing, sample
rate conversion, TensorFlow, etc.). These third-party libraries should all have been installed as
build dependencies (see MOD9_BUILD_ENGINE.md in the top-level directory).
Note that the .h header files contain API-level documentation, whereas the .cc source files
contain comments related to implementation.
Many of the files in this directory implement requests that a client can send to the Engine
server. This is done by subclassing Command, which can be found in command.{cc,h}. A Command
is created when a client sends a request (a newline-terminated JSON string).
A Command is known as "lightweight" if it runs with minimal resource requirements and should not
count towards the resource limits specified when the Engine starts
(e.g. --limit.threads). Otherwise, the Command is known as "heavy". A Task is a subclass of
Command that is always "heavy" and should be used instead of Command if the request will use
significant resources.
Tasks that need to stream data to and from a client should be subclasses of TaskStreamBase, which
is defined in task-stream-base.{cc,h}. This base class handles the audio data portion of a request
that is streamed from the client, after the request's initial newline-delimited JSON options have
already been read and processed in server.cc.
A subclass of TaskStreamBase should handle anything specific to that subclass, such as converting
WAV-formatted audio files, performing automatic speech recognition, etc.
There are several files that implement the client-requested recognize command. Originally, the
recognize command was known as decode; this is reflected in the current naming for the files
that implement the recognize command.
Separate processing workflows are implemented for online vs. batch operation. Depending on the
options passed with the recognize command, different Task subclasses are invoked:
- If batch-threads is not passed or is set to 0, and batch-intervals is not passed, then the
  command is handled by TaskDecode (defined in task-decode.{cc,h}). This online mode of operation
  is recommended for live streaming audio in real-time.
- If batch-threads != 0 and batch-intervals is not set, then command.cc routes it as
  decode-parallel-vad, which is handled by TaskDecodeParallelVad (defined in
  task-decode-parallel-vad.{cc,h}). This uses Voice Activity Detection from libfvad.
- If batch-intervals is set, then the command is rewritten in command.cc to be
  decode-parallel-intervals, which is handled by TaskDecodeParallelIntervals (defined in
  task-decode-parallel-intervals.{cc,h}). This is for client-provided VAD segmentation.
This diagram shows inheritance among classes that implement the recognize command.
Command
 |
 +-- Task
      |
      +-- TaskStreamBase
           |
           +-- TaskDecodeBase
                |
                +-- TaskDecode
                +-- TaskDecodeParallelBase
                     |
                     +-- TaskDecodeParallelVad
                     +-- TaskDecodeParallelIntervals
This section describes what happens in the Engine in response to a request from a client. The
RAII (Resource Acquisition Is Initialization) pattern is used extensively to ensure resources
(e.g. threads, memory) are freed when no longer needed. The Factory pattern is used to allow
creation of a subclass of Command based on the JSON sent by the client.
Note that various Engine settings are defined in engine-settings.{cc,h} and are in the namespace
Engine. For example, the integer Engine::read_line_limit is the maximum allowable line length
for specifying the initial JSON request options.
When the Engine executable starts, a server object is created using Server::create() (defined in
server.{cc,h}) that runs in a separate thread and begins listening for requests on a socket
(see Server::start_accept()). The main thread primarily listens for signals, such as SIGTERM.
The server also spawns another thread to collect system stats such as CPU and memory usage.
If NLP models are loaded, three other threads will be started by TensorFlow.
Consider a recognize request with batch-threads set to 4:
(echo '{"batch-threads": 4}'; curl -sL https://mod9.io/swbd.wav) | nc $HOST $PORT

Such a request goes through the following sequence of steps.
- The server object asynchronously (via Boost's asio) accepts the new connection request. Each
  request is associated with its own handler object (defined in handler.h) that manages the
  socket, Boost asio's IO context and guard objects, and a buffer object for storing incoming
  bytes.
- If the Engine is not shutting down, then the server first calls Server::start_accept(), where
  it continues to listen for more requests.
- The server object asynchronously reads the initial JSON request string from the socket into
  the handler's buffer. If the handler does not receive the complete JSON request within 60s
  (see Engine::read_line_timeout_ms) or the JSON line is longer than 1MiB
  (Engine::read_line_limit), then the request is terminated. In all cases where a request is
  rejected or does not successfully execute, the server responds with a line reporting
  {"status": "failed"} before closing the connection.
- The server object then calls Command::from() (defined in command.{cc,h}), which determines
  which specific Command or Task is being requested. In the above example, a "command" request
  option is not specified, which implies the default command of "recognize". However, if the
  server has not finished loading models, i.e. it is not ready to accept "heavy" requests, then
  the request is rejected, since "recognize" is considered a "heavy" command. Since the
  batch-threads request option is non-zero, a TaskDecodeParallelVad object named cmd is
  instantiated (see Decode Tasks for more information). The configure() method is called on cmd
  to validate the request options and determine whether the request can execute given the
  Engine's request and thread limits, if any. It also selects the ASR, NLP, and G2P models to
  use, defaulting to the first of each model loaded if the client has not requested a specific
  model. Since the above example involves a WAV-formatted file, a flag is set to indicate that
  the WAV header must be processed. This step is internally referred to as the "preload"
  processing.
- Once cmd is validated, a thread named heavy_thread is spawned to run the task. Control
  proceeds to Server::handle_accept(), which spawns another thread named request to resume IO
  operations and clean up the connection once the request completes. These objects are of type
  boost::thread and not std::thread; an important distinction is that the former are detached in
  their destructors before going out of scope.
- The heavy_thread starts execution in TaskStreamBase::run(). First, a consumer thread is
  created to process the audio. This thread calls TaskDecodeBase::runTask(), which notifies the
  client that the Engine is ready to accept audio, and awaits preload processing.
- The current heavy_thread asynchronously reads (see TaskStreamBase::on_read()) a fixed number
  of bytes (typically 128 bytes, see Engine::read_stream_buf_size) into the preload buffer. Note
  that TaskStreamBase has two separate buffers: a preload buffer to process a WAV-formatted file
  header, and a main buffer to process audio bytes. TaskDecodeBase::processPreload() then
  attempts to process the header and extract information such as the number of audio bytes, the
  number of bytes in a sample (block alignment), the number of channels, etc. If there are
  insufficient bytes in the preload buffer to successfully process the header, more bytes are
  read (up to a maximum of 1MiB, see Engine::riff_header_limit) until the header can be
  processed successfully.
- The heavy_thread will then submit() sample-aligned bytes into an object of type StreamBuffer.
  This class handles synchronization of reads and writes into a queue. It is also used as a
  means of communication between the heavy_thread and consumer thread, by storing whether the
  request is aborted, end-of-stream is reached, or the consumer thread is done reading audio
  (applicable in the case of batch-intervals).
- In parallel, once the WAV header preload processing finishes, the consumer thread determines
  the audio's sample rate and encoding and configures the sample rate conversion object. The
  Engine uses the libsamplerate (SRC) library to perform sample rate conversion; this is only
  needed if the sample rate of the audio does not match the rate used when training the selected
  ASR model. For internal processing, the audio encoding is always converted to 16-bit signed
  integer PCM.
- The consumer thread then calls TaskDecodeParallelVad::runTaskBase(). Here, another thread
  called write_thread is created to send results to the client. Then VAD-related settings are
  configured; the Engine uses the libfvad library for performing VAD. A thread pool is created
  with 4 threads, as requested by the client in the above example. This thread pool performs the
  actual ASR decoding of the prepared audio segments.
- Control then proceeds to a while loop that runs as long as there is audio to be processed. It
  reads data from the StreamBuffer queue as data becomes available. The next() method that
  performs this read also applies sample rate conversion to the data, if needed. In the while
  loop, audio bytes are accumulated until a segment is created (see task-decode-parallel-vad.cc
  for comments explaining how a VAD endpoint is determined). A promise object is created to get
  the result JSON, and the future to this object is stored so that the write_thread can send
  results to the client in order. The audio segment is submitted to the thread pool to be
  decoded via TaskDecodeParallelBase::decodeAudio().
- This decodeAudio() function sets decoder options, creates a decoder object, extracts features
  from the segment's samples, and performs decoding. A result JSON object is created with
  required response fields (e.g. .transcript, .status, etc.) and additional fields as may be
  requested (e.g. word-level timestamps, alternatives, etc.). The JSON-encoded string
  representation of this nlohmann::json result object is stored in the future for the
  write_thread to process.
- If at any point in the asynchronous read process there is a delay of more than 10s (see
  Engine::read_stream_timeout_ms), then the request is aborted and a failure status is sent to
  the client. Otherwise, once heavy_thread finishes reading the audio stream, i.e. eos() is
  reached, it sets a corresponding flag in the StreamBuffer object and waits for the consumer
  thread to finish its processing.
- The consumer thread finishes once all the audio data is decoded and the replies are sent to
  the client by the write_thread. Now that the request is completed, the request thread
  mentioned above closes the connection and performs garbage collection if the Engine was
  compiled with certain memory management libraries (e.g. libtcmalloc).
This concludes the processing of the request. Note that every time a new thread is spawned to handle a new request, Boost logging thread-scope attributes are set so that the log statements are printed with the correct metadata for that request. The request is also associated with both a random UUID as well as an incrementing request number, to facilitate tracing and debugging.
The majority of the changes from core Kaldi are in the directories src/mrp, src/mrpbin,
src/mrpboost, and src/mrpboostbin. The changes to core Kaldi are primarily in wave file handling
(src/feat/wave-reader.{cc,h}) and encryption/compression (src/util/kaldi-io.cc and files
matching src/util/mrp-*).
If you want to extend the Engine with additional Kaldi (or OpenFST) functionality (e.g. lattice
rescoring), the primary files you will have to modify will likely be src/mrpboost/asr.{cc,h} and
src/mrpboost/task-decode-base.{cc,h}. The class ASRConfig (defined in asr.{cc,h}) handles
Kaldi-style argument handling. For example, if a new feature requires a model file, you should add
the command line handling code to ASRConfig, the model file itself to the model directory, and the
command line text (e.g. --rescore-lm=graph/rescore.4gram.lm) to conf/model.conf in the model
directory.
The class ASRInfo (defined in asr.{cc,h}) loads and stores information provided by
ASRConfig. To continue the example of language model rescoring, you should add a variable of type
ConstArpaLm to store the language model file, and add code to the ASRInfo constructor to load
the file that was specified in ASRConfig.
The majority of the code related to recognition is in the files matching
src/mrpboost/task-decode-*.{cc,h}. The post-decoding processing is concentrated in the method
prepare_json_response() in src/mrpboost/task-decode-base.cc. This method takes a decoder object
containing a completed recognition (e.g. it has reached either an endpoint or end of file) and
returns a JSON object to be returned to the client based on the options provided by the client and
passed to prepare_json_response(). For the example of language model rescoring, you would first
make sure the variable need_lattice is set correctly (since language model rescoring is typically
performed on a lattice), and then add the code that actually rescores the lattice shortly after the
call to the Kaldi decoder method decoder.GetLattice().
Some potential enhancements that the Mod9 team has not yet implemented:
- (#722) Rather than abstracting various ASR decoder settings under the speed request option,
  taking a value between 1 and 9, it would be nice for advanced clients to specify the precise
  beams that they desire. This should be subject to operator limits (#721), which should be
  reported by the get-info command (#723).
- (#699) If the partial and endpoint request options are both false, then it doesn't make sense
  for a client to specify a small value for the latency request option; in this case, it should
  be set as high as possible.
- (#673) It should be possible for add-words to modify a copy of a request-specific graph, or
  perhaps a named client-specific graph, rather than modifying a graph that is shared with all
  other clients across their requests.
- (#661) The Engine doesn't implement write timeouts as extensively as it does read timeouts.
- (#531) The code could be improved by always using json explicitly instead of auto.
- (#600) Integrate ffmpeg libraries to support non-WAV audio formats and encodings.
- (#680) Allow a custom grammar to be saved and loaded similarly to ASR models.
- (#690) Allow loading an empty graph to support recognition from just a custom grammar.
- (#659) Support ASR models that do not have position-dependent phones.
- (#639) Use fewer threads for each request, e.g. cooperative multi-tasking or a callback-based
  pattern.
- (#555) Enable the audio-uri request option for downloading remote audio files.
- (#468) Download models from remote URIs at runtime.
- (#679) Add support for language identification.
- (#348) Allow a client-specified transcript to be inserted into a decoded lattice as the 1-best
  path.
- (#492) Handle SIGINT and SIGQUIT (not just SIGTERM).
- (#605) If --models.mutable is disabled, load a constfst without conversion to vectorfst.
- (#565) The get-models-info command should list loadable models if --models.mutable is enabled.
- (#719) The get-models-info command should accept {asr,g2p,nlp}-model as request options.
- (#590) Expose a --models.load-all option to load all available models during Engine startup.
- (#508) Use VAD for non-batch streaming (i.e. task-decode.{cc,h}) to reduce CPU usage during
  silence.
- (#301) The way default values are set in task-decode-base.h is kinda ugly.
- (#163) Allow multiple acoustic models to be loaded, and perform score-level fusion.
- (#677) Accelerate sample rate conversion (see
  https://github.com/libsndfile/libsamplerate/issues/176).
- (#613) Log and/or report usage metrics, to facilitate usage-based licensing.
- (#572) Allow the ~ character in model paths (see
  https://stackoverflow.com/questions/33224941).
- (#724) NLP processing should use a limited-size context for long input strings.
- (#373) Indicate the stability of partial results.
- (#597) Specify a --host=127.0.0.1 flag to only listen for local connections.
- (#48) Enable Unix domain sockets for most efficient local usage.
There are several known issues that the Mod9 team has not been able to address:
- (#533) For strict compliance with C++17, the use of the deprecated std::random_shuffle must be
  removed from the core Kaldi libraries in the upstream project (which is still using C++14).
  While allowed under gcc 9, it prevents the Engine from building with clang 12.
- (#710) The content-length request option is ignored for raw-formatted audio.
- (#675) The handling of cgroups v2 is somewhat limited, so determination of memory limits may
  be incorrect for more complicated setups that require parsing hierarchical cgroups.
- (#627) There is no logging when requests are rejected due to the Engine's inability to accept
  new requests, e.g. if no threads are available or the Engine is shutting down.
- (#696) It is possible for the WAV format to specify additional sub-chunks after the data
  sub-chunk; the Engine will consider these to be unexpected bytes and log a warning.
- (#718) Non-recognize commands should error if unexpected options are requested.
- (#427) Requesting batch-intervals may result in harmless warnings due to rounding.
- (#359) The Engine might set a .warning message multiple times within a single request
  (especially at the start of processing). Each of these assignments will overwrite the previous
  warning, so the user will only see the last one.
- (#254) Requests that attempt to modify a decoding graph may be relatively unlikely to acquire
  a unique lock on the graph's mutex when many other decoding threads are actively holding a
  non-exclusive lock on that same shared mutex.
- (#676) Batch-mode processing with VAD won't work for ASR models with sample rates other than
  8kHz, 16kHz, 32kHz, or 48kHz. This is because libwebrtcvad only supports those specific sample
  rates. The solution is to use libsamplerate to resample accordingly. However, it is very
  unusual to train ASR models at sample rates other than 8kHz or 16kHz.
- (#221) Submitting a 44-byte WAV file (all header, no data) is considered an error. Some ASR
  vendors reference this test file, which should be recognized as zero-duration silence.
- (#581) It is possible to submit a WAV-formatted file to the Engine without any request options
  specified as a preceding line of JSON (i.e. {"command": "recognize"} is implied). This is a
  violation of the Engine's published application protocol, but the functionality is intentional
  and convenient. Consider it an undocumented "Easter egg" :-)