The Engine accepts an advanced endpoint-rules request option that can be used to customize the endpointing system for a "recognize" request in default mode.
As an example, we perform a recognition job with a rule to endpoint any time there is 0.5 seconds of consecutive silence when the current utterance is longer than 7 seconds.
curl -sLO mod9.io/SW_4824_B.wav
(echo '{"cmd":"recognize", "endpoint-rules": {"rule1":{"min-utterance-length":7, "min-trailing-silence":0.5}}}'; \
cat SW_4824_B.wav) | nc $HOST $PORT
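The same request can be sketched in Python for readers who prefer a script to a shell pipeline. This is a minimal sketch, not an official client: the helper names `build_request` and `recognize` are hypothetical, and it assumes the Engine replies with one JSON object per line. It sends all of the audio before reading any replies, so it is only suitable for short files.

```python
import json
import socket


def build_request(endpoint_rules):
    """Serialize the initial JSON request line for a "recognize" command."""
    request = {"cmd": "recognize", "endpoint-rules": endpoint_rules}
    # As in the shell example, the JSON is followed by a newline, then the audio.
    return (json.dumps(request) + "\n").encode("utf-8")


def recognize(host, port, wav_path, endpoint_rules):
    """Send the request and audio over TCP, yielding each reply as a dict.

    Assumption: replies are newline-delimited JSON objects.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(endpoint_rules))
        with open(wav_path, "rb") as audio:
            sock.sendall(audio.read())
        sock.shutdown(socket.SHUT_WR)  # no more audio will be sent
        with sock.makefile("r", encoding="utf-8") as replies:
            for line in replies:
                if line.strip():
                    yield json.loads(line)


# The rule from the example: endpoint on 0.5 s of silence after 7 s of utterance.
rules = {"rule1": {"min-utterance-length": 7, "min-trailing-silence": 0.5}}
payload = build_request(rules)
```

Calling `recognize(host, port, "SW_4824_B.wav", rules)` against a running Engine would then iterate over the replies; the host and port are whatever the deployment uses.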
Each time the Engine reads a chunk of audio, it needs to decide whether the utterance it is currently processing has reached an endpoint. Endpointing in Kaldi is implemented as a disjunction (a chained OR) of several endpointing rules.
// Returns a boolean. True if we've reached an endpoint.
// This is called every time the engine reads in a chunk of audio.
EndpointDetected {
  if (rule0.Activated) {
    return true;
  }
  // ... rule1 through rule4 are checked the same way ...
  if (rule5.Activated) {
    return true;
  }
  return false;
}
Internally, the Engine has six rules, rule0 through rule5. All rules share the same structure: each is the same function evaluated with different parameters. Each rule is a conjunction (a chain of ANDs) of several conditions.
// Returns true if this endpointing rule detects an endpoint.
Rule::Activated {
  return
    (contains_nonsilence OR !rule.must_contain_nonsilence) AND
    trailing_silence >= rule.min_trailing_silence AND
    relative_cost <= rule.max_relative_cost AND
    utterance_length >= rule.min_utterance_length AND
    utterance_length <= rule.max_utterance_length;
}
The endpoint-rules are customized by overwriting these parameters.
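The two pseudocode functions above can be modeled in a few lines of Python to experiment with how parameters interact. This is only an illustrative model of the logic as described here, not the Engine's implementation; the `Rule` class, its default values, and `endpoint_detected` are assumptions made for the sketch.

```python
from dataclasses import dataclass
import math


@dataclass
class Rule:
    # Placeholder defaults for this sketch only; not the Engine's real defaults.
    must_contain_nonsilence: bool = False
    min_trailing_silence: float = 0.0
    max_relative_cost: float = math.inf
    min_utterance_length: float = 0.0
    max_utterance_length: float = math.inf

    def activated(self, contains_nonsilence, trailing_silence,
                  relative_cost, utterance_length):
        """Conjunction (AND) of every condition in a single rule."""
        return (
            (contains_nonsilence or not self.must_contain_nonsilence)
            and trailing_silence >= self.min_trailing_silence
            and relative_cost <= self.max_relative_cost
            and self.min_utterance_length <= utterance_length <= self.max_utterance_length
        )


def endpoint_detected(rules, **state):
    """Disjunction (OR) across rules: any activated rule triggers an endpoint."""
    return any(rule.activated(**state) for rule in rules)


# One rule: endpoint on 0.5 s of trailing silence once the utterance is 7 s long.
rules = [Rule(min_utterance_length=7, min_trailing_silence=0.5)]
short = dict(contains_nonsilence=True, trailing_silence=0.6,
             relative_cost=0.0, utterance_length=5.0)
long = dict(contains_nonsilence=True, trailing_silence=0.6,
            relative_cost=0.0, utterance_length=8.0)
print(endpoint_detected(rules, **short))  # False
print(endpoint_detected(rules, **long))   # True
```

The same pause triggers an endpoint only in the second case, because the utterance has passed the rule's minimum length.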
When writing endpointing rules, there are a few useful principles to keep in mind.
- The Engine should endpoint and end a segment during long pauses.
- Longer segments generally are more accurate because they have more audio and language context.
- If the system only outputs long segments, there will be high latency in between messages, which might make the engine feel unresponsive, especially when processing live audio in real time.
- As the utterance gets longer, the internal lattice representing it grows in complexity. It is therefore practical to tolerate progressively shorter pauses as utterances get longer.
- The latency request option will affect endpointing performance; shorter is better.
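One way to apply these principles is to build a "ladder" of rules in which the required pause shrinks as the utterance grows. The sketch below is a hypothetical helper, `pause_ladder`, with illustrative durations that are not the Engine's defaults; it only sets two fields per rule and leaves the others at whatever the Engine already uses.

```python
def pause_ladder(steps):
    """Build an endpoint-rules object from (min_length_s, min_silence_s) pairs.

    Pairs are sorted by utterance length and assigned to rule0, rule1, ...
    so that the required silence shrinks as utterances get longer.
    """
    steps = sorted(steps)
    if len(steps) > 6:
        raise ValueError("only six rules (rule0..rule5) are available")
    silences = [silence for _, silence in steps]
    if silences != sorted(silences, reverse=True):
        raise ValueError("required silence should not grow with utterance length")
    return {
        f"rule{i}": {"min-utterance-length": length, "min-trailing-silence": silence}
        for i, (length, silence) in enumerate(steps)
    }


# Illustrative values only: long pauses early, short pauses for long utterances.
rules = pause_ladder([(0, 1.5), (7, 0.5), (15, 0.2)])
print(rules["rule2"])
```

Rejecting a ladder whose silence grows with length is a design choice of this sketch, meant to catch rule sets that contradict the guidance above.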
Endpoint rules can be passed in as a request option in the initial JSON request when the request command is "recognize" and batch is false.
NOTE: By convention, higher-numbered rules deal with longer utterance lengths.
Field | Type | Description |
---|---|---|
endpoint-rules | object | Additional endpointing options, overriding the defaults and the Engine command line. The accepted keys are "rule0" ... "rule5". Example: {"endpoint-rules": {"rule2": {"min-trailing-silence": 0.1}}} |
The JSON object for each endpoint rule has the following fields:
Field | Type | Description |
---|---|---|
must-contain-nonsilence | boolean | True if the utterance must have a non-silent frame for us to endpoint. |
min-trailing-silence | number | Minimum duration in seconds of consecutive silence at the end of the current utterance. We restart counting once we hit nonsilence. |
max-relative-cost | number or "inf" | A non-negative cost that is 0 if it is extremely likely we are at a final state, and higher the less likely we are to be at a final state. This is primarily used for small grammars. |
min-utterance-length | number | Minimum number of seconds of the utterance for this rule to apply (before min-utterance-length seconds, do not apply this rule). |
max-utterance-length | number | Maximum number of seconds of the utterance for this rule to apply (after max-utterance-length seconds, do not apply this rule). |
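Because a typo in a rule name or field can be hard to spot inside a one-line JSON request, it may help to check the object on the client before sending it. The following is a sketch of such a check based only on the two tables above; `validate_endpoint_rules` is a hypothetical helper, and the Engine's own validation may differ.

```python
FIELD_TYPES = {
    "must-contain-nonsilence": bool,
    "min-trailing-silence": (int, float),
    "max-relative-cost": (int, float, str),
    "min-utterance-length": (int, float),
    "max-utterance-length": (int, float),
}
RULE_NAMES = {f"rule{i}" for i in range(6)}  # "rule0" ... "rule5"


def validate_endpoint_rules(endpoint_rules):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    for name, fields in endpoint_rules.items():
        if name not in RULE_NAMES:
            problems.append(f"unknown rule: {name}")
            continue
        for field, value in fields.items():
            expected = FIELD_TYPES.get(field)
            if expected is None:
                problems.append(f"{name}: unknown field {field}")
            elif isinstance(value, bool) != (expected is bool):
                # bool is a subclass of int in Python, so check it explicitly.
                problems.append(f"{name}.{field}: wrong type")
            elif not isinstance(value, expected):
                problems.append(f"{name}.{field}: wrong type")
            elif isinstance(value, str) and value != "inf":
                problems.append(f'{name}.{field}: the only accepted string is "inf"')
    return problems


print(validate_endpoint_rules({"rule2": {"min-trailing-silence": 0.1}}))  # []
print(validate_endpoint_rules({"rule6": {"min-trailing-silence": 0.1}}))
```

The second call reports `rule6` as unknown, since only six rules exist and they are numbered from zero.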
To implement a hard cut at 40s, we overwrite rule5 (the last of the six rules) so that it is always activated when utterance_length >= 40.
curl -sLO mod9.io/SW_4824_B.wav
(echo '{' \
' "endpoint-rules":' \
' {' \
' "rule5":' \
' { ' \
' "min-utterance-length":40,' \
' "max-utterance-length":100,' \
' "max-relative-cost":"inf",' \
' "min-trailing-silence":0,' \
' "must-contain-nonsilence":false' \
' }' \
' }' \
'}'; cat SW_4824_B.wav) | nc $HOST $PORT
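To see why this rule behaves as a hard cut, the activation logic described earlier can be evaluated directly against these five values. The sketch below is an illustrative model, not Engine code; the `activated` function and the `speaking` state are assumptions made for the example.

```python
import math


def activated(rule, contains_nonsilence, trailing_silence,
              relative_cost, utterance_length):
    """Evaluate one rule, given as a dict keyed by the JSON field names."""
    max_cost = rule["max-relative-cost"]
    max_cost = math.inf if max_cost == "inf" else max_cost
    return (
        (contains_nonsilence or not rule["must-contain-nonsilence"])
        and trailing_silence >= rule["min-trailing-silence"]
        and relative_cost <= max_cost
        and rule["min-utterance-length"] <= utterance_length <= rule["max-utterance-length"]
    )


hard_cut = {
    "min-utterance-length": 40,
    "max-utterance-length": 100,
    "max-relative-cost": "inf",
    "min-trailing-silence": 0,
    "must-contain-nonsilence": False,
}

# Mid-speech state: no trailing silence at all and a high relative cost.
speaking = dict(contains_nonsilence=True, trailing_silence=0.0, relative_cost=50.0)
print(activated(hard_cut, utterance_length=39.9, **speaking))  # False
print(activated(hard_cut, utterance_length=40.0, **speaking))  # True
```

With silence, cost, and nonsilence constraints all neutralized, the utterance length is the only condition left, so the rule fires as soon as the utterance reaches 40 seconds.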
Set the min-utterance-length of each rule to a duration longer than 20.
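Applying one setting to every rule is repetitive to write by hand, so the request object can be generated instead. This is a small sketch with a hypothetical helper, `raise_min_length`; the value 21 is an arbitrary choice of "longer than 20", and all other fields are left for the Engine to fill in.

```python
import json


def raise_min_length(seconds, rule_count=6):
    """Override min-utterance-length on rule0 through rule5."""
    return {f"rule{i}": {"min-utterance-length": seconds} for i in range(rule_count)}


overrides = raise_min_length(21)
print(json.dumps({"endpoint-rules": overrides}))
```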
©2019-2022 Mod9 Technologies (Version 1.9.5)