Streaming WebSocket server and client
Hint
Please refer to Installation to install sherpa-onnx before you read this section.
Build sherpa-onnx with WebSocket support
By default, it will generate the following binaries after Installation:
sherpa-onnx fangjun$ ls -lh build/bin/*websocket*
-rwxr-xr-x 1 fangjun staff 1.1M Mar 31 22:09 build/bin/sherpa-onnx-offline-websocket-server
-rwxr-xr-x 1 fangjun staff 1.0M Mar 31 22:09 build/bin/sherpa-onnx-online-websocket-client
-rwxr-xr-x 1 fangjun staff 1.2M Mar 31 22:09 build/bin/sherpa-onnx-online-websocket-server
Please refer to Non-streaming WebSocket server and client for the usage of sherpa-onnx-offline-websocket-server.
View the server usage
Before starting the server, let us view the help message of sherpa-onnx-online-websocket-server:
build/bin/sherpa-onnx-online-websocket-server
The above command will print the following help information:
Automatic speech recognition with sherpa-onnx using websocket.
Usage:
./bin/sherpa-onnx-online-websocket-server --help
./bin/sherpa-onnx-online-websocket-server \
--port=6006 \
--num-work-threads=5 \
--tokens=/path/to/tokens.txt \
--encoder=/path/to/encoder.onnx \
--decoder=/path/to/decoder.onnx \
--joiner=/path/to/joiner.onnx \
--log-file=./log.txt \
--max-batch-size=5 \
--loop-interval-ms=10
Please refer to
https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
for a list of pre-trained models to download.
Options:
--max-batch-size : Max batch size for recognition. (int, default = 5)
--loop-interval-ms : It determines how often the decoder loop runs. (int, default = 10)
--max-active-paths : beam size used in modified beam search. (int, default = 4)
--decoding-method : Decoding method; currently greedy_search and modified_beam_search are supported. (string, default = "greedy_search")
--rule3-min-utterance-length : This endpointing rule3 requires utterance-length (in seconds) to be >= this value. (float, default = 20)
--rule3-min-trailing-silence : This endpointing rule3 requires duration of trailing silence (in seconds) to be >= this value. (float, default = 0)
--rule3-must-contain-nonsilence : If True, for this endpointing rule3 to apply there must be nonsilence in the best-path traceback. For decoding, a non-blank token is considered as non-silence (bool, default = false)
--rule2-min-utterance-length : This endpointing rule2 requires utterance-length (in seconds) to be >= this value. (float, default = 0)
--rule1-min-trailing-silence : This endpointing rule1 requires duration of trailing silence (in seconds) to be >= this value. (float, default = 2.4)
--feat-dim : Feature dimension. Must match the one expected by the model. (int, default = 80)
--rule1-must-contain-nonsilence : If True, for this endpointing rule1 to apply there must be nonsilence in the best-path traceback. For decoding, a non-blank token is considered as non-silence (bool, default = false)
--enable-endpoint : True to enable endpoint detection. False to disable it. (bool, default = true)
--num_threads : Number of threads to run the neural network (int, default = 2)
--debug : true to print model information while loading it. (bool, default = false)
--port : The port on which the server will listen. (int, default = 6006)
--num-io-threads : Thread pool size for network connections. (int, default = 1)
--rule2-must-contain-nonsilence : If True, for this endpointing rule2 to apply there must be nonsilence in the best-path traceback. For decoding, a non-blank token is considered as non-silence (bool, default = true)
--joiner : Path to joiner.onnx (string, default = "")
--tokens : Path to tokens.txt (string, default = "")
--num-work-threads : Thread pool size for neural network computation and decoding. (int, default = 3)
--encoder : Path to encoder.onnx (string, default = "")
--sample-rate : Sampling rate of the input waveform. Note: You can have a different sample rate for the input waveform. We will do resampling inside the feature extractor (int, default = 16000)
--rule2-min-trailing-silence : This endpointing rule2 requires duration of trailing silence (in seconds) to be >= this value. (float, default = 1.2)
--log-file : Path to the log file. Logs are appended to this file (string, default = "./log.txt")
--rule1-min-utterance-length : This endpointing rule1 requires utterance-length (in seconds) to be >= this value. (float, default = 0)
--decoder : Path to decoder.onnx (string, default = "")
Standard options:
--config : Configuration file to read (this option may be repeated) (string, default = "")
--help : Print out usage message (bool, default = false)
--print-args : Print the command line arguments (to stderr) (bool, default = true)
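The three endpointing rules above all share the same shape: a rule fires once the trailing silence and the utterance length both reach its thresholds, and (if required) the best decoding path contains non-silence. The following is a minimal Python sketch of that logic using the default values from the help message; the function and class names are illustrative, not the actual C++ implementation:

```python
from dataclasses import dataclass

@dataclass
class EndpointRule:
    must_contain_nonsilence: bool
    min_trailing_silence: float   # seconds
    min_utterance_length: float   # seconds

def rule_fires(rule, contains_nonsilence, trailing_silence, utterance_length):
    """A rule fires only if all of its conditions hold."""
    if rule.must_contain_nonsilence and not contains_nonsilence:
        return False
    return (trailing_silence >= rule.min_trailing_silence
            and utterance_length >= rule.min_utterance_length)

# Defaults from the help message above
rule1 = EndpointRule(False, 2.4, 0)   # long silence, speech not required
rule2 = EndpointRule(True, 1.2, 0)    # shorter silence after some speech
rule3 = EndpointRule(False, 0, 20)    # very long utterance, no silence needed

def is_endpoint(contains_nonsilence, trailing_silence, utterance_length):
    # An endpoint is detected as soon as any rule fires
    return any(rule_fires(r, contains_nonsilence, trailing_silence, utterance_length)
               for r in (rule1, rule2, rule3))

print(is_endpoint(True, 1.5, 5.0))   # rule2: speech seen, 1.5 s >= 1.2 s of silence
print(is_endpoint(False, 3.0, 3.0))  # rule1: 3.0 s >= 2.4 s of silence
```

Because `is_endpoint` is a disjunction of the rules, tightening one rule (e.g. raising --rule2-min-trailing-silence) never disables the others.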
Start the server
Hint
Please refer to Pre-trained models for a list of pre-trained models.
./build/bin/sherpa-onnx-online-websocket-server \
--port=6006 \
--num-work-threads=3 \
--num-io-threads=2 \
--tokens=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/tokens.txt \
--encoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/encoder-epoch-99-avg-1.onnx \
--decoder=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/decoder-epoch-99-avg-1.onnx \
--joiner=./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/joiner-epoch-99-avg-1.onnx \
--log-file=./log.txt \
--max-batch-size=5 \
--loop-interval-ms=20
Caution
The arguments are in the form --key=value. It does not support --key value.
Hint
In the above demo, the model files are from csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 (Bilingual, Chinese + English).
Note
Note that the server supports processing multiple clients in a batch in parallel.
You can use --max-batch-size to limit the batch size.
View the usage of the client (C++)
Let us view the usage of the C++ WebSocket client:
./build/bin/sherpa-onnx-online-websocket-client
The above command will print the following help information:
[I] /Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:484:int sherpa_onnx::ParseOptions::Read(int, const char *const *) ./build/bin/sherpa-onnx-online-websocket-client
[I] /Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:525:void sherpa_onnx::ParseOptions::PrintUsage(bool) const
Automatic speech recognition with sherpa-onnx using websocket.
Usage:
./bin/sherpa-onnx-online-websocket-client --help
./bin/sherpa-onnx-online-websocket-client \
--server-ip=127.0.0.1 \
--server-port=6006 \
--samples-per-message=8000 \
--seconds-per-message=0.2 \
/path/to/foo.wav
It supports only wave files with a single channel, 16kHz sample rate, 16-bit samples.
Options:
--seconds-per-message : We will simulate that each message takes this number of seconds to send. If you select a very large value, it will take a long time to send all the samples (float, default = 0.2)
--samples-per-message : Send this number of samples per message. (int, default = 8000)
--sample-rate : Sample rate of the input wave. Should be the one expected by the server (int, default = 16000)
--server-port : Port of the websocket server (int, default = 6006)
--server-ip : IP address of the websocket server (string, default = "127.0.0.1")
Standard options:
--help : Print out usage message (bool, default = false)
--print-args : Print the command line arguments (to stderr) (bool, default = true)
--config : Configuration file to read (this option may be repeated) (string, default = "")
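Note how --samples-per-message and --seconds-per-message interact: with the defaults, each message carries 8000 samples (0.5 s of 16 kHz audio) but is sent every 0.2 s, so the client streams audio faster than real time. A quick sketch of the arithmetic, using only the default values shown above:

```python
sample_rate = 16000          # --sample-rate
samples_per_message = 8000   # --samples-per-message
seconds_per_message = 0.2    # --seconds-per-message

# Seconds of audio carried by each message
audio_per_message = samples_per_message / sample_rate
# How much faster than real time the audio is streamed
speedup = audio_per_message / seconds_per_message

print(audio_per_message)  # 0.5
print(speedup)            # 2.5

# Time needed to stream a 10-second file: 20 messages, 0.2 s apart
num_messages = 10 * sample_rate / samples_per_message
print(num_messages * seconds_per_message)  # 4.0
```

Increasing --seconds-per-message slows the simulated stream down toward (or past) real time; it does not change how much audio each message carries.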
Caution
We only support using an IP address for --server-ip.

For instance, please don’t use --server-ip=localhost; use --server-ip=127.0.0.1 instead.
Start the client (C++)
To start the C++ WebSocket client, use:
build/bin/sherpa-onnx-online-websocket-client \
--seconds-per-message=0.1 \
--server-port=6006 \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/0.wav
Since the server is able to process multiple clients at the same time, you can use the following command to start multiple clients:
for i in $(seq 0 10); do
k=$(expr $i % 5)
build/bin/sherpa-onnx-online-websocket-client \
--seconds-per-message=0.1 \
--server-port=6006 \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/${k}.wav &
done
wait
echo "done"
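The shell loop above starts 11 clients and cycles through test_wavs/0.wav to test_wavs/4.wav via the modulo. An equivalent sketch in Python (subprocess-based; the binary and wave paths are the ones used above, and the existence check is only there so the sketch degrades gracefully when the binary is absent):

```python
import os
import subprocess

wav_dir = "./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs"

cmds = []
for i in range(11):  # same as: for i in $(seq 0 10)
    k = i % 5        # same as: k=$(expr $i % 5)
    cmds.append([
        "build/bin/sherpa-onnx-online-websocket-client",
        "--seconds-per-message=0.1",
        "--server-port=6006",
        f"{wav_dir}/{k}.wav",
    ])

if os.path.exists(cmds[0][0]):
    # Launch all clients concurrently, then wait (like `&` followed by `wait`)
    procs = [subprocess.Popen(c) for c in cmds]
    for p in procs:
        p.wait()
    print("done")
```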
View the usage of the client (Python)
Use the following command to view the usage:
python3 ./python-api-examples/online-websocket-client-decode-file.py --help
Hint
online-websocket-client-decode-file.py is from
https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/online-websocket-client-decode-file.py
It will print:
usage: online-websocket-client-decode-file.py [-h] [--server-addr SERVER_ADDR]
[--server-port SERVER_PORT]
[--samples-per-message SAMPLES_PER_MESSAGE]
[--seconds-per-message SECONDS_PER_MESSAGE]
sound_file
positional arguments:
sound_file The input sound file. Must be wave with a single
channel, 16kHz sampling rate, 16-bit of each sample.
optional arguments:
-h, --help show this help message and exit
--server-addr SERVER_ADDR
Address of the server (default: localhost)
--server-port SERVER_PORT
Port of the server (default: 6006)
--samples-per-message SAMPLES_PER_MESSAGE
Number of samples per message (default: 8000)
--seconds-per-message SECONDS_PER_MESSAGE
                      The simulated interval, in seconds, between two
                      consecutive messages (default: 0.1)
Hint
For the Python client, you can use either a domain name or an IP address for --server-addr. For instance, you can use either --server-addr localhost or --server-addr 127.0.0.1.

For the input argument, you can use either --key=value or --key value.
Start the client (Python)
python3 ./python-api-examples/online-websocket-client-decode-file.py \
--server-addr localhost \
--server-port 6006 \
--seconds-per-message 0.1 \
./sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20/test_wavs/4.wav
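Conceptually, the client reads the wave file and sends its samples in fixed-size chunks, pausing --seconds-per-message between messages. A simplified, stdlib-only sketch of the chunking step (illustrative only; the actual client code is in the script linked above):

```python
def chunk_samples(samples, samples_per_message=8000):
    """Split a flat list of audio samples into per-message chunks.

    The last chunk may be shorter than samples_per_message if the
    file length is not an exact multiple of the chunk size.
    """
    return [samples[i:i + samples_per_message]
            for i in range(0, len(samples), samples_per_message)]

# Simulate 2.5 s of 16 kHz audio (40000 samples)
samples = [0.0] * 40000
chunks = chunk_samples(samples)
print(len(chunks))       # 5
print(len(chunks[-1]))   # 8000

# A file that is not a multiple of the chunk size leaves a short tail
print([len(c) for c in chunk_samples([0.0] * 10000)])  # [8000, 2000]
```

In the real client each chunk is sent over the WebSocket connection, with a sleep of --seconds-per-message between sends to simulate streaming.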
Start the client (Python, with microphone)
python3 ./python-api-examples/online-websocket-client-microphone.py \
--server-addr localhost \
--server-port 6006
online-websocket-client-microphone.py is from
https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/online-websocket-client-microphone.py