aishell demo

Hint

Please first refer to Installation to install sherpa before proceeding.

In this section, we demonstrate how to use sherpa for offline ASR using a Conformer transducer model trained on the aishell dataset.

Download the pre-trained model

The pre-trained model is in a git repository hosted on Hugging Face.

Since the pre-trained model is over 10 MB and is managed by git LFS, you have to first install git-lfs before you continue.

On Ubuntu, you can install git-lfs using

sudo apt-get install git-lfs

Hint

If you are using other operating systems, please refer to https://git-lfs.github.com/ for how to install git-lfs on your system.

After installing git-lfs, we are ready to download the pre-trained model:

git lfs install
git clone https://huggingface.co/csukuangfj/icefall-aishell-pruned-transducer-stateless3-2022-06-20

Caution

Do not forget to run git lfs install. Otherwise, only small pointer files will be downloaded instead of the actual model files, and you will be SAD later.

The three most important files you just downloaded are:

$ cd icefall-aishell-pruned-transducer-stateless3-2022-06-20/

$ ls -lh exp/*jit*
-rw-r--r-- 1 kuangfangjun root 390M Jun 20 11:48 exp/cpu_jit-epoch-29-avg-5-torch-1.10.0.pt
-rw-r--r-- 1 kuangfangjun root 390M Jun 20 12:28 exp/cpu_jit-epoch-29-avg-5-torch-1.6.0.pt

$ ls -lh data/lang_char/tokens.txt
-rw-r--r-- 1 kuangfangjun root 38K Jun 20 10:32 data/lang_char/tokens.txt

exp/cpu_jit-epoch-29-avg-5-torch-1.10.0.pt is a torchscript model exported with torch 1.10.0, while exp/cpu_jit-epoch-29-avg-5-torch-1.6.0.pt is exported with torch 1.6.0.

If you are using a version of PyTorch that is older than 1.10, please select exp/cpu_jit-epoch-29-avg-5-torch-1.6.0.pt. Otherwise, please use exp/cpu_jit-epoch-29-avg-5-torch-1.10.0.pt.
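
The following is a small sketch, not part of sherpa itself, showing one way to pick the export that matches your installed PyTorch version and load it; it assumes the repository was cloned into the current directory:

import torch

repo = "./icefall-aishell-pruned-transducer-stateless3-2022-06-20"

# Pick the export matching the running PyTorch version.
# torch.__version__ may look like "1.12.1+cu113", so compare only the
# first two components.
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
if (major, minor) >= (1, 10):
    nn_model = f"{repo}/exp/cpu_jit-epoch-29-avg-5-torch-1.10.0.pt"
else:
    nn_model = f"{repo}/exp/cpu_jit-epoch-29-avg-5-torch-1.6.0.pt"

model = torch.jit.load(nn_model, map_location="cpu")
model.eval()
print("Loaded", nn_model)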

data/lang_char/tokens.txt is a token table containing mappings between tokens (Chinese characters for this model) and token IDs.
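
If you want to inspect it, the sketch below loads the table into a Python dict; it assumes the usual icefall symbol-table layout of one "<token> <id>" pair per line:

# A small sketch for inspecting tokens.txt (not part of sherpa).
repo = "./icefall-aishell-pruned-transducer-stateless3-2022-06-20"
token2id = {}
with open(f"{repo}/data/lang_char/tokens.txt", encoding="utf-8") as f:
    for line in f:
        token, idx = line.split()
        token2id[token] = int(idx)

print(len(token2id), "tokens")
print(list(token2id.items())[:5])  # peek at the first few mappings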

Note

At present, we only implement greedy_search and modified_beam_search for decoding, so you only need a torchscript model file and a tokens.txt to start the server.

After we implement fast_beam_search, you can also use an FST-based n-gram LM during decoding.

Start the server

The entry point of the server is sherpa/bin/pruned_transducer_statelessX/offline_server.py.

One thing worth mentioning is that the entry point is a Python script. In sherpa, the server is implemented using asyncio, where IO-bound tasks, such as communicating with clients, are implemented in Python, while CPU-bound tasks, such as neural network computation, are implemented in C++ and are invoked by a pool of threads created and managed by Python.

Note

When a thread calls into C++ from Python, it releases the global interpreter lock (GIL) and regains the GIL just before it returns.

In this way, we can maximize the utilization of multiple CPU cores.
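
The snippet below is a minimal sketch of this pattern, not code taken from sherpa's source: an asyncio event loop offloads CPU-bound work to a thread pool, and because libtorch releases the GIL inside its C++ kernels, the worker threads can run in parallel while the loop keeps serving IO.

import asyncio
from concurrent.futures import ThreadPoolExecutor

import torch

# CPU-bound work: libtorch releases the GIL while running its C++ kernels,
# so several threads in the pool can execute this function concurrently.
def run_nn(batch: torch.Tensor) -> torch.Tensor:
    return batch @ batch.t()

async def handle_request(loop, pool, batch):
    # The event loop stays free for IO (e.g., talking to clients) while the
    # heavy computation runs in a worker thread.
    return await loop.run_in_executor(pool, run_nn, batch)

async def main():
    loop = asyncio.get_running_loop()
    pool = ThreadPoolExecutor(max_workers=2)  # cf. --nn-pool-size
    batches = [torch.rand(512, 512) for _ in range(4)]
    results = await asyncio.gather(
        *(handle_request(loop, pool, b) for b in batches)
    )
    print([r.shape for r in results])

asyncio.run(main())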

To view the usage information of the server, you can use:

$ ./sherpa/bin/pruned_transducer_statelessX/offline_server.py --help

which gives the following output:

Output of ./sherpa/bin/pruned_transducer_statelessX/offline_server.py --help
usage: offline_server.py [-h] [--port PORT] [--num-device NUM_DEVICE]
                         [--max-batch-size MAX_BATCH_SIZE]
                         [--max-wait-ms MAX_WAIT_MS]
                         [--feature-extractor-pool-size FEATURE_EXTRACTOR_POOL_SIZE]
                         [--nn-pool-size NN_POOL_SIZE]
                         [--nn-model-filename NN_MODEL_FILENAME]
                         [--bpe-model-filename BPE_MODEL_FILENAME]
                         [--token-filename TOKEN_FILENAME]
                         [--max-message-size MAX_MESSAGE_SIZE]
                         [--max-queue-size MAX_QUEUE_SIZE]
                         [--max-active-connections MAX_ACTIVE_CONNECTIONS]

optional arguments:
  -h, --help            show this help message and exit
  --port PORT           The server will listen on this port (default: 6006)
  --num-device NUM_DEVICE
                        Number of GPU devices to use. Set it to 0 to use CPU
                        for computation. If positive, then GPUs with ID 0, 1,
                        ..., num_device-1 will be used for computation. You
                        can use the environment variable CUDA_VISIBLE_DEVICES
                        to map available GPU devices. (default: 1)
  --max-batch-size MAX_BATCH_SIZE
                        Max batch size for computation. Note if there are not
                        enough requests in the queue, it will wait for
                        max_wait_ms time. After that, even if there are not
                        enough requests, it still sends the available requests
                        in the queue for computation. (default: 25)
  --max-wait-ms MAX_WAIT_MS
                        Max time in millisecond to wait to build batches for
                        inference. If there are not enough requests in the
                        feature queue to build a batch of max_batch_size, it
                        waits up to this time before fetching available
                        requests for computation. (default: 5)
  --feature-extractor-pool-size FEATURE_EXTRACTOR_POOL_SIZE
                        Number of threads for feature extraction. By default,
                        feature extraction are run on CPU. (default: 5)
  --nn-pool-size NN_POOL_SIZE
                        Number of threads for NN computation and decoding.
                        Note: It should be in general less than or equal to
                        num_device if num_device is positive. (default: 1)
  --nn-model-filename NN_MODEL_FILENAME
                        The torchscript model. You can use icefall/egs/librisp
                        eech/ASR/pruned_transducer_statelessX/export.py
                        --jit=1 to generate this model. (default: None)
  --bpe-model-filename BPE_MODEL_FILENAME
                        The BPE model You can find it in the directory
                        egs/librispeech/ASR/data/lang_bpe_xxx from icefall,
                        where xxx is the number of BPE tokens you used to
                        train the model. Note: Use it only when your model is
                        using BPE. You don't need to provide it if you provide
                        `--token-filename` (default: None)
  --token-filename TOKEN_FILENAME
                        Filename for tokens.txt You can find it in the
                        directory egs/aishell/ASR/data/lang_char/tokens.txt
                        from icefall. Note: You don't need to provide it if
                        you provide `--bpe-model` (default: None)
  --max-message-size MAX_MESSAGE_SIZE
                        Max message size in bytes. The max size per message
                        cannot exceed this limit. (default: 1048576)
  --max-queue-size MAX_QUEUE_SIZE
                        Max number of messages in the queue for each
                        connection. (default: 32)
  --max-active-connections MAX_ACTIVE_CONNECTIONS
                        Maximum number of active connections. The server will
                        refuse to accept new connections once the current
                        number of active connections equals to this limit.
                        (default: 500)

The following shows an example of how to use the above pre-trained model to start the server:

Command to start the server using the above pre-trained model
#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES=1

nn_model_filename=./icefall-aishell-pruned-transducer-stateless3-2022-06-20/exp/cpu_jit-epoch-29-avg-5-torch-1.6.0.pt
token_filename=./icefall-aishell-pruned-transducer-stateless3-2022-06-20/data/lang_char/tokens.txt

sherpa/bin/pruned_transducer_statelessX/offline_server.py \
  --port 6010 \
  --num-device 1 \
  --max-batch-size 10 \
  --max-wait-ms 5 \
  --feature-extractor-pool-size 5 \
  --nn-pool-size 2 \
  --max-active-connections 10 \
  --nn-model-filename $nn_model_filename \
  --token-filename $token_filename

When the server is started, you should see something like the following:

Output after starting the server
2022-06-21 17:33:10,000 INFO [offline_server.py:371] started
2022-06-21 17:33:10,002 INFO [server.py:707] server listening on 0.0.0.0:6010
2022-06-21 17:33:10,002 INFO [server.py:707] server listening on [::]:6010

Start the client

We also provide a Python script sherpa/bin/pruned_transducer_statelessX/offline_client.py for the client.

./sherpa/bin/pruned_transducer_statelessX/offline_client.py --help

shows the following help information:

Output of ./sherpa/bin/pruned_transducer_statelessX/offline_client.py --help
usage: offline_client.py [-h] [--server-addr SERVER_ADDR] [--server-port SERVER_PORT] sound_files [sound_files ...]

positional arguments:
  sound_files           The input sound file(s) to transcribe. Supported formats are those supported by torchaudio.load(). For
                        example, wav and flac are supported. The sample rate has to be 16kHz.

optional arguments:
  -h, --help            show this help message and exit
  --server-addr SERVER_ADDR
                        Address of the server (default: localhost)
  --server-port SERVER_PORT
                        Port of the server (default: 6006)
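
Note from the help message that the input must be sampled at 16 kHz. If your recordings use a different sample rate, you can resample them first; the sketch below is one way to do it with torchaudio (the file names are placeholders):

import torchaudio

# Placeholder file names; replace them with your own recordings.
wave, sample_rate = torchaudio.load("input-48k.wav")
if sample_rate != 16000:
    wave = torchaudio.functional.resample(
        wave, orig_freq=sample_rate, new_freq=16000
    )
torchaudio.save("input-16k.wav", wave, sample_rate=16000)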

We provide some test sound files in the git repository you just cloned. The following command shows you how to start the client:

Start the client and send multiple sound files for recognition
#!/usr/bin/env bash

sherpa/bin/pruned_transducer_statelessX/offline_client.py \
  --server-addr localhost \
  --server-port 6010 \
  ./icefall-aishell-pruned-transducer-stateless3-2022-06-20/test_wavs/BAC009S0764W0121.wav \
  ./icefall-aishell-pruned-transducer-stateless3-2022-06-20/test_wavs/BAC009S0764W0122.wav \
  ./icefall-aishell-pruned-transducer-stateless3-2022-06-20/test_wavs/BAC009S0764W0123.wav

You will see the following output from the client side:

Recognition results received by the client
Sending ./icefall-aishell-pruned-transducer-stateless3-2022-06-20/test_wavs/BAC009S0764W0121.wav
./icefall-aishell-pruned-transducer-stateless3-2022-06-20/test_wavs/BAC009S0764W0121.wav
 甚至出现交易几乎停滞的情况

Sending ./icefall-aishell-pruned-transducer-stateless3-2022-06-20/test_wavs/BAC009S0764W0122.wav
./icefall-aishell-pruned-transducer-stateless3-2022-06-20/test_wavs/BAC009S0764W0122.wav
 一二线城市虽然也处于调整中

Sending ./icefall-aishell-pruned-transducer-stateless3-2022-06-20/test_wavs/BAC009S0764W0123.wav
./icefall-aishell-pruned-transducer-stateless3-2022-06-20/test_wavs/BAC009S0764W0123.wav
 但因为聚集了过多公共资源

while the server side log is:

2022-06-21 17:33:10,000 INFO [offline_server.py:371] started
2022-06-21 17:33:10,002 INFO [server.py:707] server listening on 0.0.0.0:6010
2022-06-21 17:33:10,002 INFO [server.py:707] server listening on [::]:6010
2022-06-21 17:39:30,148 INFO [server.py:642] connection open
2022-06-21 17:39:30,148 INFO [offline_server.py:552] Connected: ('127.0.0.1', 59558). Number of connections: 1/10
2022-06-21 17:39:33,757 INFO [offline_server.py:573] Disconnected: ('127.0.0.1', 59558)
2022-06-21 17:39:33,758 INFO [server.py:260] connection closed

Congratulations! You have succeeded in starting the server and client using a pre-trained model with sherpa.

We provide a colab notebook (offline asr with aishell) for you to try this tutorial step by step.

It describes not only how to set up the environment, but also how to compute the WER and RTF on the aishell test dataset.
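
As a rough illustration of what those metrics mean (this is not the notebook's code), WER is the number of substitutions, deletions, and insertions divided by the number of reference tokens, and RTF is processing time divided by audio duration:

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (or match)
            )
    return dp[-1]

# WER = (substitutions + deletions + insertions) / number of reference tokens
ref = list("甚至出现交易几乎停滞的情况")
hyp = list("甚至出现交易几乎停滞的情况")
wer = edit_distance(ref, hyp) / len(ref)

# RTF = processing time / audio duration
rtf = 0.85 / 4.2  # e.g., 0.85 s to decode a 4.2 s utterance (made-up numbers)
print(f"WER: {wer:.2%}, RTF: {rtf:.2f}")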