LibriSpeech demo

Hint

Please first refer to Installation to install sherpa before proceeding.

In this section, we demonstrate how to use sherpa for offline ASR using a Conformer transducer model trained on the LibriSpeech dataset.

Download the pre-trained model

The pre-trained model is in a git repository hosted on huggingface.

Since the pre-trained model is over 10 MB and is managed by git LFS, you have to first install git-lfs before you continue.

On Ubuntu, you can install git-lfs using

sudo apt-get install git-lfs

Hint

If you are using other operating systems, please refer to https://git-lfs.github.com/ for how to install git-lfs on your system.

After installing git-lfs, we are ready to download the pre-trained model:

git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13

Caution

It is important that you do not forget to run git lfs install. Otherwise, you will only get Git LFS pointer files instead of the actual model files, and you will be SAD later.

The 3 most important files you just downloaded are:

$ cd icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/

$ ls -lh exp/*jit*
lrwxrwxrwx 1 kuangfangjun root   10 Jun 17 21:52 exp/cpu_jit-torch-1.10.0.pt -> cpu_jit.pt
-rw-r--r-- 1 kuangfangjun root 326M Jun 18 08:58 exp/cpu_jit-torch-1.6.0.pt
-rw-r--r-- 1 kuangfangjun root 326M May 23 00:05 exp/cpu_jit.pt


$ ls -lh data/lang_bpe_500/bpe.model
-rw-r--r-- 1 kuangfangjun root 240K Mar 12 14:43 data/lang_bpe_500/bpe.model

exp/cpu_jit-torch-1.10.0.pt is a torchscript model exported using torch 1.10, while exp/cpu_jit-torch-1.6.0.pt is exported using torch 1.6.0.

If you are using a version of PyTorch that is older than 1.10, please select exp/cpu_jit-torch-1.6.0.pt. Otherwise, please use exp/cpu_jit-torch-1.10.0.pt.

data/lang_bpe_500/bpe.model is the BPE model that we used during training.
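
To double-check that the downloaded files are usable, you can try loading them. The following is a minimal sketch, not part of sherpa; it assumes you have installed torch and sentencepiece and that you run it from the repository root:

import torch
import sentencepiece as spm

# Load the torchscript model on CPU. Pick the file that matches your PyTorch version.
model = torch.jit.load("exp/cpu_jit-torch-1.6.0.pt", map_location="cpu")
model.eval()

# Load the BPE model that was used during training.
sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe_500/bpe.model")
print("vocab size:", sp.get_piece_size())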

Note

At present, we only implement greedy_search and modified_beam_search for decoding, so you only need a torchscript model file and a bpe.model to start the server.

After we implement fast_beam_search, you can also use an FST-based n-gram LM during decoding.

Start the server

The entry point of the server is sherpa/bin/pruned_transducer_statelessX/offline_server.py.

One thing worth mentioning is that the entry point is a Python script. In sherpa, the server is implemented using asyncio, where IO-bound tasks, such as communicating with clients, are implemented in Python, while CPU-bound tasks, such as neural network computation, are implemented in C++ and are invoked by a pool of threads created and managed by Python.

Note

When a thread calls into C++ from Python, it releases the global interpreter lock (GIL) and regains the GIL just before it returns.

In this way, we can maximize the utilization of multiple CPU cores.
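
The following self-contained sketch illustrates this pattern. It is an illustration only, not sherpa's actual code; cpu_bound_task is just a placeholder for the C++ neural network computation:

import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound_task(batch):
    # Stand-in for the C++ neural network computation. A real extension
    # function releases the GIL while it runs, so the threads in the pool
    # can execute in parallel on different CPU cores.
    time.sleep(0.1)
    return f"decoded a batch of {len(batch)} requests"

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=2) as pool:  # cf. --nn-pool-size
        # IO-bound work stays in asyncio; CPU-bound work is offloaded to the pool.
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, cpu_bound_task, [i]) for i in range(4))
        )
        print(results)

asyncio.run(main())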

To view the usage information of the server, you can use:

$ ./sherpa/bin/pruned_transducer_statelessX/offline_server.py --help

which gives the following output:

Output of ./sherpa/bin/pruned_transducer_statelessX/offline_server.py --help
usage: offline_server.py [-h] [--port PORT] [--num-device NUM_DEVICE]
                         [--max-batch-size MAX_BATCH_SIZE]
                         [--max-wait-ms MAX_WAIT_MS]
                         [--feature-extractor-pool-size FEATURE_EXTRACTOR_POOL_SIZE]
                         [--nn-pool-size NN_POOL_SIZE]
                         [--nn-model-filename NN_MODEL_FILENAME]
                         [--bpe-model-filename BPE_MODEL_FILENAME]
                         [--token-filename TOKEN_FILENAME]
                         [--max-message-size MAX_MESSAGE_SIZE]
                         [--max-queue-size MAX_QUEUE_SIZE]
                         [--max-active-connections MAX_ACTIVE_CONNECTIONS]

optional arguments:
  -h, --help            show this help message and exit
  --port PORT           The server will listen on this port (default: 6006)
  --num-device NUM_DEVICE
                        Number of GPU devices to use. Set it to 0 to use CPU
                        for computation. If positive, then GPUs with ID 0, 1,
                        ..., num_device-1 will be used for computation. You
                        can use the environment variable CUDA_VISIBLE_DEVICES
                        to map available GPU devices. (default: 1)
  --max-batch-size MAX_BATCH_SIZE
                        Max batch size for computation. Note if there are not
                        enough requests in the queue, it will wait for
                        max_wait_ms time. After that, even if there are not
                        enough requests, it still sends the available requests
                        in the queue for computation. (default: 25)
  --max-wait-ms MAX_WAIT_MS
                        Max time in millisecond to wait to build batches for
                        inference. If there are not enough requests in the
                        feature queue to build a batch of max_batch_size, it
                        waits up to this time before fetching available
                        requests for computation. (default: 5)
  --feature-extractor-pool-size FEATURE_EXTRACTOR_POOL_SIZE
                        Number of threads for feature extraction. By default,
                        feature extraction are run on CPU. (default: 5)
  --nn-pool-size NN_POOL_SIZE
                        Number of threads for NN computation and decoding.
                        Note: It should be in general less than or equal to
                        num_device if num_device is positive. (default: 1)
  --nn-model-filename NN_MODEL_FILENAME
                        The torchscript model. You can use icefall/egs/librisp
                        eech/ASR/pruned_transducer_statelessX/export.py
                        --jit=1 to generate this model. (default: None)
  --bpe-model-filename BPE_MODEL_FILENAME
                        The BPE model You can find it in the directory
                        egs/librispeech/ASR/data/lang_bpe_xxx from icefall,
                        where xxx is the number of BPE tokens you used to
                        train the model. Note: Use it only when your model is
                        using BPE. You don't need to provide it if you provide
                        `--token-filename` (default: None)
  --token-filename TOKEN_FILENAME
                        Filename for tokens.txt You can find it in the
                        directory egs/aishell/ASR/data/lang_char/tokens.txt
                        from icefall. Note: You don't need to provide it if
                        you provide `--bpe-model` (default: None)
  --max-message-size MAX_MESSAGE_SIZE
                        Max message size in bytes. The max size per message
                        cannot exceed this limit. (default: 1048576)
  --max-queue-size MAX_QUEUE_SIZE
                        Max number of messages in the queue for each
                        connection. (default: 32)
  --max-active-connections MAX_ACTIVE_CONNECTIONS
                        Maximum number of active connections. The server will
                        refuse to accept new connections once the current
                        number of active connections equals to this limit.
                        (default: 500)
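
To make the interaction of --max-batch-size and --max-wait-ms more concrete, the following sketch shows the batching idea in isolation. It is an illustration only, not sherpa's actual implementation:

import queue
import time

def collect_batch(request_queue, max_batch_size, max_wait_ms):
    # Wait for the first request, then keep collecting until either the batch
    # is full or max_wait_ms has elapsed since the first request arrived.
    batch = [request_queue.get()]
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch_size:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=timeout))
        except queue.Empty:
            break
    return batch

# Example usage with a few pre-filled requests.
q = queue.Queue()
for i in range(3):
    q.put(f"request-{i}")
print(collect_batch(q, max_batch_size=10, max_wait_ms=5))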

The following shows an example of how to use the above pre-trained model to start the server:

Command to start the server using the above pre-trained model
#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES=0

nn_model_filename=./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/exp/cpu_jit-torch-1.6.0.pt
bpe_model=./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/data/lang_bpe_500/bpe.model

sherpa/bin/pruned_transducer_statelessX/offline_server.py \
  --port 6010 \
  --num-device 1 \
  --max-batch-size 10 \
  --max-wait-ms 5 \
  --feature-extractor-pool-size 5 \
  --nn-pool-size 2 \
  --max-active-connections 10 \
  --nn-model-filename $nn_model_filename \
  --bpe-model-filename $bpe_model

When the server is started, you should see output like the following:

Output after starting the server
2022-06-21 18:51:58,424 INFO [offline_server.py:371] started
2022-06-21 18:51:58,426 INFO [server.py:707] server listening on 0.0.0.0:6010
2022-06-21 18:51:58,426 INFO [server.py:707] server listening on [::]:6010
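
Before starting the real client, you can optionally do a quick connectivity check. The following is a minimal sketch, assuming the websockets Python package is installed and the server listens on port 6010 as in the command above:

import asyncio
import websockets

async def check():
    # Open and immediately close a connection to verify the server is reachable.
    async with websockets.connect("ws://localhost:6010"):
        print("server is reachable")

asyncio.run(check())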

Start the client

We also provide a Python script sherpa/bin/pruned_transducer_statelessX/offline_client.py for the client.

./sherpa/bin/pruned_transducer_statelessX/offline_client.py --help

shows the following help information:

Output of ./sherpa/bin/pruned_transducer_statelessX/offline_client.py --help
usage: offline_client.py [-h] [--server-addr SERVER_ADDR] [--server-port SERVER_PORT] sound_files [sound_files ...]

positional arguments:
  sound_files           The input sound file(s) to transcribe. Supported formats are those supported by torchaudio.load(). For
                        example, wav and flac are supported. The sample rate has to be 16kHz.

optional arguments:
  -h, --help            show this help message and exit
  --server-addr SERVER_ADDR
                        Address of the server (default: localhost)
  --server-port SERVER_PORT
                        Port of the server (default: 6006)
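
Note that the sample rate has to be 16 kHz. If your own sound files use a different sample rate, you can resample them before sending them to the server. A minimal sketch using torchaudio, where the file names are placeholders:

import torchaudio

# Load the original file and resample it to 16 kHz if needed.
wave, sample_rate = torchaudio.load("input.wav")
if sample_rate != 16000:
    wave = torchaudio.functional.resample(wave, orig_freq=sample_rate, new_freq=16000)
torchaudio.save("input-16k.wav", wave, 16000)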

We provide some test sound files in the git repository you just cloned. The following command shows you how to start the client:

Start the client and send multiple sound files for recognition
#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES=0

sherpa/bin/pruned_transducer_statelessX/offline_client.py \
  --server-addr localhost \
  --server-port 6010 \
  ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav \
  ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav \
  ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav

You will see the following output from the client side:

Recognition results received by the client
Sending ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav
 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS

Sending ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav
 GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN

Sending ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav
 YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION

while the server-side log shows:

2022-06-21 18:51:58,424 INFO [offline_server.py:371] started
2022-06-21 18:51:58,426 INFO [server.py:707] server listening on 0.0.0.0:6010
2022-06-21 18:51:58,426 INFO [server.py:707] server listening on [::]:6010
2022-06-21 18:54:05,655 INFO [server.py:642] connection open
2022-06-21 18:54:05,655 INFO [offline_server.py:552] Connected: ('127.0.0.1', 33228). Number of connections: 1/10
2022-06-21 18:54:09,391 INFO [offline_server.py:573] Disconnected: ('127.0.0.1', 33228)
2022-06-21 18:54:09,392 INFO [server.py:260] connection closed

Congratulations! You have succeeded in starting the server and client using a pre-trained model with sherpa.

We also provide a colab notebook, offline ASR with LibriSpeech, for you to try this tutorial step by step.

It describes not only how to set up the environment, but also how to compute the WER and RTF on the LibriSpeech test-clean dataset.
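
RTF (real-time factor) is the ratio of the time spent on decoding to the duration of the decoded audio; a value below 1 means decoding runs faster than real time. For example, where the numbers are made up purely for illustration:

# Real-time factor (RTF) = processing time / audio duration.
processing_seconds = 100.0   # time spent decoding
audio_seconds = 2000.0       # total duration of the decoded audio
rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.3f}")    # 0.050, i.e. 20x faster than real time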