LibriSpeech demo
Hint
Please first refer to Installation to install sherpa before proceeding.
In this section, we demonstrate how to use sherpa for offline ASR using a Conformer transducer model trained on the LibriSpeech dataset.
Download the pre-trained model
The pre-trained model is in a git repository hosted on huggingface.
Since the pre-trained model is over 10 MB and is managed by git LFS, you have to first install git-lfs before you continue.
On Ubuntu, you can install git-lfs using:
sudo apt-get install git-lfs
Hint
If you are using other operating systems, please refer to https://git-lfs.github.com/ for how to install git-lfs on your system.
After installing git-lfs, we are ready to download the pre-trained model:
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13
Caution
It is important that you do not forget to run git lfs install. Otherwise, you will be SAD later.
The 3 most important files you just downloaded are:
$ cd icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/
$ ls -lh exp/*jit*
lrwxrwxrwx 1 kuangfangjun root 10 Jun 17 21:52 exp/cpu_jit-torch-1.10.0.pt -> cpu_jit.pt
-rw-r--r-- 1 kuangfangjun root 326M Jun 18 08:58 exp/cpu_jit-torch-1.6.0.pt
-rw-r--r-- 1 kuangfangjun root 326M May 23 00:05 exp/cpu_jit.pt
$ ls -lh data/lang_bpe_500/bpe.model
-rw-r--r-- 1 kuangfangjun root 240K Mar 12 14:43 data/lang_bpe_500/bpe.model
exp/cpu_jit-torch-1.10.0.pt is a torchscript model exported using torch 1.10.0, while exp/cpu_jit-torch-1.6.0.pt is exported using torch 1.6.0.
If you are using a version of PyTorch that is older than 1.10, please select exp/cpu_jit-torch-1.6.0.pt. Otherwise, please use exp/cpu_jit-torch-1.10.0.pt.
data/lang_bpe_500/bpe.model is the BPE model that we used during training.
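If you want to sanity-check the downloaded files before starting the server, you can load them directly with torch and sentencepiece. The following is a minimal sketch, not part of sherpa itself; it assumes you run it from the directory where you cloned the repository and picks the torchscript model matching your PyTorch version, as described above.
#!/usr/bin/env python3
# Quick sanity check of the downloaded files (illustrative only, not part of sherpa).

import torch
import sentencepiece as spm

repo = "icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13"

# Select the torchscript model that matches the installed PyTorch version.
torch_version = tuple(int(v) for v in torch.__version__.split(".")[:2])
if torch_version >= (1, 10):
    nn_model = f"{repo}/exp/cpu_jit-torch-1.10.0.pt"
else:
    nn_model = f"{repo}/exp/cpu_jit-torch-1.6.0.pt"

model = torch.jit.load(nn_model, map_location="cpu")
print("loaded", nn_model)

# Load the BPE model used during training and inspect its vocabulary size.
sp = spm.SentencePieceProcessor()
sp.load(f"{repo}/data/lang_bpe_500/bpe.model")
print("BPE vocab size:", sp.get_piece_size())  # 500 for lang_bpe_500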
Note
At present, we only implement greedy_search and modified_beam_search for decoding, so you only need a torchscript model file and a bpe.model to start the server. After we implement fast_beam_search, you can also use an FST-based n-gram LM during decoding.
Start the server
The entry point of the server is sherpa/bin/pruned_transducer_statelessX/offline_server.py.
One thing worth mentioning is that the entry point is a Python script. In sherpa, the server is implemented using asyncio, where IO-bound tasks, such as communicating with clients, are implemented in Python, while CPU-bound tasks, such as neural network computation, are implemented in C++ and are invoked by a pool of threads created and managed by Python.
Note
When a thread calls into C++ from Python, it releases the global interpreter lock (GIL) and regains the GIL just before it returns. In this way, we can maximize the utilization of multiple CPU cores.
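To make this concrete, below is a minimal sketch of the pattern, not sherpa's actual code: a coroutine hands the CPU-bound call to a thread pool via run_in_executor, so the event loop stays free to serve other clients. The function run_nn_inference is a made-up placeholder for the C++ computation.
#!/usr/bin/env python3
# Illustrative sketch of the asyncio + thread-pool pattern described above.
# Names such as run_nn_inference() are placeholders, not sherpa's API.

import asyncio
from concurrent.futures import ThreadPoolExecutor

# A pool of threads for CPU-bound work. A C++ extension called from such a
# thread can release the GIL while it computes.
nn_pool = ThreadPoolExecutor(max_workers=2)

def run_nn_inference(samples):
    # Placeholder for the CPU-bound C++ call (e.g., neural network forward).
    return f"decoded {len(samples)} samples"

async def handle_connection(samples):
    loop = asyncio.get_running_loop()
    # The event loop keeps serving other clients while this runs in a thread.
    return await loop.run_in_executor(nn_pool, run_nn_inference, samples)

async def main():
    results = await asyncio.gather(
        handle_connection([0.0] * 16000),
        handle_connection([0.0] * 8000),
    )
    print(results)

asyncio.run(main())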
To view the usage information of the server, you can use:
$ ./sherpa/bin/pruned_transducer_statelessX/offline_server.py --help
which gives the following output:
usage: offline_server.py [-h] [--port PORT] [--num-device NUM_DEVICE]
[--max-batch-size MAX_BATCH_SIZE]
[--max-wait-ms MAX_WAIT_MS]
[--feature-extractor-pool-size FEATURE_EXTRACTOR_POOL_SIZE]
[--nn-pool-size NN_POOL_SIZE]
[--nn-model-filename NN_MODEL_FILENAME]
[--bpe-model-filename BPE_MODEL_FILENAME]
[--token-filename TOKEN_FILENAME]
[--max-message-size MAX_MESSAGE_SIZE]
[--max-queue-size MAX_QUEUE_SIZE]
[--max-active-connections MAX_ACTIVE_CONNECTIONS]
optional arguments:
-h, --help show this help message and exit
--port PORT The server will listen on this port (default: 6006)
--num-device NUM_DEVICE
Number of GPU devices to use. Set it to 0 to use CPU
for computation. If positive, then GPUs with ID 0, 1,
..., num_device-1 will be used for computation. You
can use the environment variable CUDA_VISIBLE_DEVICES
to map available GPU devices. (default: 1)
--max-batch-size MAX_BATCH_SIZE
Max batch size for computation. Note if there are not
enough requests in the queue, it will wait for
max_wait_ms time. After that, even if there are not
enough requests, it still sends the available requests
in the queue for computation. (default: 25)
--max-wait-ms MAX_WAIT_MS
Max time in millisecond to wait to build batches for
inference. If there are not enough requests in the
feature queue to build a batch of max_batch_size, it
waits up to this time before fetching available
requests for computation. (default: 5)
--feature-extractor-pool-size FEATURE_EXTRACTOR_POOL_SIZE
Number of threads for feature extraction. By default,
feature extraction are run on CPU. (default: 5)
--nn-pool-size NN_POOL_SIZE
Number of threads for NN computation and decoding.
Note: It should be in general less than or equal to
num_device if num_device is positive. (default: 1)
--nn-model-filename NN_MODEL_FILENAME
The torchscript model. You can use icefall/egs/librisp
eech/ASR/pruned_transducer_statelessX/export.py
--jit=1 to generate this model. (default: None)
--bpe-model-filename BPE_MODEL_FILENAME
The BPE model You can find it in the directory
egs/librispeech/ASR/data/lang_bpe_xxx from icefall,
where xxx is the number of BPE tokens you used to
train the model. Note: Use it only when your model is
using BPE. You don't need to provide it if you provide
`--token-filename` (default: None)
--token-filename TOKEN_FILENAME
Filename for tokens.txt You can find it in the
directory egs/aishell/ASR/data/lang_char/tokens.txt
from icefall. Note: You don't need to provide it if
you provide `--bpe-model` (default: None)
--max-message-size MAX_MESSAGE_SIZE
Max message size in bytes. The max size per message
cannot exceed this limit. (default: 1048576)
--max-queue-size MAX_QUEUE_SIZE
Max number of messages in the queue for each
connection. (default: 32)
--max-active-connections MAX_ACTIVE_CONNECTIONS
Maximum number of active connections. The server will
refuse to accept new connections once the current
number of active connections equals to this limit.
(default: 500)
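To make the interplay between --max-batch-size and --max-wait-ms concrete, here is a minimal sketch of the batching idea described in the help text above. It is illustrative only and not sherpa's actual implementation: the server collects requests from a queue but never waits longer than max_wait_ms before sending whatever it has for computation.
#!/usr/bin/env python3
# Illustrative sketch of the batching behaviour controlled by
# --max-batch-size and --max-wait-ms. Not sherpa's actual code.

import asyncio
import time

async def build_batch(queue, max_batch_size=25, max_wait_ms=5):
    batch = [await queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough; send what we have
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch

async def main():
    queue = asyncio.Queue()
    for i in range(7):
        queue.put_nowait(f"request-{i}")
    print(await build_batch(queue, max_batch_size=5, max_wait_ms=5))
    print(await build_batch(queue, max_batch_size=5, max_wait_ms=5))

asyncio.run(main())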
The following shows an example of how to use the above pre-trained model to start the server:
#!/usr/bin/env bash
export CUDA_VISIBLE_DEVICES=0
nn_model_filename=./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/exp/cpu_jit-torch-1.6.0.pt
bpe_model=./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/data/lang_bpe_500/bpe.model
sherpa/bin/pruned_transducer_statelessX/offline_server.py \
--port 6010 \
--num-device 1 \
--max-batch-size 10 \
--max-wait-ms 5 \
--feature-extractor-pool-size 5 \
--nn-pool-size 2 \
--max-active-connections 10 \
--nn-model-filename $nn_model_filename \
--bpe-model-filename $bpe_model
When the server is started, you should see something like the following:
2022-06-21 18:51:58,424 INFO [offline_server.py:371] started
2022-06-21 18:51:58,426 INFO [server.py:707] server listening on 0.0.0.0:6010
2022-06-21 18:51:58,426 INFO [server.py:707] server listening on [::]:6010
Start the client
We also provide a Python script sherpa/bin/pruned_transducer_statelessX/offline_client.py for the client.
./sherpa/bin/pruned_transducer_statelessX/offline_client.py --help shows the following help information:
usage: offline_client.py [-h] [--server-addr SERVER_ADDR] [--server-port SERVER_PORT] sound_files [sound_files ...]
positional arguments:
sound_files The input sound file(s) to transcribe. Supported formats are those supported by torchaudio.load(). For
example, wav and flac are supported. The sample rate has to be 16kHz.
optional arguments:
-h, --help show this help message and exit
--server-addr SERVER_ADDR
Address of the server (default: localhost)
--server-port SERVER_PORT
Port of the server (default: 6006)
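Note from the help above that the input sound files have to be sampled at 16 kHz. If your recordings use a different sample rate, you can resample them before sending; the snippet below is a minimal sketch using torchaudio, and the file names are placeholders.
#!/usr/bin/env python3
# Resample a wave file to 16 kHz before sending it to the client.
# The file names below are placeholders.

import torchaudio

wave, sample_rate = torchaudio.load("input.wav")  # shape: (channels, samples)
if sample_rate != 16000:
    wave = torchaudio.functional.resample(wave, orig_freq=sample_rate, new_freq=16000)
torchaudio.save("input-16k.wav", wave, sample_rate=16000)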
We provide some test waves in the git repo you just cloned. The following command shows you how to start the client:
#!/usr/bin/env bash
export CUDA_VISIBLE_DEVICES=0
sherpa/bin/pruned_transducer_statelessX/offline_client.py \
--server-addr localhost \
--server-port 6010 \
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav \
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav \
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav
You will see the following output from the client side:
Sending ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
Sending ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
Sending ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
while the server side log is:
2022-06-21 18:51:58,424 INFO [offline_server.py:371] started
2022-06-21 18:51:58,426 INFO [server.py:707] server listening on 0.0.0.0:6010
2022-06-21 18:51:58,426 INFO [server.py:707] server listening on [::]:6010
2022-06-21 18:54:05,655 INFO [server.py:642] connection open
2022-06-21 18:54:05,655 INFO [offline_server.py:552] Connected: ('127.0.0.1', 33228). Number of connections: 1/10
2022-06-21 18:54:09,391 INFO [offline_server.py:573] Disconnected: ('127.0.0.1', 33228)
2022-06-21 18:54:09,392 INFO [server.py:260] connection closed
Congratulations! You have succeeded in starting the server and client using a pre-trained model with sherpa.
We provide a colab notebook for you to try this tutorial step by step.
It describes not only how to set up the environment, but also shows you how to compute the WER and RTF of the LibriSpeech test-clean dataset.
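For reference, the real time factor (RTF) mentioned above is the processing time divided by the audio duration; a value below 1 means decoding runs faster than real time. The numbers in the sketch below are made up for illustration.
#!/usr/bin/env python3
# Real time factor (RTF) = processing time / audio duration.
# The numbers below are made up for illustration.

audio_duration_seconds = 5.4 * 3600   # LibriSpeech test-clean is roughly 5.4 hours
processing_time_seconds = 0.5 * 3600  # hypothetical decoding time

rtf = processing_time_seconds / audio_duration_seconds
print(f"RTF: {rtf:.3f}")  # < 1 means faster than real time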