Pretrained model with GigaSpeech

Hint

We assume you have installed sherpa by following Installation before you start this section.

Download the pretrained model

sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/wgb14/icefall-asr-gigaspeech-pruned-transducer-stateless2

Caution

You have to use git lfs to download/clone the repo. Otherwise, the model files will not be downloaded correctly and sherpa will fail to load them.
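
Without git lfs, large files in a clone are replaced by small text pointer files. The following is a minimal sketch for detecting them, assuming the standard Git LFS pointer format, whose first line starts with "version https://git-lfs.github.com/spec/v1". The helper name is_lfs_pointer is ours and is not part of sherpa.

```shell
# Return 0 if the given file looks like a Git LFS pointer (that is, the real
# content was NOT downloaded), non-zero otherwise.
is_lfs_pointer() {
  # Pointer files are tiny text files whose first line names the LFS spec.
  head -c 64 "$1" | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Hypothetical usage, run from the directory that contains the clone:
# for f in icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/*.pt; do
#   is_lfs_pointer "$f" && echo "NOT downloaded: $f"
# done
```

If any file is reported, running git lfs pull inside the cloned repo should fetch the real content.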

After cloning the repo, you will find the following files:

icefall-asr-gigaspeech-pruned-transducer-stateless2/
|-- README.md
|-- data
|   `-- lang_bpe_500
|       `-- bpe.model
|-- exp
|   |-- cpu_jit-iter-3488000-avg-15.pt
|   |-- cpu_jit-iter-3488000-avg-20.pt
|   |-- pretrained-iter-3488000-avg-15.pt
|   `-- pretrained-iter-3488000-avg-20.pt

  • data/lang_bpe_500/bpe.model is the BPE model used in the training

  • exp/cpu_jit-iter-3488000-avg-15.pt and exp/cpu_jit-iter-3488000-avg-20.pt are two TorchScript models exported using torch.jit.script(). We can use either of them in the following tests.

Note

We won’t use pretrained-xxx.pt in sherpa.

Before we start, let us generate tokens.txt from the above bpe.model:

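cd icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500
wget https://raw.githubusercontent.com/k2-fsa/sherpa/master/scripts/bpe_model_to_tokens.py
# Files downloaded with wget are not executable by default
chmod +x ./bpe_model_to_tokens.py
./bpe_model_to_tokens.py ./bpe.model > tokens.txt
# Return to the directory that contains the cloned repo
cd ../../..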
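
Before moving on, it can help to sanity-check the generated file. The sketch below assumes that tokens.txt contains one "<token> <id>" pair per line with ids counting up from 0. This layout is an assumption on our side, so adjust the check if your file looks different. The helper name check_tokens is hypothetical.

```shell
# Check that every line has exactly two fields and that ids count up from 0.
# Assumption: tokens.txt has one "<token> <id>" pair per line.
check_tokens() {
  awk 'NF != 2 || $2 != NR - 1 { printf "bad line %d: %s\n", NR, $0; bad = 1 }
       END { if (bad) exit 1; printf "%d tokens OK\n", NR }' "$1"
}

# Hypothetical usage, run from the directory that contains the clone:
# check_tokens icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt
```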

The above repo does not contain test waves, so we download some from https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless5-2022-05-13:

cd icefall-asr-gigaspeech-pruned-transducer-stateless2
mkdir test_wavs
cd test_wavs

wget https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless5-2022-05-13/resolve/main/test_wavs/1089-134686-0001.wav

wget https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless5-2022-05-13/resolve/main/test_wavs/1221-135766-0001.wav

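wget https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless5-2022-05-13/resolve/main/test_wavs/1221-135766-0002.wav

# Return to the directory that contains the cloned repo;
# the paths used in the commands below are relative to it.
cd ../..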

In the following, we show you how to use the downloaded model for speech recognition.

Decode a single wave

nn_model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt
tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt

wav1=./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1089-134686-0001.wav

sherpa \
  --nn-model=$nn_model \
  --tokens=$tokens \
  --use-gpu=false \
  $wav1

You will see the following output:

[I] /usr/share/miniconda/envs/sherpa/conda-bld/sherpa_1661003501349/work/sherpa/csrc/parse_options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2022-08-20 22:35:42 sherpa --nn-model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt --tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt --use-gpu=false ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1089-134686-0001.wav

[I] /usr/share/miniconda/envs/sherpa/conda-bld/sherpa_1661003501349/work/sherpa/csrc/sherpa.cc:126:int main(int, char**) 2022-08-20 22:35:42
--nn-model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt
--tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt
--decoding-method=greedy_search
--use-gpu=false

[I] /usr/share/miniconda/envs/sherpa/conda-bld/sherpa_1661003501349/work/sherpa/csrc/sherpa.cc:270:int main(int, char**) 2022-08-20 22:35:43
filename: ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1089-134686-0001.wav
result:  AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS

Hint

You can pass the option --use-gpu=true to use a GPU for computation (assuming you have installed a CUDA-enabled version of sherpa).

Also, you can use --decoding-method=modified_beam_search to change the decoding method.
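
To compare the two decoding methods on the same wave, a small wrapper can help. The following is a sketch that uses only the options shown in this section. The function name decode_with_each_method is ours and is not part of sherpa.

```shell
# Decode one wave file with each decoding method mentioned above.
# Arguments: <nn-model> <tokens> <wave>
# The body runs in a subshell, so its variables do not leak to the caller.
decode_with_each_method() (
  nn_model=$1
  tokens=$2
  wav=$3
  for method in greedy_search modified_beam_search; do
    echo "== $method =="
    sherpa \
      --nn-model="$nn_model" \
      --tokens="$tokens" \
      --use-gpu=false \
      --decoding-method="$method" \
      "$wav"
  done
)

# Usage, with the variables defined in the command above:
# decode_with_each_method "$nn_model" "$tokens" "$wav1"
```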

Decode multiple waves in parallel

nn_model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt
tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt

wav1=./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1089-134686-0001.wav
wav2=./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1221-135766-0001.wav
wav3=./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1221-135766-0002.wav

sherpa \
  --nn-model=$nn_model \
  --tokens=$tokens \
  --use-gpu=false \
  $wav1 \
  $wav2 \
  $wav3

You will see the following output:

[I] /usr/share/miniconda/envs/sherpa/conda-bld/sherpa_1661003501349/work/sherpa/csrc/parse_options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2022-08-20 22:38:18 sherpa --nn-model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt --tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt --use-gpu=false ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1089-134686-0001.wav ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1221-135766-0001.wav ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1221-135766-0002.wav

[I] /usr/share/miniconda/envs/sherpa/conda-bld/sherpa_1661003501349/work/sherpa/csrc/sherpa.cc:126:int main(int, char**) 2022-08-20 22:38:19
--nn-model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt
--tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt
--decoding-method=greedy_search
--use-gpu=false

[I] /usr/share/miniconda/envs/sherpa/conda-bld/sherpa_1661003501349/work/sherpa/csrc/sherpa.cc:284:int main(int, char**) 2022-08-20 22:38:23
filename: ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1089-134686-0001.wav
result:  AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS

filename: ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1221-135766-0001.wav
result:  GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN

filename: ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1221-135766-0002.wav
result:  YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION

Decode wav.scp

If you have some experience with Kaldi, you already know what wav.scp is: a text file in which each line contains an utterance ID followed by the path to a wave file.

We use the following code to generate wav.scp for our test data.

cat > wav.scp <<EOF
wav1 ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1089-134686-0001.wav
wav2 ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1221-135766-0001.wav
wav3 ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs/1221-135766-0002.wav
EOF
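
For a larger set of files, writing wav.scp by hand becomes tedious. The sketch below generates the same kind of file from a directory. Unlike the example above, it uses the file name without the .wav extension as the utterance ID instead of wav1, wav2, and so on. The function name make_wav_scp is ours.

```shell
# Print one "<utterance-id> <path>" line per *.wav file in a directory.
# The utterance id is the file name without the .wav extension.
make_wav_scp() (
  for wav in "$1"/*.wav; do
    [ -e "$wav" ] || continue   # the glob matched nothing
    printf '%s %s\n' "$(basename "$wav" .wav)" "$wav"
  done
)

# Usage:
# make_wav_scp ./icefall-asr-gigaspeech-pruned-transducer-stateless2/test_wavs > wav.scp
```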

With the wav.scp ready, we can decode it with the following commands:

nn_model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt
tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt

sherpa \
  --nn-model=$nn_model \
  --tokens=$tokens \
  --use-gpu=false \
  --use-wav-scp=true \
  scp:wav.scp \
  ark,scp,t:results.ark,results.scp

You will see the following output:

[I] /usr/share/miniconda/envs/sherpa/conda-bld/sherpa_1661003501349/work/sherpa/csrc/parse_options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2022-08-20 22:40:36 sherpa --nn-model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt --tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt --use-gpu=false --use-wav-scp=true scp:wav.scp ark,scp,t:results.ark,results.scp

[I] /usr/share/miniconda/envs/sherpa/conda-bld/sherpa_1661003501349/work/sherpa/csrc/sherpa.cc:126:int main(int, char**) 2022-08-20 22:40:37
--nn-model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt
--tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt
--decoding-method=greedy_search
--use-gpu=false

We can view the recognition results using:

cat results.ark

wav1 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
wav2 GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
wav3 YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
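
Since results.ark is written in text mode here, with one "<utterance-id> <transcript>" line per utterance as shown above, standard tools can query it. The following is a small sketch for looking up a single utterance. The function name get_result is ours.

```shell
# Print the transcript stored for one utterance id in a text-mode ark file.
# Arguments: <utterance-id> <ark-file>. Returns non-zero if the id is missing.
get_result() {
  awk -v key="$1" '$1 == key { sub(/^[^ ]+ */, ""); print; found = 1 }
                   END { exit !found }' "$2"
}

# Usage:
# get_result wav3 results.ark
```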

Hint

You can pass the option --batch-size=20 to set the batch size to 20 during decoding.

Decode feats.scp

If you have precomputed features, you can decode them with the following commands:

nn_model=./icefall-asr-gigaspeech-pruned-transducer-stateless2/exp/cpu_jit-iter-3488000-avg-15.pt
tokens=./icefall-asr-gigaspeech-pruned-transducer-stateless2/data/lang_bpe_500/tokens.txt

sherpa \
  --nn-model=$nn_model \
  --tokens=$tokens \
  --use-gpu=false \
  --use-feats-scp=true \
  scp:feats.scp \
  ark,scp,t:results.ark,results.scp

Hint

You can pass the option --batch-size=20 to set the batch size to 20 during decoding.

Caution

feats.scp generated by Kaldi’s compute-fbank-feats uses unnormalized samples, i.e., audio samples in the range [-32768, 32767]. However, models from icefall are trained with features computed from normalized samples, i.e., samples in the range [-1, 1].

Do not use feats.scp generated by Kaldi’s compute-fbank-feats to test models that icefall trained with normalized audio samples; you won’t get good recognition results.

It is perfectly OK to decode feats.scp from Kaldi using a model trained with features computed from unnormalized audio samples.
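
To get a feel for the size of the mismatch, here is a rough back-of-the-envelope check. It assumes that the features are natural-log Mel filter-bank energies computed from the power spectrum, and it ignores dither and energy flooring. Under these assumptions, multiplying every sample by a gain g adds 2*ln(g) to each feature value. The function name fbank_log_offset is ours.

```shell
# Offset added to each log-Mel energy when all samples are scaled by a gain.
# Assumption: features are the natural log of filter-bank energies taken
# from the power spectrum, so a gain g on the samples adds 2*ln(g).
fbank_log_offset() {
  awk -v gain="$1" 'BEGIN { printf "%.2f\n", 2 * log(gain) }'
}

fbank_log_offset 32768   # prints 20.79
```

An offset of this size on every feature dimension is far outside what a model trained on normalized samples has seen.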

Note

We provide a script to generate feats.ark and feats.scp from wav.scp that can be used with models trained by icefall. Please see https://github.com/k2-fsa/sherpa/blob/master/.github/scripts/generate_feats_scp.py