Japanese

Hint

Please refer to Installation to install sherpa-onnx before you read this section.

This page lists offline CTC models from NeMo for Japanese.

sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8 (Japanese)

This model is converted from https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja.

You can find the code for exporting the model from NeMo to sherpa-onnx at https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/nemo/parakeet-tdt_ctc-0.6b-ja.

The model was trained on the ReazonSpeech v2.0 speech corpus, which contains more than 35k hours of natural Japanese speech.

The following sections describe how to download the model and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8.tar.bz2
tar xvf sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8.tar.bz2
rm sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8.tar.bz2

After downloading, you should see something like the following:

ls -lh sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/
total 1310808
-rw-r--r--  1 fangjun  staff   625M Jul  9 11:11 model.int8.onnx
drwxr-xr-x  5 fangjun  staff   160B Jul  9 14:22 test_wavs
-rw-r--r--  1 fangjun  staff    28K Jul  9 14:21 tokens.txt
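
If you prefer to check the extracted directory from a script, a minimal Python sketch is shown below. It only tests that the three entries from the listing above exist. The helper name `missing_entries` is hypothetical and is not part of sherpa-onnx.

```python
from pathlib import Path

# Entries taken from the directory listing above.
EXPECTED_ENTRIES = ("model.int8.onnx", "tokens.txt", "test_wavs")


def missing_entries(model_dir):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # Assumes the archive was extracted into the current directory.
    model_dir = "./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8"
    missing = missing_entries(model_dir)
    print("OK" if not missing else f"Missing: {missing}")
```

An empty list means that all three entries are present. The sketch does not verify file sizes or contents.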

Decode wave files

Hint

Only single-channel wave files with 16-bit encoded samples are supported. The sampling rate does not need to be 16 kHz; audio at other sampling rates is resampled to 16 kHz.
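
To check whether a file meets these requirements before decoding it, you can read its header with Python's standard `wave` module. The following is a minimal sketch. The helper names `describe_wav` and `is_supported` are hypothetical and are not part of sherpa-onnx.

```python
import wave


def describe_wav(path):
    """Return (channels, bits_per_sample, sample_rate) read from the WAV header."""
    with wave.open(path, "rb") as f:
        return f.getnchannels(), f.getsampwidth() * 8, f.getframerate()


def is_supported(path):
    """True for single-channel, 16-bit files; the sampling rate is not checked."""
    channels, bits, _ = describe_wav(path)
    return channels == 1 and bits == 16
```

For example, `is_supported("./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/test_wavs/test_ja_1.wav")` should return `True` if the test file matches the requirements above. Note that the `wave` module reads only uncompressed PCM files and raises `wave.Error` for other encodings.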

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --nemo-ctc-model=./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/model.int8.onnx \
  --tokens=./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/tokens.txt \
  ./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/test_wavs/test_ja_1.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:372 ./build/bin/sherpa-onnx-offline --nemo-ctc-model=./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/model.int8.onnx --tokens=./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/tokens.txt ./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/test_wavs/test_ja_1.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model="./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/model.int8.onnx"), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), telespeech_ctc="", tokens="./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(dict_dir="", lexicon="", rule_fsts=""))
Creating recognizer ...
Started
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:164 Creating a resampler:
   in_sample_rate: 48000
   output_sample_rate: 16000

Done!

./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/test_wavs/test_ja_1.wav
{"lang": "", "emotion": "", "event": "", "text": " 日本語ちゃんと聞っとえてますか?ちゃんと聞こえてんの?ちゃんと聞こえてんの?ってマイクを持ってもらって", "timestamps": [0.00, 1.04, 1.36, 1.60, 1.68, 1.84, 2.08, 2.24, 2.48, 2.64, 4.72, 4.96, 5.20, 5.44, 5.60, 5.76, 6.00, 6.08, 8.08, 8.24, 8.48, 8.72, 8.88, 9.12, 9.36, 9.44, 10.80, 11.04, 11.36, 11.60, 11.68, 11.84, 12.00, 12.16, 12.24, 12.40, 12.56], "tokens":[" ", "日本", "語", "ちゃん", "と", "聞", "っと", "えて", "ます", "か", "?", "ちゃん", "と", "聞", "こ", "えて", "ん", "の", "?", "ちゃん", "と", "聞", "こ", "えて", "ん", "の", "?", "って", "マ", "イ", "ク", "を", "持", "って", "も", "ら", "って"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.375 s
Real time factor (RTF): 1.375 / 13.000 = 0.106
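
The result line in the output above is JSON, so it can be post-processed with a few lines of Python. The sketch below pairs each token with its timestamp and recomputes the real time factor from the logged numbers. It assumes that each entry in `timestamps` is the time in seconds of the token at the same index, which is consistent with both lists having the same length. The `result_line` string is a shortened excerpt of the output above, and the function names are hypothetical.

```python
import json

# Shortened excerpt of the JSON result above: only the first five tokens are kept.
result_line = (
    '{"text": " 日本語ちゃんと", '
    '"timestamps": [0.00, 1.04, 1.36, 1.60, 1.68], '
    '"tokens": [" ", "日本", "語", "ちゃん", "と"]}'
)


def token_times(line):
    """Pair each token with the timestamp at the same index."""
    result = json.loads(line)
    return list(zip(result["tokens"], result["timestamps"]))


def real_time_factor(elapsed_seconds, audio_seconds):
    """RTF = processing time divided by audio duration."""
    return elapsed_seconds / audio_seconds


for token, start in token_times(result_line):
    print(f"{start:6.2f}s  {token!r}")

# Values taken from the log above: 1.375 s elapsed for 13.000 s of audio.
print(f"RTF: {real_time_factor(1.375, 13.000):.3f}")
```

The last line prints `RTF: 0.106`, matching the value in the log. An RTF below 1 means that decoding took less time than the duration of the audio.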

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --nemo-ctc-model=./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/model.int8.onnx \
  --tokens=./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/tokens.txt

Speech recognition from a microphone with VAD

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

./build/bin/sherpa-onnx-vad-microphone-offline-asr \
  --silero-vad-model=./silero_vad.onnx \
  --nemo-ctc-model=./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/model.int8.onnx \
  --tokens=./sherpa-onnx-nemo-parakeet-tdt_ctc-0.6b-ja-35000-int8/tokens.txt