Python API
Hint
It is known to work for Python >= 3.6 on Linux, macOS, and Windows.
In this section, we describe:

- How to install the Python package sherpa-ncnn
- How to use the sherpa-ncnn Python API for real-time speech recognition with a microphone
- How to use the sherpa-ncnn Python API to recognize a single file
Installation
You can use one of the four methods below to install the Python package sherpa-ncnn:
Method 1
Hint
This method supports x86_64, arm64 (e.g., Mac M1, 64-bit Raspberry Pi), and arm32 (e.g., 32-bit Raspberry Pi).
pip install sherpa-ncnn
If you use Method 1, it will install pre-compiled libraries. The disadvantage is that they may not be optimized for your platform; the advantage is that you don't need to install cmake or a C++ compiler.
For the following methods, you have to first install:

- cmake, which can be installed using pip install cmake
- A C++ compiler, e.g., GCC on Linux and macOS, or Visual Studio on Windows
Method 2
git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
python3 setup.py install
Method 3
pip install git+https://github.com/k2-fsa/sherpa-ncnn
Method 4 (For developers and embedded boards)
Without extra compiler flags (e.g., on x86/x86_64):

git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake \
-D SHERPA_NCNN_ENABLE_PYTHON=ON \
-D SHERPA_NCNN_ENABLE_PORTAUDIO=OFF \
-D BUILD_SHARED_LIBS=ON \
..
make -j6
export PYTHONPATH=$PWD/lib:$PWD/../sherpa-ncnn/python:$PYTHONPATH
For 32-bit ARM (armv7-a with NEON):

git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake \
-D SHERPA_NCNN_ENABLE_PYTHON=ON \
-D SHERPA_NCNN_ENABLE_PORTAUDIO=OFF \
-D BUILD_SHARED_LIBS=ON \
-DCMAKE_C_FLAGS="-march=armv7-a -mfloat-abi=hard -mfpu=neon" \
-DCMAKE_CXX_FLAGS="-march=armv7-a -mfloat-abi=hard -mfpu=neon" \
..
make -j6
export PYTHONPATH=$PWD/lib:$PWD/../sherpa-ncnn/python:$PYTHONPATH
For 64-bit ARM (armv8-a):

git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake \
-D SHERPA_NCNN_ENABLE_PYTHON=ON \
-D SHERPA_NCNN_ENABLE_PORTAUDIO=OFF \
-D BUILD_SHARED_LIBS=ON \
-DCMAKE_C_FLAGS="-march=armv8-a" \
-DCMAKE_CXX_FLAGS="-march=armv8-a" \
..
make -j6
export PYTHONPATH=$PWD/lib:$PWD/../sherpa-ncnn/python:$PYTHONPATH
Let us check whether sherpa-ncnn was installed successfully:
python3 -c "import sherpa_ncnn; print(sherpa_ncnn.__file__)"
python3 -c "import _sherpa_ncnn; print(_sherpa_ncnn.__file__)"
They should print the location of sherpa_ncnn and _sherpa_ncnn.
Hint
If you use Method 1, Method 2, or Method 3, you can also use
python3 -c "import sherpa_ncnn; print(sherpa_ncnn.__version__)"
It should print the version of sherpa-ncnn, e.g., 1.1.
Next, we describe how to use sherpa-ncnn Python API for speech recognition:
Real-time speech recognition with a microphone
Recognize a file
Real-time recognition with a microphone
The following Python code shows how to use sherpa-ncnn Python API for real-time speech recognition with a microphone.
Hint
We use sounddevice for recording. Please run pip install sounddevice before you run the code below.
Note
You can download the code from
import sys

try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn


def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    with sd.InputStream(
        channels=1, dtype="float32", samplerate=sample_rate
    ) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            recognizer.accept_waveform(sample_rate, samples)
            result = recognizer.text
            if last_result != result:
                last_result = result
                print(result)


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
Code explanation:
1. Import the required packages
try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn
Two packages are imported:

- sounddevice, for recording with a microphone
- sherpa-ncnn, for real-time speech recognition
2. Create the recognizer
def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
We use the model csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06 (Chinese + English) as an example, which is able to recognize both English and Chinese. You can replace it with other pre-trained models.
Please refer to Pre-trained models for more models.
Hint
The above example uses a float16 encoder and joiner. You can also use the following code to switch to the 8-bit (i.e., int8) quantized encoder and joiner.

recognizer = sherpa_ncnn.Recognizer(
    tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.param",
    encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.bin",
    decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.param",
    joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.bin",
    num_threads=4,
)
3. Start recording
sample_rate = recognizer.sample_rate
with sd.InputStream(
    channels=1, dtype="float32", samplerate=sample_rate
) as s:
Note that:

- We set channels to 1 since the model supports only a single channel.
- We use dtype float32 so that the resulting audio samples are normalized to the range [-1, 1].
- The sampling rate has to be recognizer.sample_rate, which is 16 kHz for all models at present.
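The [-1, 1] range comes from the usual int16-to-float conversion: int16 PCM samples span [-32768, 32767], and dividing by 32768 maps them into [-1, 1). sounddevice does this for you when dtype is "float32"; the following is just an illustrative sketch of the arithmetic, not part of sherpa-ncnn:

```python
import numpy as np

# int16 PCM samples span [-32768, 32767]; dividing by 32768
# maps them into [-1, 1), the range the recognizer expects.
samples_int16 = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
samples_float32 = samples_int16.astype(np.float32) / 32768
```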
4. Read audio samples from the microphone
samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
while True:
    samples, _ = s.read(samples_per_read)  # a blocking read
    samples = samples.reshape(-1)
Note that:

- It reads 100 ms of audio samples at a time. You can choose a larger value, e.g., 200 ms.
- No queue or callback is used. Instead, we use a blocking read here.
- The samples array is reshaped to a 1-D array.
5. Invoke the recognizer with audio samples
recognizer.accept_waveform(sample_rate, samples)
Note that:

- samples has to be a 1-D tensor and should be normalized to the range [-1, 1].
- Upon accepting the audio samples, the recognizer starts decoding automatically. There is no separate call for decoding.
6. Get the recognition result
result = recognizer.text
if last_result != result:
    last_result = result
    print(result)
We use recognizer.text to get the recognition result. To avoid unnecessary output, we compare whether there is a new result in recognizer.text and don't print to the console if nothing new has been recognized.
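If you prefer to show only the newly appended text instead of reprinting the whole result each time, you can diff the two strings yourself. new_suffix below is a hypothetical helper, not part of the sherpa-ncnn API:

```python
def new_suffix(result: str, last_result: str) -> str:
    # Hypothetical helper (not part of sherpa-ncnn):
    # return only the text appended since the previous result.
    if result.startswith(last_result):
        return result[len(last_result):]
    return result  # the result was revised; show it whole
```

Inside the loop you would then print new_suffix(result, last_result) with end="" before updating last_result.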
That’s it!
Summary
In summary, you need to:

1. Create the recognizer
2. Start recording
3. Read audio samples
4. Call recognizer.accept_waveform(sample_rate, samples)
5. Call recognizer.text to get the recognition result
The following is a YouTube video for demonstration.
Hint
If you don’t have access to YouTube, please see the following video from bilibili:
Note
https://github.com/k2-fsa/sherpa-ncnn/blob/master/python-api-examples/speech-recognition-from-microphone-with-endpoint-detection.py supports endpoint detection.
Please see the following video for its usage:
Recognize a file
The following Python code shows how to use sherpa-ncnn Python API to recognize a wave file.
Caution
The sampling rate of the wave file has to be 16 kHz. Also, it should contain only a single channel and samples should be 16-bit (i.e., int16) encoded.
Note
You can download the code from
import wave

import numpy as np
import sherpa_ncnn


def main():
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )

    filename = (
        "./sherpa-ncnn-conv-emformer-transducer-2022-12-06/test_wavs/1.wav"
    )
    with wave.open(filename) as f:
        assert f.getframerate() == recognizer.sample_rate, (
            f.getframerate(),
            recognizer.sample_rate,
        )
        assert f.getnchannels() == 1, f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_float32 = samples_int16.astype(np.float32)
        samples_float32 = samples_float32 / 32768

    recognizer.accept_waveform(recognizer.sample_rate, samples_float32)

    tail_paddings = np.zeros(
        int(recognizer.sample_rate * 0.5), dtype=np.float32
    )
    recognizer.accept_waveform(recognizer.sample_rate, tail_paddings)

    recognizer.input_finished()

    print(recognizer.text)


if __name__ == "__main__":
    main()
We use the model csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06 (Chinese + English) as an example, which is able to recognize both English and Chinese. You can replace it with other pre-trained models.
Please refer to Pre-trained models for more models.
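A note on the tail_paddings array in the code above: streaming models typically look ahead a few frames, so feeding a short stretch of silence before calling input_finished() helps flush the internal buffers and lets the last words be decoded. The padding is just an array of zeros; this sketch assumes the 16 kHz sample rate used by current models:

```python
import numpy as np

sample_rate = 16000  # all current models use 16 kHz
# 0.5 seconds of silence to flush the model's internal buffers
tail_paddings = np.zeros(int(sample_rate * 0.5), dtype=np.float32)
```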
Hint
The above example uses a float16 encoder and joiner. You can also use the following code to switch to the 8-bit (i.e., int8) quantized encoder and joiner.

recognizer = sherpa_ncnn.Recognizer(
    tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.param",
    encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.bin",
    decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.param",
    joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.bin",
    num_threads=4,
)