Python API
Hint
It is known to work for Python >= 3.6 on Linux, macOS, and Windows.
In this section, we describe:

- How to install the Python package sherpa-ncnn
- How to use the sherpa-ncnn Python API for real-time speech recognition with a microphone
- How to use the sherpa-ncnn Python API to recognize a single file
Installation
You can use one of the four methods below to install the Python package sherpa-ncnn:
Method 1
Hint
This method supports x86_64, arm64 (e.g., Mac M1, 64-bit Raspberry Pi), and arm32 (e.g., 32-bit Raspberry Pi).
pip install sherpa-ncnn
If you use Method 1, it will install pre-compiled libraries. The disadvantage is that they may not be optimized for your platform; the advantage is that you don't need to install cmake or a C++ compiler.
For the following methods, you have to first install:

- cmake, which can be installed using pip install cmake
- A C++ compiler, e.g., GCC on Linux and macOS, or Visual Studio on Windows
Method 2
git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
python3 setup.py install
Method 3
pip install git+https://github.com/k2-fsa/sherpa-ncnn
Method 4 (For developers and embedded boards)
Without extra compiler flags (e.g., on x86/x86_64):

git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake \
-D SHERPA_NCNN_ENABLE_PYTHON=ON \
-D SHERPA_NCNN_ENABLE_PORTAUDIO=OFF \
-D BUILD_SHARED_LIBS=ON \
..
make -j6
export PYTHONPATH=$PWD/lib:$PWD/../sherpa-ncnn/python:$PYTHONPATH
For 32-bit ARM (armv7-a with NEON):

git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake \
-D SHERPA_NCNN_ENABLE_PYTHON=ON \
-D SHERPA_NCNN_ENABLE_PORTAUDIO=OFF \
-D BUILD_SHARED_LIBS=ON \
-DCMAKE_C_FLAGS="-march=armv7-a -mfloat-abi=hard -mfpu=neon" \
-DCMAKE_CXX_FLAGS="-march=armv7-a -mfloat-abi=hard -mfpu=neon" \
..
make -j6
export PYTHONPATH=$PWD/lib:$PWD/../sherpa-ncnn/python:$PYTHONPATH
For 64-bit ARM (armv8-a):

git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake \
-D SHERPA_NCNN_ENABLE_PYTHON=ON \
-D SHERPA_NCNN_ENABLE_PORTAUDIO=OFF \
-D BUILD_SHARED_LIBS=ON \
-DCMAKE_C_FLAGS="-march=armv8-a" \
-DCMAKE_CXX_FLAGS="-march=armv8-a" \
..
make -j6
export PYTHONPATH=$PWD/lib:$PWD/../sherpa-ncnn/python:$PYTHONPATH
Let us check whether sherpa-ncnn was installed successfully:
python3 -c "import sherpa_ncnn; print(sherpa_ncnn.__file__)"
python3 -c "import _sherpa_ncnn; print(_sherpa_ncnn.__file__)"
They should print the location of sherpa_ncnn and _sherpa_ncnn.
Hint
If you use Method 1, Method 2, or Method 3, you can also use
python3 -c "import sherpa_ncnn; print(sherpa_ncnn.__version__)"
It should print the version of sherpa-ncnn, e.g., 1.1.
Next, we describe how to use sherpa-ncnn Python API for speech recognition:
Real-time speech recognition with a microphone
Recognize a file
Real-time recognition with a microphone
The following Python code shows how to use sherpa-ncnn Python API for real-time speech recognition with a microphone.
Hint
We use sounddevice for recording. Please run pip install sounddevice before you run the code below.
Note
You can download the code from
import sys

try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn


def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    with sd.InputStream(
        channels=1, dtype="float32", samplerate=sample_rate
    ) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            recognizer.accept_waveform(sample_rate, samples)
            result = recognizer.text
            if last_result != result:
                last_result = result
                print(result)


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
Code explanation:
1. Import the required packages
try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn
Two packages are imported:

- sounddevice, for recording with a microphone
- sherpa-ncnn, for real-time speech recognition
2. Create the recognizer
def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
We use the model csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06 (Chinese + English) as an example, which is able to recognize both English and Chinese. You can replace it with other pre-trained models.
Please refer to Pre-trained models for more models.
Hint
The above example uses a float16 encoder and joiner. You can also use the following code to switch to the 8-bit (i.e., int8) quantized encoder and joiner.

recognizer = sherpa_ncnn.Recognizer(
    tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.param",
    encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.bin",
    decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.param",
    joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.bin",
    num_threads=4,
)
3. Start recording
sample_rate = recognizer.sample_rate
with sd.InputStream(
    channels=1, dtype="float32", samplerate=sample_rate
) as s:
Note that:

- We set channels to 1 since the model supports only a single channel.
- We use dtype float32 so that the resulting audio samples are normalized to the range [-1, 1].
- The sampling rate has to be recognizer.sample_rate, which is 16 kHz for all models at present.
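The [-1, 1] range comes from the usual int16-to-float conversion: int16 PCM samples span [-32768, 32767], and dividing by 32768 maps them into [-1, 1). sounddevice does this for you when dtype is "float32"; the following is just an illustrative sketch of the arithmetic, not part of sherpa-ncnn:

```python
import numpy as np

# int16 PCM samples span [-32768, 32767]; dividing by 32768
# maps them into [-1, 1), the range the recognizer expects.
samples_int16 = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
samples_float32 = samples_int16.astype(np.float32) / 32768
```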
4. Read audio samples from the microphone
samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
while True:
    samples, _ = s.read(samples_per_read)  # a blocking read
    samples = samples.reshape(-1)
Note that:

- It reads 100 ms of audio samples at a time. You can choose a larger value, e.g., 200 ms.
- No queue or callback is used. Instead, we use a blocking read here.
- The samples array is reshaped to a 1-D array.
5. Invoke the recognizer with audio samples
recognizer.accept_waveform(sample_rate, samples)
Note that:

- samples has to be a 1-D tensor and should be normalized to the range [-1, 1].
- Upon accepting the audio samples, the recognizer starts decoding automatically. There is no separate call for decoding.
6. Get the recognition result
result = recognizer.text
if last_result != result:
    last_result = result
    print(result)
We use recognizer.text to get the recognition result. To avoid unnecessary output, we compare whether there is a new result in recognizer.text and don't print to the console if nothing new has been recognized.
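If you prefer to show only the newly appended text instead of reprinting the whole result each time, you can diff the two strings yourself. new_suffix below is a hypothetical helper, not part of the sherpa-ncnn API:

```python
def new_suffix(result: str, last_result: str) -> str:
    # Hypothetical helper (not part of sherpa-ncnn):
    # return only the text appended since the previous result.
    if result.startswith(last_result):
        return result[len(last_result):]
    return result  # the result was revised; show it whole
```

Inside the loop you would then print new_suffix(result, last_result) with end="" before updating last_result.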
That’s it!
Summary
In summary, you need to:

1. Create the recognizer
2. Start recording
3. Read audio samples
4. Call recognizer.accept_waveform(sample_rate, samples)
5. Call recognizer.text to get the recognition result
The following is a YouTube video for demonstration.
Hint
If you don’t have access to YouTube, please see the following video from bilibili:
Note
https://github.com/k2-fsa/sherpa-ncnn/blob/master/python-api-examples/speech-recognition-from-microphone-with-endpoint-detection.py supports endpoint detection.
Please see the following video for its usage:
Recognize a file
The following Python code shows how to use sherpa-ncnn Python API to recognize a wave file.
Caution
The sampling rate of the wave file has to be 16 kHz. Also, it should contain only a single channel and samples should be 16-bit (i.e., int16) encoded.
Note
You can download the code from
import wave

import numpy as np
import sherpa_ncnn


def main():
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )

    filename = (
        "./sherpa-ncnn-conv-emformer-transducer-2022-12-06/test_wavs/1.wav"
    )
    with wave.open(filename) as f:
        assert f.getframerate() == recognizer.sample_rate, (
            f.getframerate(),
            recognizer.sample_rate,
        )
        assert f.getnchannels() == 1, f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_float32 = samples_int16.astype(np.float32)
        samples_float32 = samples_float32 / 32768

    recognizer.accept_waveform(recognizer.sample_rate, samples_float32)

    tail_paddings = np.zeros(
        int(recognizer.sample_rate * 0.5), dtype=np.float32
    )
    recognizer.accept_waveform(recognizer.sample_rate, tail_paddings)

    recognizer.input_finished()

    print(recognizer.text)


if __name__ == "__main__":
    main()
We use the model csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06 (Chinese + English) as an example, which is able to recognize both English and Chinese. You can replace it with other pre-trained models.
Please refer to Pre-trained models for more models.
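A note on the tail_paddings array in the code above: streaming models typically look ahead a few frames, so feeding a short stretch of silence before calling input_finished() helps flush the internal buffers and lets the last words be decoded. The padding is just an array of zeros; this sketch assumes the 16 kHz sample rate used by current models:

```python
import numpy as np

sample_rate = 16000  # all current models use 16 kHz
# 0.5 seconds of silence to flush the model's internal buffers
tail_paddings = np.zeros(int(sample_rate * 0.5), dtype=np.float32)
```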
Hint
The above example uses a float16 encoder and joiner. You can also use the following code to switch to the 8-bit (i.e., int8) quantized encoder and joiner.

recognizer = sherpa_ncnn.Recognizer(
    tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.param",
    encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.bin",
    decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.param",
    joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.bin",
    num_threads=4,
)