sherpa-onnx
Hint
Speech recognition does not require Internet access. Everything is processed locally on your device.
We support using ONNX with onnxruntime to replace PyTorch for neural network computation. The code lives in a separate repository, sherpa-onnx.
sherpa-onnx is self-contained and everything can be compiled from source.
Please refer to https://k2-fsa.github.io/icefall/model-export/export-onnx.html for how to export models to the ONNX format.
In the following, we describe how to build sherpa-onnx for Linux, macOS, Windows, embedded systems, Android, and iOS.
Also, we show how to use it for speech recognition with pre-trained models.
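Before a pre-trained model can transcribe anything, the audio has to be supplied as mono float samples in [-1, 1] at the model's expected sample rate (typically 16 kHz). As a minimal stdlib sketch of that preparation step (the sherpa-onnx calls in the trailing comment are indicative only, not an exact API reference):

```python
import array
import wave


def read_wave(path):
    """Read a 16-bit mono WAV file; return (sample_rate, float samples in [-1, 1])."""
    with wave.open(path, "rb") as f:
        assert f.getnchannels() == 1, "expected mono audio"
        assert f.getsampwidth() == 2, "expected 16-bit samples"
        sample_rate = f.getframerate()
        raw = f.readframes(f.getnframes())
    pcm = array.array("h", raw)  # signed 16-bit integers
    samples = [s / 32768.0 for s in pcm]  # scale to [-1.0, 1.0]
    return sample_rate, samples


# The samples can then be handed to a recognizer stream, roughly:
#   stream = recognizer.create_stream()
#   stream.accept_waveform(sample_rate, samples)
```

In practice you would let sherpa-onnx (or a library such as soundfile) handle this, but the sketch shows the exact sample format the recognizers consume.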
- Tutorials
- 中文资料 (Chinese tutorials)
- 2024-10-09 Getting started with a local intelligent voice assistant based on sherpa (Java API edition)
- 2024-07-03 🆓 Speech recognition engine sherpa-onnx on CPU (part 1): try speech recognition with ease (Docker setup)
- 2024-06-10 SherpaOnnxTtsEngine - a local text-to-speech (TTS) engine for Android
- 2024-06-10 Building 100 applications with LLMs (1): building your own Windows Jarvis from scratch (part 1)
- 2024-05-09 Notes on installing and using sherpa-onnx
- 2024-04-09 Porting sherpa-onnx to rv1106/rv1109/rv1126 for offline TTS
- 2023-08-08 Offline speech recognition with snowboy + next-generation Kaldi (k2-fsa) sherpa-onnx (voice assistant)
- 2023-03-16 k2 speech recognition: how to use sherpa-onnx
- Installation
- Frequently Asked Questions (FAQs)
- The difference between online, offline, streaming, and non-streaming (在线、离线、流式、非流式的区别)
- Cannot open shared library libasound_module_conf_pulse.so
- No sound from the Chinese TTS models (TTS 中文模型没有声音)
- ./gitcompile: line 89: libtoolize: command not found
- OSError: PortAudio library not found
- imports github.com/k2-fsa/sherpa-onnx-go-linux: build constraints exclude all Go files
- External buffers are not allowed
- The given version [17] is not supported, only version 1 to 10 is supported in this build
- Python
- C API
- Java API
- JavaScript API
- Kotlin API
- Swift API
- Go API
- C# API
- Pascal API
- Lazarus
- WebAssembly
- Android
- iOS
- Flutter
- WebSocket
- Hotwords (Contextual biasing)
- Keyword spotting
- Punctuation
- Audio tagging
- Spoken language identification
- VAD
- Pre-trained models
- Online transducer models
- Zipformer-transducer-based Models
- sherpa-onnx-streaming-zipformer-korean-2024-06-16 (Korean)
- sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12 (Chinese)
- k2-fsa/icefall-asr-zipformer-wenetspeech-streaming-small (Chinese)
- k2-fsa/icefall-asr-zipformer-wenetspeech-streaming-large (Chinese)
- pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615 (Chinese)
- csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26 (English)
- csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-21 (English)
- csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-02-21 (English)
- csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 (Bilingual, Chinese + English)
- shaojieli/sherpa-onnx-streaming-zipformer-fr-2023-04-14 (French)
- sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16 (Bilingual, Chinese + English)
- csukuangfj/sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23 (Chinese)
- csukuangfj/sherpa-onnx-streaming-zipformer-en-20M-2023-02-17 (English)
- Conformer-transducer-based Models
- LSTM-transducer-based Models
- Online paraformer models
- Online CTC models
- Offline transducer models
- Zipformer-transducer-based Models
- sherpa-onnx-zipformer-ru-2024-09-18 (Russian, 俄语)
- sherpa-onnx-small-zipformer-ru-2024-09-18 (Russian, 俄语)
- sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01 (Japanese, 日语)
- sherpa-onnx-zipformer-korean-2024-06-24 (Korean, 韩语)
- sherpa-onnx-zipformer-thai-2024-06-20 (Thai, 泰语)
- sherpa-onnx-zipformer-cantonese-2024-03-13 (Cantonese, 粤语)
- sherpa-onnx-zipformer-gigaspeech-2023-12-12 (English)
- zrjin/sherpa-onnx-zipformer-multi-zh-hans-2023-9-2 (Chinese)
- yfyeung/icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17 (English)
- k2-fsa/icefall-asr-zipformer-wenetspeech-small (Chinese)
- k2-fsa/icefall-asr-zipformer-wenetspeech-large (Chinese)
- pkufool/icefall-asr-zipformer-wenetspeech-20230615 (Chinese)
- csukuangfj/sherpa-onnx-zipformer-large-en-2023-06-26 (English)
- csukuangfj/sherpa-onnx-zipformer-small-en-2023-06-26 (English)
- csukuangfj/sherpa-onnx-zipformer-en-2023-06-26 (English)
- icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04 (English)
- csukuangfj/sherpa-onnx-zipformer-en-2023-04-01 (English)
- csukuangfj/sherpa-onnx-zipformer-en-2023-03-30 (English)
- Conformer-transducer-based Models
- NeMo transducer-based Models
- Offline paraformer models
- Paraformer models
- csukuangfj/sherpa-onnx-paraformer-trilingual-zh-cantonese-en (Chinese + English + Cantonese 粤语)
- csukuangfj/sherpa-onnx-paraformer-en-2024-03-09 (English)
- csukuangfj/sherpa-onnx-paraformer-zh-small-2024-03-09 (Chinese + English)
- csukuangfj/sherpa-onnx-paraformer-zh-2024-03-09 (Chinese + English)
- csukuangfj/sherpa-onnx-paraformer-zh-2023-03-28 (Chinese + English)
- csukuangfj/sherpa-onnx-paraformer-zh-2023-09-14 (Chinese + English)
- Offline CTC models
- TeleSpeech
- Whisper
- WeNet
- Small models
- Online transducer models
- Moonshine
- SenseVoice
- Speaker Diarization
- Pre-trained models
- Huggingface space for speaker diarization
- Android APKs for speaker diarization
- C API examples
- C++ API examples
- C# API examples
- Dart API examples
- Go API examples
- Java API examples
- JavaScript API examples
- Kotlin API examples
- Pascal API examples
- Python API examples
- Rust API examples
- Swift API examples
- Text-to-speech (TTS)
- Huggingface space
- Pre-trained models
- vits
- All models in a single table
- vits-melo-tts-zh_en (Chinese + English, 1 speaker)
- vits-piper-en_US-glados (English, 1 speaker)
- vits-piper-en_US-libritts_r-medium (English, 904 speakers)
- ljspeech (English, single-speaker)
- VCTK (English, multi-speaker, 109 speakers)
- csukuangfj/sherpa-onnx-vits-zh-ll (Chinese, 5 speakers)
- csukuangfj/vits-zh-hf-fanchen-C (Chinese, 187 speakers)
- csukuangfj/vits-zh-hf-fanchen-wnj (Chinese, 1 male)
- csukuangfj/vits-zh-hf-theresa (Chinese, 804 speakers)
- csukuangfj/vits-zh-hf-eula (Chinese, 804 speakers)
- aishell3 (Chinese, multi-speaker, 174 speakers)
- en_US-lessac-medium (English, single-speaker)
- WebAssembly
- Piper
- MMS