Python API for SenseVoice

This page describes how to use the Python API for SenseVoice.

See Install the Python Package for instructions on installing the sherpa-onnx Python package.

The quickest way is:

pip install sherpa-onnx

Decode a file

After installing the Python package, you can download the Python example code and run it with the following commands:

cd /tmp
git clone https://github.com/k2-fsa/sherpa-onnx.git/
cd sherpa-onnx

curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
tar xvf sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
rm sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2

python3 ./python-api-examples/offline-sense-voice-ctc-decode-files.py

You should see output similar to the following:

./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
{"text": "开放时间早上9点至下午5点。", "timestamps": [0.72, 0.96, 1.26, 1.44, 1.92, 2.10, 2.58, 2.82, 3.30, 3.90, 4.20, 4.56, 4.74, 5.46], "tokens":["开", "放", "时", "间", "早", "上", "9", "点", "至", "下", "午", "5", "点", "。"], "words": []}
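Each entry of tokens in the JSON line above lines up with the entry of timestamps at the same index. A minimal sketch of reading the result, assuming the timestamps are given in seconds (the output itself does not state the unit):

```python
import json

# The result line printed for zh.wav, copied from the output above.
line = (
    '{"text": "开放时间早上9点至下午5点。", '
    '"timestamps": [0.72, 0.96, 1.26, 1.44, 1.92, 2.10, 2.58, 2.82, '
    '3.30, 3.90, 4.20, 4.56, 4.74, 5.46], '
    '"tokens":["开", "放", "时", "间", "早", "上", "9", "点", "至", '
    '"下", "午", "5", "点", "。"], "words": []}'
)

result = json.loads(line)

# tokens[i] and timestamps[i] describe the same token.
pairs = list(zip(result["tokens"], result["timestamps"]))
for token, t in pairs:
    print(f"{t:5.2f}  {token}")
```

For this model the tokens are single characters, so joining them gives back the text field.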

You can find offline-sense-voice-ctc-decode-files.py at the following address:

https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/offline-sense-voice-ctc-decode-files.py
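For orientation, here is a minimal sketch of the kind of calls such a script makes. It assumes the model directory downloaded above, a 16-bit PCM wave file, and that numpy is installed. The helper names model_files and decode_file are illustrative and not part of sherpa-onnx. The linked script remains the reference.

```python
import wave
from pathlib import Path

MODEL_DIR = Path("./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17")


def model_files(model_dir, use_int8=False):
    """Return the model and tokens paths inside a SenseVoice model directory."""
    name = "model.int8.onnx" if use_int8 else "model.onnx"
    return {
        "model": str(Path(model_dir) / name),
        "tokens": str(Path(model_dir) / "tokens.txt"),
    }


def decode_file(wav_path, model_dir=MODEL_DIR, use_int8=False):
    """Decode one 16-bit PCM wave file and return the recognized text."""
    # Imported here so that model_files() works without these packages.
    import numpy as np
    import sherpa_onnx

    recognizer = sherpa_onnx.OfflineRecognizer.from_sense_voice(
        num_threads=2,
        use_itn=True,  # enable inverse text normalization
        **model_files(model_dir, use_int8),
    )

    with wave.open(str(wav_path), "rb") as f:
        assert f.getsampwidth() == 2, "this sketch expects 16-bit PCM"
        sample_rate = f.getframerate()
        num_channels = f.getnchannels()
        pcm = f.readframes(f.getnframes())

    # Keep the first channel and scale int16 samples to floats in [-1, 1).
    samples = np.frombuffer(pcm, dtype=np.int16).reshape(-1, num_channels)[:, 0]
    samples = samples.astype(np.float32) / 32768.0

    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    recognizer.decode_stream(stream)
    return stream.result.text


# Usage (requires the downloaded model):
# print(decode_file(MODEL_DIR / "test_wavs" / "zh.wav"))
```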

Speech recognition from a microphone

The following example shows how to use a microphone with SenseVoice and silero-vad for speech recognition:

cd /tmp/sherpa-onnx

# Assume you have downloaded the SenseVoice model

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

python3 ./python-api-examples/vad-with-non-streaming-asr.py  \
  --silero-vad-model=./silero_vad.onnx \
  --sense-voice=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --num-threads=2

You should see output similar to the following:

  0 Background Music, Core Audio (2 in, 2 out)
  1 Background Music (UI Sounds), Core Audio (2 in, 2 out)
> 2 MacBook Pro Microphone, Core Audio (1 in, 0 out)
< 3 MacBook Pro Speakers, Core Audio (0 in, 2 out)
  4 WeMeet Audio Device, Core Audio (2 in, 2 out)
Use default device: MacBook Pro Microphone
Creating recognizer. Please wait...
Started! Please speak

If you start speaking, you should see some output after you stop speaking.

Hint

Recognition starts only after silero-vad detects a pause, which is why results appear once you stop speaking.

Generate subtitles

This section describes how to use SenseVoice and silero-vad to generate subtitles.

Chinese

Test with a wave file containing Chinese:

cd /tmp/sherpa-onnx

# Assume you have downloaded the SenseVoice model

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/lei-jun-test.wav

python3 ./python-api-examples/generate-subtitles.py \
  --silero-vad-model=./silero_vad.onnx \
  --sense-voice=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --num-threads=2 \
  ./lei-jun-test.wav

The command generates a subtitle file, lei-jun-test.srt, with the following content:

1
0:00:28,934 --> 0:00:36,006
朋友们晚上好，欢迎大家来参加今天晚上的活动，谢谢大家。

2
0:00:42,118 --> 0:00:46,374
这是我第四次颁年度演讲。

3
0:00:46,918 --> 0:00:50,118
前三次呢，因为疫情的原因。

4
0:00:50,406 --> 0:00:55,750
都在小米科技园内举办，现场的人很少。

5
0:00:56,134 --> 0:00:57,574
这是第四次。

6
0:00:58,182 --> 0:01:06,854
我们仔细想了想，我们还是想办一个比较大的聚会，然后呢让我们的新朋友老朋友一起聚一聚。

7
0:01:07,718 --> 0:01:10,886
今天的话呢我们就在北京的。

8
0:01:11,654 --> 0:01:15,142
国家会议中心呢举办了这么一个活动。

9
0:01:15,430 --> 0:01:19,526
现场呢来了很多人，大概有3500人。

10
0:01:19,942 --> 0:01:22,278
还有很多很多的朋友呢。

11
0:01:22,694 --> 0:01:25,798
通过观看直播的方式来参与。

12
0:01:26,342 --> 0:01:30,886
再一次呢对大家的参加表示感谢，谢谢大家。

13
0:01:38,470 --> 0:01:39,910
两个月前。

14
0:01:40,358 --> 0:01:44,486
我参加了今年武汉大学的毕业典礼。

15
0:01:45,926 --> 0:01:47,334
今年呢是。

16
0:01:47,910 --> 0:01:50,694
武汉大学建校130周年。

17
0:01:51,750 --> 0:01:52,838
作为校友。

18
0:01:53,350 --> 0:01:54,886
被母校邀请。

19
0:01:55,206 --> 0:01:57,222
在毕业典礼上致辞。

20
0:01:58,054 --> 0:01:59,558
这对我来说。

21
0:01:59,814 --> 0:02:02,598
是至高无上的荣誉。

22
0:02:03,654 --> 0:02:05,670
站在讲台的那一刻。

23
0:02:06,246 --> 0:02:08,614
面对全校师生。

24
0:02:09,190 --> 0:02:11,462
关于武大的所有的记忆。

25
0:02:11,686 --> 0:02:14,182
一下子涌现在脑海里。

26
0:02:14,982 --> 0:02:17,670
今天呢我就先和大家聊聊。

27
0:02:18,278 --> 0:02:19,494
大往事。

28
0:02:21,830 --> 0:02:23,814
那还是36年前。

29
0:02:25,926 --> 0:02:27,654
1987年。

30
0:02:28,678 --> 0:02:31,622
我呢考上了武汉大学的计算机系。

31
0:02:32,678 --> 0:02:35,174
在武汉大学的图书馆里。

32
0:02:35,398 --> 0:02:36,710
看了一本书。

33
0:02:37,574 --> 0:02:38,630
硅谷之火。

34
0:02:39,334 --> 0:02:41,638
建立了我一生的梦想。

35
0:02:43,302 --> 0:02:44,454
看完书以后。

36
0:02:45,286 --> 0:02:46,438
热血沸腾。

37
0:02:47,590 --> 0:02:49,318
激动的睡不着觉。

38
0:02:50,406 --> 0:02:51,238
我还记得。

39
0:02:52,006 --> 0:02:52,966
那天晚上。

40
0:02:53,318 --> 0:02:54,662
星光很亮。

41
0:02:55,398 --> 0:02:57,670
我就在五大的操场上。

42
0:02:58,342 --> 0:02:59,782
就是屏幕上这个超场。

43
0:03:00,774 --> 0:03:02,470
走了一圈又一圈。

44
0:03:02,950 --> 0:03:05,222
走了整整一个晚上。

45
0:03:06,470 --> 0:03:07,750
我心里有团火。

46
0:03:08,934 --> 0:03:10,310
我也想搬一个。

47
0:03:10,598 --> 0:03:11,814
伟大的公司。

48
0:03:13,958 --> 0:03:14,822
就是这样。

49
0:03:17,606 --> 0:03:18,822
梦想之火。

50
0:03:19,270 --> 0:03:22,502
在我心里彻底点燃了。

51
0:03:29,766 --> 0:03:30,534
但是。

52
0:03:30,758 --> 0:03:32,550
一个大一的新生。

53
0:03:40,326 --> 0:03:42,726
是一个大一的新生。

54
0:03:43,814 --> 0:03:47,046
一个从县城里出来的年轻人。

55
0:03:48,134 --> 0:03:50,630
什么也不会，什么也没有。

56
0:03:51,526 --> 0:03:56,326
就想创办一家伟大的公司，这不就是天方夜谭吗？

57
0:03:57,574 --> 0:04:00,102
这么离谱的一个梦想。

58
0:04:00,358 --> 0:04:02,278
该如何实现呢？

59
0:04:03,846 --> 0:04:04,934
那天晚上。

60
0:04:05,190 --> 0:04:06,918
我想了一整晚上。

61
0:04:07,974 --> 0:04:08,966
说实话。

62
0:04:10,342 --> 0:04:13,798
越想越糊涂，完全理不清头绪。

63
0:04:14,982 --> 0:04:16,102
后来我在想。

64
0:04:16,774 --> 0:04:18,022
干脆别想了。

65
0:04:18,342 --> 0:04:19,878
把书练好。

66
0:04:20,422 --> 0:04:21,382
是正慑。

67
0:04:22,150 --> 0:04:22,982
所以呢。

68
0:04:23,366 --> 0:04:25,670
我就下定决心认认真真读书。

69
0:04:26,662 --> 0:04:27,174
那么。

70
0:04:28,486 --> 0:04:31,398
我怎么能够把书读的不同凡响呢？
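Each entry above consists of an index, a line of the form start --> end, and the recognized text. A minimal sketch of writing one entry in that layout from times given in seconds. The names format_time and format_entry are illustrative, and this is not necessarily how generate-subtitles.py does it:

```python
def format_time(seconds):
    """Format seconds as H:MM:SS,mmm, the layout used in the listing above."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d},{ms:03d}"


def format_entry(index, start, end, text):
    """Return one subtitle entry: index, time range, then the text."""
    return f"{index}\n{format_time(start)} --> {format_time(end)}\n{text}\n"


print(format_entry(1, 28.934, 36.006, "朋友们晚上好，欢迎大家来参加今天晚上的活动，谢谢大家。"))
```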

English

Test with a wave file containing English:

cd /tmp/sherpa-onnx

# Assume you have downloaded the SenseVoice model

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/Obama.wav

python3 ./python-api-examples/generate-subtitles.py \
  --silero-vad-model=./silero_vad.onnx \
  --sense-voice=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --num-threads=2 \
  ./Obama.wav

The command generates a subtitle file, Obama.srt, with the following content:

1
0:00:09,286 --> 0:00:12,486
Everybody, all right, everybody, go ahead and have a seat.

2
0:00:13,094 --> 0:00:15,014
How's everybody doing today?

3
0:00:18,694 --> 0:00:20,742
How about Tim Sper.

4
0:00:25,894 --> 0:00:31,942
I am here with students at Wakefield High School in Arlington, Virginia.

5
0:00:32,710 --> 0:00:48,326
And we've got students tuning in from all across America, from kindergarten through 12th grade. And I am just so glad that all could join us today. And I want to thank Wakefield for being such an outstanding host. Give yourselves a big line.

6
0:00:54,406 --> 0:00:59,238
Now I know that for many of you, today is the first day of school.

7
0:00:59,590 --> 0:01:09,798
And for those of you in kindergarten or starting middle or high school, it's your first day in a new school, so it's understandable if you're a little nervous.

8
0:01:10,630 --> 0:01:16,006
I imagine there's some seniors out there who are feeling pretty good right now with just one more year to go.

9
0:01:18,790 --> 0:01:27,142
And no matter what grade you're in, some of you are probably wishing it we're still summer, and you could have stayed in bed just a little bit longer this morning.

10
0:01:27,942 --> 0:01:29,414
I know that field.

11
0:01:31,654 --> 0:01:51,708
When I was young, my family lived overseas, I lived in Indonesia for a few years and my mother, she didn't have the money to send me where all the American kids went to school, but she thought it was important for me to keep up with American education, so she decided to teach me extra lessons herself.

12
0:01:52,230 --> 0:01:58,790
Monday through Friday, but because she had to go to work, the only time she could do it was at 430 in the morning.

13
0:02:00,038 --> 0:02:03,750
Now as you might imagine, I wasn't too happy about getting up that early.

14
0:02:04,102 --> 0:02:07,302
A lot of times I'd fall asleep right there at the kitchen table.

15
0:02:08,262 --> 0:02:15,014
But whenever I'd complained, my mother would just give me one of those looks and she'd say, this is no picnic for me either, Buster.

16
0:02:17,094 --> 0:02:25,382
So I know that some of you are still adjusting to being back at school, but I'm here today because I have something important to discuss with you.

17
0:02:25,798 --> 0:02:33,798
I'm here because I want to talk with you about your education and what's expected of all of you in this new school year.

18
0:02:34,470 --> 0:02:40,422
I've given a lot of speeches about education, and I've talked about responsibility a lot.

19
0:02:40,806 --> 0:02:47,174
I've talked about teachers responsibility for inspiring students and pushing you to learn.

20
0:02:47,430 --> 0:02:58,726
I talked about your parents' responsibility for making sure you stay on track and you get your homework done and don't spend every waking hour in front of the TV or with the Xbox.

21
0:02:59,078 --> 0:03:00,774
I've talked a lot about.

22
0:03:01,350 --> 0:03:13,286
Your government's responsibility for setting high standards and supporting teachers and principals and turning around schools that aren't working where students aren't getting the opportunities that they deserve.

23
0:03:13,990 --> 0:03:15,366
But at the end of the day.

24
0:03:16,006 --> 0:03:26,054
We can have the most dedicated teachers, the most supportive parents, the best schools in the world, and none of it will make a difference, none of it will matter.

25
0:03:26,694 --> 0:03:30,694
Unless all of you fulfill your responsibilities.

26
0:03:31,238 --> 0:03:43,814
Unless you show up to those schools, unless you pay attention to those teachers, unless you listen to your parents and grandparents and other adults and put in the hard work it takes to succeed.

27
0:03:44,646 --> 0:03:46,598
That's what I want to focus on today.

28
0:03:47,110 --> 0:03:50,918
The responsibility each of you has for your education.

29
0:03:51,718 --> 0:03:54,854
I want to start with the responsibility you have to yourself.

30
0:03:55,654 --> 0:03:59,078
Every single one of you has something that you're good at.

31
0:03:59,782 --> 0:04:02,406
Every single one of you has something to offer.

32
0:04:02,982 --> 0:04:07,590
And you have a responsibility to yourself to discover what that is.

33
0:04:08,326 --> 0:04:11,494
That's the opportunity an education can provide.

34
0:04:12,326 --> 0:04:22,598
Maybe you could be a great writer, maybe even good enough to write a book or articles in a newspaper, but you might not know it until you write that English paper.

35
0:04:23,078 --> 0:04:25,894
That English class paper that's assigned to you.

36
0:04:26,694 --> 0:04:38,726
Maybe you could be an innovator or an inventor, maybe even good enough to come up with the next iPhone or the new medicine or a vaccine, but you might not know it until you do your project for your science class.

37
0:04:39,814 --> 0:04:44,838
Maybe you could be a mayor or a senator or a Supreme Court justice.

38
0:04:45,350 --> 0:04:50,182
But you might not know that until you join student government or the debate team.

39
0:04:51,558 --> 0:04:56,774
And no matter what you want to do with your life, I guarantee that you'll need an education to do it.

40
0:04:57,318 --> 0:05:00,710
You want to be a doctor or a teacher or a police officer?

41
0:05:00,998 --> 0:05:09,702
You want to be a nurse or an architect, a lawyer or a member of our military, you're going to need a good education for every single one of those careers.

42
0:05:10,054 --> 0:05:14,278
You cannot drop out of school and just drop into a good job.

43
0:05:15,174 --> 0:05:19,846
You've got to train for it and work for it and learn for it.

44
0:05:20,518 --> 0:05:23,654
And this isn't just important for your own life and your own future.

45
0:05:24,678 --> 0:05:29,670
What you make of your education will decide nothing less than the future of this country.

46
0:05:29,958 --> 0:05:32,998
The future of America depends on you.
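The generated file is plain text, so it is easy to post-process. A minimal sketch that parses entries in the layout shown above and prints how long each one lasts. The name parse_srt is illustrative, and the sample string is copied from the first two entries of Obama.srt:

```python
import re

TIME = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp):
    """Convert H:MM:SS,mmm to seconds."""
    h, m, s, ms = (int(g) for g in TIME.fullmatch(stamp).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text):
    """Return a list of (index, start, end, text) tuples."""
    entries = []
    # Entries are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        start, end = (to_seconds(t.strip()) for t in lines[1].split("-->"))
        entries.append((int(lines[0]), start, end, "\n".join(lines[2:])))
    return entries


sample = """1
0:00:09,286 --> 0:00:12,486
Everybody, all right, everybody, go ahead and have a seat.

2
0:00:13,094 --> 0:00:15,014
How's everybody doing today?
"""

entries = parse_srt(sample)
for index, start, end, text in entries:
    print(f"{index}: {end - start:.3f} s  {text}")
```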

WebSocket server and client example

This example shows how to use a WebSocket server with SenseVoice for speech recognition.

1. Start the server

Run the following commands:

cd /tmp/sherpa-onnx

# Assume you have downloaded the SenseVoice model

python3 ./python-api-examples/non_streaming_server.py \
  --sense-voice=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt

You should see the following output after starting the server:

2024-07-28 20:22:38,389 INFO [non_streaming_server.py:1001] {'encoder': '', 'decoder': '', 'joiner': '', 'paraformer': '', 'sense_voice': './sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx', 'nemo_ctc': '', 'wenet_ctc': '', 'tdnn_model': '', 'whisper_encoder': '', 'whisper_decoder': '', 'whisper_language': '', 'whisper_task': 'transcribe', 'whisper_tail_paddings': -1, 'tokens': './sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt', 'num_threads': 2, 'provider': 'cpu', 'sample_rate': 16000, 'feat_dim': 80, 'decoding_method': 'greedy_search', 'max_active_paths': 4, 'hotwords_file': '', 'hotwords_score': 1.5, 'blank_penalty': 0.0, 'port': 6006, 'max_batch_size': 3, 'max_wait_ms': 5, 'nn_pool_size': 1, 'max_message_size': 1048576, 'max_queue_size': 32, 'max_active_connections': 200, 'certificate': None, 'doc_root': './python-api-examples/web'}
2024-07-28 20:22:41,861 INFO [non_streaming_server.py:647] started
2024-07-28 20:22:41,861 INFO [non_streaming_server.py:659] No certificate provided
2024-07-28 20:22:41,866 INFO [server.py:707] server listening on 0.0.0.0:6006
2024-07-28 20:22:41,866 INFO [server.py:707] server listening on [::]:6006
2024-07-28 20:22:41,866 INFO [non_streaming_server.py:679] Please visit one of the following addresses:

  http://localhost:6006

You can either visit the address http://localhost:6006 or write code to interact with the server.

In the following, we describe possible approaches for interacting with the WebSocket server.

Hint

The WebSocket server can handle multiple client connections at the same time.
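Before starting a client, it can help to confirm that the server is accepting connections. A minimal sketch using only the standard library. The host and port are taken from the server log above, and server_is_listening is an illustrative name:

```python
import socket


def server_is_listening(host="localhost", port=6006, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("server reachable:", server_is_listening())
```

This only checks that the port is open. It does not verify that the process behind it is the WebSocket server.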

2. Start the client (decode files sequentially)

The following command sends the files to the server one by one for decoding.

cd /tmp/sherpa-onnx

python3 ./python-api-examples/offline-websocket-client-decode-files-sequential.py \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav

You should see output similar to the following on the server side:

2024-07-28 20:28:15,749 INFO [server.py:642] connection open
2024-07-28 20:28:15,749 INFO [non_streaming_server.py:835] Connected: ('::1', 53252, 0, 0). Number of connections: 1/200
2024-07-28 20:28:15,933 INFO [non_streaming_server.py:851] result: 开放时间早上9点至下午5点。
2024-07-28 20:28:16,194 INFO [non_streaming_server.py:851] result: The tribal chieftain called for the boy and presented him with 50 pieces of gold.
2024-07-28 20:28:16,195 INFO [non_streaming_server.py:819] Disconnected: ('::1', 53252, 0, 0). Number of connections: 0/200
2024-07-28 20:28:16,196 INFO [server.py:260] connection closed

You should see output similar to the following on the client side:

2024-07-28 20:28:15,750 INFO [offline-websocket-client-decode-files-sequential.py:114] Sending ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
开放时间早上9点至下午5点。
2024-07-28 20:28:15,934 INFO [offline-websocket-client-decode-files-sequential.py:114] Sending ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
The tribal chieftain called for the boy and presented him with 50 pieces of gold.
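A sketch of how the example client appears to frame each file before sending it: a 4-byte little-endian sample rate, a 4-byte little-endian byte count, then the samples as 32-bit floats, sent in chunks. This layout and the chunk size are assumptions taken from reading the example scripts, so check offline-websocket-client-decode-files-sequential.py and non_streaming_server.py before relying on them. The names build_message and chunks are illustrative.

```python
import struct


def build_message(samples, sample_rate):
    """Pack a header and float32 samples into one bytes object."""
    body = struct.pack(f"<{len(samples)}f", *samples)  # little-endian float32
    header = struct.pack("<ii", sample_rate, len(body))  # rate, then byte count
    return header + body


def chunks(buf, size=10240):
    """Split buf into pieces of at most size bytes for successive sends."""
    return [buf[i:i + size] for i in range(0, len(buf), size)]


# Three made-up samples, only to show the framing.
message = build_message([0.0, 0.5, -0.5], 16000)
print(len(message), [len(c) for c in chunks(message, size=8)])
```

A real client would send each chunk as a binary WebSocket message and then wait for the text result.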

3. Start the client (decode files in parallel)

The following command sends the files to the server in parallel for decoding.

cd /tmp/sherpa-onnx

python3 ./python-api-examples/offline-websocket-client-decode-files-paralell.py \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav

You should see output similar to the following on the server side:

2024-07-28 20:31:10,147 INFO [server.py:642] connection open
2024-07-28 20:31:10,148 INFO [non_streaming_server.py:835] Connected: ('::1', 53436, 0, 0). Number of connections: 1/200
2024-07-28 20:31:10,149 INFO [server.py:642] connection open
2024-07-28 20:31:10,149 INFO [non_streaming_server.py:835] Connected: ('::1', 53437, 0, 0). Number of connections: 2/200
2024-07-28 20:31:10,353 INFO [non_streaming_server.py:851] result: 开放时间早上9点至下午5点。
2024-07-28 20:31:10,354 INFO [non_streaming_server.py:819] Disconnected: ('::1', 53436, 0, 0). Number of connections: 1/200
2024-07-28 20:31:10,356 INFO [server.py:260] connection closed
2024-07-28 20:31:10,541 INFO [non_streaming_server.py:851] result: The tribal chieftain called for the boy and presented him with 50 pieces of gold.
2024-07-28 20:31:10,542 INFO [non_streaming_server.py:819] Disconnected: ('::1', 53437, 0, 0). Number of connections: 0/200
2024-07-28 20:31:10,544 INFO [server.py:260] connection closed

You should see output similar to the following on the client side:

2024-07-28 20:31:10,112 INFO [offline-websocket-client-decode-files-paralell.py:139] {'server_addr': 'localhost', 'server_port': 6006, 'sound_files': ['./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav', './sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav']}
2024-07-28 20:31:10,148 INFO [offline-websocket-client-decode-files-paralell.py:113] Sending ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
2024-07-28 20:31:10,191 INFO [offline-websocket-client-decode-files-paralell.py:113] Sending ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
2024-07-28 20:31:10,353 INFO [offline-websocket-client-decode-files-paralell.py:131] ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
开放时间早上9点至下午5点。
2024-07-28 20:31:10,542 INFO [offline-websocket-client-decode-files-paralell.py:131] ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
The tribal chieftain called for the boy and presented him with 50 pieces of gold.
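The server log above shows two connections open at the same time, one per file. A minimal sketch of that pattern with asyncio. Here decode_one is a stand-in that only sleeps, whereas a real client would open its own WebSocket connection, send the file, and wait for the result:

```python
import asyncio


async def decode_one(wave_filename):
    """Stand-in for one client connection; a real client would talk to the server."""
    await asyncio.sleep(0.01)  # simulates network and decoding time
    return wave_filename, "<recognition result>"


async def decode_all(wave_filenames):
    """Run one task per file concurrently and return results in input order."""
    return await asyncio.gather(*(decode_one(f) for f in wave_filenames))


results = asyncio.run(decode_all(["zh.wav", "en.wav"]))
for name, text in results:
    print(name, text)
```

asyncio.gather returns the results in the order the files were passed in, even if a later file finishes first.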

4. Start the Web browser client

You can also start a browser to interact with the WebSocket server.

Please visit http://localhost:6006.

Warning

Because the server is started without a certificate, the only URL that works is http://localhost:6006.

All of the following addresses are incorrect:

Incorrect/Wrong address: https://localhost:6006

Incorrect/Wrong address: http://127.0.0.1:6006

Incorrect/Wrong address: https://127.0.0.1:6006

Incorrect/Wrong address: http://a.b.c.d:6006

Incorrect/Wrong address: https://a.b.c.d:6006

After starting the browser, you should see the following page:

Upload a file for recognition

If we click Upload, we will see the following page:

After clicking Click me to connect and Choose File, you will see the recognition result returned from the server:

Record your speech with a microphone for recognition

If you click Offline-Record, you should see the following page:

Click the button Click me to connect, then click the button Offline-Record and speak. When you have finished, click the button Offline-Stop. You should then see the results from the server. A screenshot is given below:

Note that you can save the recorded audio into a wave file for debugging.

The audio recorded for the above screenshot was saved to test.wav. Its properties are given below:

Input File     : 'test.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:07.00 = 112012 samples ~ 525.056 CDDA sectors
File Size      : 224k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
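The numbers in this report follow from the sample count. A short check using only values printed above, plus the assumption of a standard 44-byte WAV header:

```python
num_samples = 112012
sample_rate = 16000      # Hz
bits_per_sample = 16
channels = 1

duration = num_samples / sample_rate                         # seconds
bit_rate = sample_rate * bits_per_sample * channels          # bits per second
data_bytes = num_samples * channels * bits_per_sample // 8   # raw PCM bytes
file_bytes = data_bytes + 44                                 # assumed header size
cdda_sectors = duration * 75                                 # CDDA has 75 sectors per second

print(f"duration     : {duration:.2f} s")
print(f"bit rate     : {bit_rate // 1000}k")
print(f"file size    : {file_bytes / 1000:.0f}k")
print(f"CDDA sectors : {cdda_sectors:.3f}")
```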
