Pre-trained models

This section lists pre-trained models for adding punctuations to text.

You can find all models at the following URL:

sherpa-onnx-online-punct-en-2024-08-06

This model is from https://github.com/frankyoujian/Edge-Punct-Casing and it supports only English.

Please see its paper at https://arxiv.org/abs/2407.13142 for more details.

In the following, we describe how to download and use it with sherpa-onnx.

Download the model

Please use the following commands to download it:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/punctuation-models/sherpa-onnx-online-punct-en-2024-08-06.tar.bz2
tar xvf sherpa-onnx-online-punct-en-2024-08-06.tar.bz2
rm sherpa-onnx-online-punct-en-2024-08-06.tar.bz2

You will find the following files after unzipping:

ls -lh sherpa-onnx-online-punct-en-2024-08-06/
total 74416
-rw-r--r--  1 fangjun  staff   244B Aug  6  2024 README.md
-rw-r--r--  1 fangjun  staff   146K Aug  5  2024 bpe.vocab
-rw-r--r--  1 fangjun  staff   7.1M Aug  5  2024 model.int8.onnx
-rw-r--r--  1 fangjun  staff    28M Aug  5  2024 model.onnx

Note you only need two files:

  • model.onnx + bpe.vocab

  • or model.int8.onnx + bpe.vocab

C++ binary examples

After installing sherpa-onnx, you can use the following command to add punctuations to text with the model.onnx:

./bin/sherpa-onnx-online-punctuation \
  --cnn-bilstm=./sherpa-onnx-online-punct-en-2024-08-06/model.onnx \
  --bpe-vocab=./sherpa-onnx-online-punct-en-2024-08-06/bpe.vocab \
  "how are you i am fine thank you"

The output is given below:

OnlinePunctuationConfig(model=OnlinePunctuationModelConfig(cnn_bilstm="./sherpa-onnx-online-punct-en-2024-08-06/model.onnx", bpe_vocab="./sherpa-onnx-online-punct-en-2024-08-06/bpe.vocab", num_threads=1, debug=False, provider="cpu"))
Creating OnlinePunctuation ...
Started
Done
Num threads: 1
Elapsed seconds: 0.030 s
Input text: how are you i am fine thank you
Output text: How are you? I am fine. Thank you.

To use the model.int8.onnx, you can run:

./bin/sherpa-onnx-online-punctuation \
  --cnn-bilstm=./sherpa-onnx-online-punct-en-2024-08-06/model.int8.onnx \
  --bpe-vocab=./sherpa-onnx-online-punct-en-2024-08-06/bpe.vocab \
  "how are you i am fine thank you"

The output is given below:

OnlinePunctuationConfig(model=OnlinePunctuationModelConfig(cnn_bilstm="./sherpa-onnx-online-punct-en-2024-08-06/model.int8.onnx", bpe_vocab="./sherpa-onnx-online-punct-en-2024-08-06/bpe.vocab", num_threads=1, debug=False, provider="cpu"))
Creating OnlinePunctuation ...
Started
Done
Num threads: 1
Elapsed seconds: 0.013 s
Input text: how are you i am fine thank you
Output text: How are you? I am fine. Thank you.

sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8

This model is converted from

and it supports both Chinese and English.

Hint

If you want to know how the model is converted to sherpa-onnx, please download it and you can find related scripts in the downloaded model directory.

In the following, we describe how to download and use it with sherpa-onnx.

Download the model

Please use the following commands to download it:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/punctuation-models/sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8.tar.bz2

tar xvf sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8.tar.bz2
rm sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8.tar.bz2

You will find the following files after unzipping:

total 155776
-rw-r--r--  1 fangjun  staff   1.5K Jun 18 10:34 README.md
-rwxr-xr-x  1 fangjun  staff   1.6K Apr 12  2024 add-model-metadata.py
-rw-r--r--  1 fangjun  staff   810B Apr 12  2024 config.yaml
-rw-r--r--  1 fangjun  staff    72M Jun 18 10:33 model.int8.onnx
-rwxr-xr-x  1 fangjun  staff   745B Apr 12  2024 show-model-input-output.py
-rwxr-xr-x  1 fangjun  staff   4.6K Apr 12  2024 test.py
-rw-r--r--  1 fangjun  staff   4.0M Apr 12  2024 tokens.json

Only model.int8.onnx is needed in sherpa-onnx. All other files are for your information about how the model is converted to sherpa-onnx.

C++ binary examples

After installing sherpa-onnx, you can use the following command to add punctuations to text:

./bin/sherpa-onnx-offline-punctuation \
  --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8/model.int8.onnx \
  "我们都是木头人不会说话不会动"

The output is given below:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:372 ./bin/sherpa-onnx-offline-punctuation --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8/model.int8.onnx '我们都是木头人不会说话不会动'

OfflinePunctuationConfig(model=OfflinePunctuationModelConfig(ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8/model.int8.onnx", num_threads=1, debug=False, provider="cpu"))
Creating OfflinePunctuation ...
Started
Done
Num threads: 1
Elapsed seconds: 0.014 s
Input text: 我们都是木头人不会说话不会动
Output text: 我们都是木头人,不会说话,不会动。

The second example is for text containing both Chinese and English:

./bin/sherpa-onnx-offline-punctuation \
  --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8/model.int8.onnx \
  "这是一个测试你好吗How are you我很好thank you are you ok谢谢你"

Its output is given below:

OfflinePunctuationConfig(model=OfflinePunctuationModelConfig(ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8/model.int8.onnx", num_threads=1, debug=False, provider="cpu"))
Creating OfflinePunctuation ...
Started
Done
Num threads: 1
Elapsed seconds: 0.010 s
Input text: 这是一个测试你好吗How are you我很好thank you are you ok谢谢你
Output text: 这是一个测试你好吗?How are you?我很好?thank you,are you ok,谢谢你。

The last example is for text containing only English:

./bin/sherpa-onnx-offline-punctuation \
  --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8/model.int8.onnx \
  "The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry"

The last example is for text containing only English:

OfflinePunctuationConfig(model=OfflinePunctuationModelConfig(ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12-int8/model.int8.onnx", num_threads=1, debug=False, provider="cpu"))
Creating OfflinePunctuation ...
Started
Done
Num threads: 1
Elapsed seconds: 0.007 s
Input text: The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry
Output text: The African blogosphere is rapidly expanding,bringing more voices online in the form of commentaries,opinions,analyses,rants and poetry。

sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12

This model is converted from

and it supports both Chinese and English.

Hint

If you want to know how the model is converted to sherpa-onnx, please download it and you can find related scripts in the downloaded model directory.

In the following, we describe how to download and use it with sherpa-onnx.

Download the model

Please use the following commands to download it:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/punctuation-models/sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2

tar xvf sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2
rm sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2

You will find the following files after unzipping:

-rw-r--r--  1 fangjun  staff   1.4K Apr 12 12:32 README.md
-rwxr-xr-x  1 fangjun  staff   1.6K Apr 12 14:40 add-model-metadata.py
-rw-r--r--  1 fangjun  staff   810B Apr 12 11:56 config.yaml
-rw-r--r--  1 fangjun  staff    42B Apr 12 11:45 configuration.json
-rw-r--r--  1 fangjun  staff   281M Apr 12 14:40 model.onnx
-rwxr-xr-x  1 fangjun  staff   745B Apr 12 11:53 show-model-input-output.py
-rwxr-xr-x  1 fangjun  staff   4.9K Apr 13 18:45 test.py
-rw-r--r--  1 fangjun  staff   4.0M Apr 12 11:56 tokens.json

Only model.onnx is needed in sherpa-onnx. All other files are for your information about how the model is converted to sherpa-onnx.

C++ binary examples

After installing sherpa-onnx, you can use the following command to add punctuations to text:

./bin/sherpa-onnx-offline-punctuation \
  --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx \
  "我们都是木头人不会说话不会动"

The output is given below:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./bin/sherpa-onnx-offline-punctuation --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx '我们都是木头人不会说话不会动'

OfflinePunctuationConfig(model=OfflinePunctuationModelConfig(ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx", num_threads=1, debug=False, provider="cpu"))
Creating OfflinePunctuation ...
Started
Done
Num threads: 1
Elapsed seconds: 0.007 s
Input text: 我们都是木头人不会说话不会动
Output text: 我们都是木头人,不会说话不会动。

The second example is for text containing both Chinese and English:

./bin/sherpa-onnx-offline-punctuation \
  --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx \
  "这是一个测试你好吗How are you我很好thank you are you ok谢谢你"

Its output is given below:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./bin/sherpa-onnx-offline-punctuation --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx '这是一个测试你好吗How are you我很好thank you are you ok谢谢你'

OfflinePunctuationConfig(model=OfflinePunctuationModelConfig(ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx", num_threads=1, debug=False, provider="cpu"))
Creating OfflinePunctuation ...
Started
Done
Num threads: 1
Elapsed seconds: 0.005 s
Input text: 这是一个测试你好吗How are you我很好thank you are you ok谢谢你
Output text: 这是一个测试,你好吗?How are you?我很好?thank you,are you ok,谢谢你。

The last example is for text containing only English:

./bin/sherpa-onnx-offline-punctuation \
  --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx \
  "The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry"

Its output is given below:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./bin/sherpa-onnx-offline-punctuation --ct-transformer=./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx 'The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry'

OfflinePunctuationConfig(model=OfflinePunctuationModelConfig(ct_transformer="./sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx", num_threads=1, debug=False, provider="cpu"))
Creating OfflinePunctuation ...
Started
Done
Num threads: 1
Elapsed seconds: 0.003 s
Input text: The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry
Output text: The African blogosphere is rapidly expanding,bringing more voices online in the form of commentaries,opinions,analyses,rants and poetry。

Python API examples

Please see

Huggingface space examples

Please see

Video demos

The following video is in Chinese.