Offline Punctuation
Add punctuation to unpunctuated text using a CT-Transformer model. This is useful for post-processing ASR output or restoring punctuation in raw text.
Code
// Copyright (c) 2023-2024 Xiaomi Corporation
//
// Add punctuation to text using a CT-Transformer model.
//
// Usage:
//   node offline_punctuation.js
//
const sherpa_onnx = require('sherpa-onnx-node');

// Download models from
// https://github.com/k2-fsa/sherpa-onnx/releases/tag/punctuation-models
function createPunctuation() {
  const config = {
    model: {
      ctTransformer:
          './sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx',
      debug: true,
      numThreads: 1,
      provider: 'cpu',
    },
  };
  return new sherpa_onnx.OfflinePunctuation(config);
}

const punct = createPunctuation();

// Unpunctuated inputs: mixed Chinese-English, Chinese only, English only.
const sentences = [
  '这是一个测试你好吗How are you我很好thank you are you ok谢谢你',
  '我们都是木头人不会说话不会动',
  'The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry',
];

console.log('---');
for (const sentence of sentences) {
  const punct_text = punct.addPunct(sentence);
  console.log(`Input: ${sentence}`);
  console.log(`Output: ${punct_text}`);
  console.log('---');
}
How to run
Install the package:
npm install sherpa-onnx-node
Download the model:
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/punctuation-models/sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2
tar xvf sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2
rm sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2
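If the archive was not extracted into the directory the script is run from, the model path in the config will not resolve. A minimal sketch of a pre-flight check follows, assuming the extracted directory contains model.onnx as the example's config expects. The helper name resolveModelPath is hypothetical and not part of sherpa-onnx-node.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Directory name taken from the download step above.
const MODEL_DIR =
  './sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12';

// Hypothetical helper (not part of sherpa-onnx-node): returns the path to
// model.onnx inside modelDir, or throws if the file does not exist.
function resolveModelPath(modelDir = MODEL_DIR) {
  const modelPath = path.join(modelDir, 'model.onnx');
  if (!fs.existsSync(modelPath)) {
    throw new Error(
      `Model not found at ${modelPath}. Download and extract the archive first.`
    );
  }
  return modelPath;
}
```

The returned path could then be passed as the ctTransformer value when building the config, so a missing download is reported before the model is loaded.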
Set the library path and run:
# macOS
export DYLD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$DYLD_LIBRARY_PATH
# Linux
export LD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$LD_LIBRARY_PATH
node offline_punctuation.js
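A forgotten export is an easy mistake at this step. The sketch below shows one way a script could check whether a given directory is listed in the platform's library search variable. Both helper names (libraryPathVar, isLibDirOnPath) are hypothetical, and only the two platforms named above are covered; this is an illustration, not something sherpa-onnx-node provides.

```javascript
const path = require('node:path');

// Hypothetical helper: name of the dynamic-library search variable
// for the two platforms covered by the steps above.
function libraryPathVar(platform = process.platform) {
  if (platform === 'darwin') return 'DYLD_LIBRARY_PATH';
  if (platform === 'linux') return 'LD_LIBRARY_PATH';
  return null; // other platforms are outside the scope of this sketch
}

// Hypothetical helper: true if libDir is one of the entries in the
// relevant variable of the given environment object.
function isLibDirOnPath(libDir, env = process.env, platform = process.platform) {
  const name = libraryPathVar(platform);
  if (name === null) return false;
  const entries = (env[name] || '').split(path.delimiter).filter(Boolean);
  return entries.some((entry) => path.resolve(entry) === path.resolve(libDir));
}
```

A script could call isLibDirOnPath with the lib directory used in the export command and print a reminder when it returns false.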
Expected output
---
Input: 这是一个测试你好吗How are you我很好thank you are you ok谢谢你
Output: 这是一个测试,你好吗?How are you? 我很好,thank you, are you ok? 谢谢你。
---
Input: 我们都是木头人不会说话不会动
Output: 我们都是木头人,不会说话,不会动。
---
Input: The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry
Output: The African blogosphere is rapidly expanding, bringing more voices online in the form of commentaries, opinions, analyses, rants, and poetry.
---
Notes
The model supports Chinese, English, and mixed Chinese-English text.
addPunct() takes a single string and returns the punctuated version.
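Because addPunct() works on one string at a time, punctuating several transcript segments means calling it once per segment. The sketch below wraps that loop in a small helper, assuming only that the object passed in exposes addPunct(text). The names punctuateAll and fakePunct are hypothetical; fakePunct is a stand-in so the call pattern can be shown without loading a model.

```javascript
// Hypothetical helper: applies addPunct() to each non-empty string.
// `punct` is any object exposing addPunct(text), such as the
// OfflinePunctuation instance created in the example above.
function punctuateAll(punct, texts) {
  return texts.map((text) => {
    const trimmed = text.trim();
    // Skip blank segments rather than sending them to the model.
    return trimmed === '' ? '' : punct.addPunct(trimmed);
  });
}

// Stand-in used only to demonstrate the call pattern; it appends a
// period and performs no real punctuation restoration.
const fakePunct = { addPunct: (text) => `${text}.` };

console.log(punctuateAll(fakePunct, ['hello world', '  ', 'how are you']));
```

With the real model, the same call would be punctuateAll(punct, sentences), using the punct and sentences values from the example script.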