Source separation models

This page lists the source separation models supported in sherpa-onnx.

We only describe some of the models. You can find ALL models at the following address:

Spleeter

It is from https://github.com/deezer/spleeter.

We only support the 2-stem model at present.

Hint

For those who want to learn how to convert the PyTorch checkpoint to the model supported in sherpa-onnx, please see the scripts at the following address:

Three variants of the 2-stem model are given below:

Model                                       Comment
sherpa-onnx-spleeter-2stems.tar.bz2         No quantization
sherpa-onnx-spleeter-2stems-int8.tar.bz2    int8 quantization
sherpa-onnx-spleeter-2stems-fp16.tar.bz2    fp16 quantization

We describe how to use the fp16 quantized model below. The same steps also apply to the other variants.

Download the model

Please use the following commands to download it:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/sherpa-onnx-spleeter-2stems-fp16.tar.bz2
tar xvf sherpa-onnx-spleeter-2stems-fp16.tar.bz2
rm sherpa-onnx-spleeter-2stems-fp16.tar.bz2
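
If wget is not available, curl can be used instead; -L follows the redirect of the GitHub release URL:

curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/sherpa-onnx-spleeter-2stems-fp16.tar.bz2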

ls -lh sherpa-onnx-spleeter-2stems-fp16

You should see the following output:

$ ls -lh sherpa-onnx-spleeter-2stems-fp16/

total 76880
-rw-r--r--  1 fangjun  staff    19M May 23 15:27 accompaniment.fp16.onnx
-rw-r--r--  1 fangjun  staff    19M May 23 15:27 vocals.fp16.onnx
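
The other variants listed in the table above can be downloaded in the same way, assuming they are published under the same release tag. For example, for the int8 model:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/sherpa-onnx-spleeter-2stems-int8.tar.bz2
tar xvf sherpa-onnx-spleeter-2stems-int8.tar.bz2
rm sherpa-onnx-spleeter-2stems-int8.tar.bz2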

Download test files

We use the following two test wave files:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/qi-feng-le-zh.wav

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/audio_example.wav
ls -lh audio_example.wav qi-feng-le-zh.wav

-rw-r--r--@ 1 fangjun  staff   1.8M May 23 15:59 audio_example.wav
-rw-r--r--@ 1 fangjun  staff   4.4M May 23 22:06 qi-feng-le-zh.wav

Hint

To make things easier, we support only *.wav files. If you have other formats, e.g., *.mp3, *.mp4, or *.mov, you can use

ffmpeg -i your.mp3 -vn -acodec pcm_s16le -ar 44100 -ac 2 your.wav
ffmpeg -i your.mp4 -vn -acodec pcm_s16le -ar 44100 -ac 2 your.wav
ffmpeg -i your.mov -vn -acodec pcm_s16le -ar 44100 -ac 2 your.wav

to convert them to *.wav files.
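
If you have many files to convert, a small shell loop does the job, e.g., for *.mp3 files:

# Convert every *.mp3 in the current directory to 16-bit, 44.1 kHz, stereo *.wav
for f in *.mp3; do
  ffmpeg -i "$f" -vn -acodec pcm_s16le -ar 44100 -ac 2 "${f%.mp3}.wav"
done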

The downloaded test files are given below.

qi-feng-le-zh.wav
audio_example.wav

Example 1/2 with qi-feng-le-zh.wav

./build/bin/sherpa-onnx-offline-source-separation \
  --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx \
  --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx \
  --num-threads=1 \
  --input-wav=./qi-feng-le-zh.wav \
  --output-vocals-wav=spleeter_qi_feng_le_vocals.wav \
  --output-accompaniment-wav=spleeter_qi_feng_le_non_vocals.wav

Output logs are given below:

OfflineSourceSeparationConfig(model=OfflineSourceSeparationModelConfig(spleeter=OfflineSourceSeparationSpleeterModelConfig(vocals="sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx", accompaniment="sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx"), uvr=OfflineSourceSeparationUvrModelConfig(model=""), num_threads=1, debug=False, provider="cpu"))
Started
Done
Saved to write to 'spleeter_qi_feng_le_vocals.wav' and 'spleeter_qi_feng_le_non_vocals.wav'
num threads: 1
Elapsed seconds: 2.052 s
Real time factor (RTF): 2.052 / 26.102 = 0.079

Hint

Pay special attention to the RTF, i.e., the processing time divided by the audio duration. It is super fast on CPU with only 1 thread!

The resulting files, together with the input, are given below:

qi-feng-le-zh.wav
spleeter_qi_feng_le_vocals.wav
spleeter_qi_feng_le_non_vocals.wav
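
To verify the format of the generated wave files (sample rate, channels, duration), you can use ffprobe, assuming ffmpeg is installed:

ffprobe -hide_banner spleeter_qi_feng_le_vocals.wav
ffprobe -hide_banner spleeter_qi_feng_le_non_vocals.wav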

Example 2/2 with audio_example.wav

./build/bin/sherpa-onnx-offline-source-separation \
  --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx \
  --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx \
  --num-threads=1 \
  --input-wav=./audio_example.wav \
  --output-vocals-wav=spleeter_audio_example_vocals.wav \
  --output-accompaniment-wav=spleeter_audio_example_non_vocals.wav

Output logs are given below:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:372 ./build/bin/sherpa-onnx-offline-source-separation --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx --num-threads=1 --input-wav=./audio_example.wav --output-vocals-wav=spleeter_audio_example_vocals.wav --output-accompaniment-wav=spleeter_audio_example_non_vocals.wav

OfflineSourceSeparationConfig(model=OfflineSourceSeparationModelConfig(spleeter=OfflineSourceSeparationSpleeterModelConfig(vocals="sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx", accompaniment="sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx"), uvr=OfflineSourceSeparationUvrModelConfig(model=""), num_threads=1, debug=False, provider="cpu"))
Started
Done
Saved to write to 'spleeter_audio_example_vocals.wav' and 'spleeter_audio_example_non_vocals.wav'
num threads: 1
Elapsed seconds: 0.787 s
Real time factor (RTF): 0.787 / 10.919 = 0.072

Hint

Pay special attention to the RTF. It is super fast on CPU with only 1 thread!

The resulting files, together with the input, are given below:

audio_example.wav
spleeter_audio_example_vocals.wav
spleeter_audio_example_non_vocals.wav

RTF on RK3588

We use the following commands to test the RTF of Spleeter on RK3588 with its Cortex-A76 CPU cores. taskset pins the process to the A76 cores (CPUs 4-7 on RK3588): 0x80 selects CPU 7, 0xc0 CPUs 6-7, 0xe0 CPUs 5-7, and 0xf0 CPUs 4-7.

# 1 thread
taskset 0x80  ./build/bin/sherpa-onnx-offline-source-separation \
  --num-threads=1 \
  --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx \
  --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx \
  --input-wav=./qi-feng-le-zh.wav \
  --output-vocals-wav=spleeter_qi_feng_le_vocals.wav \
  --output-accompaniment-wav=spleeter_qi_feng_le_non_vocals.wav

# 2 threads
taskset 0xc0  ./build/bin/sherpa-onnx-offline-source-separation \
  --num-threads=2 \
  --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx \
  --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx \
  --input-wav=./qi-feng-le-zh.wav \
  --output-vocals-wav=spleeter_qi_feng_le_vocals.wav \
  --output-accompaniment-wav=spleeter_qi_feng_le_non_vocals.wav

# 3 threads
taskset 0xe0  ./build/bin/sherpa-onnx-offline-source-separation \
  --num-threads=3 \
  --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx \
  --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx \
  --input-wav=./qi-feng-le-zh.wav \
  --output-vocals-wav=spleeter_qi_feng_le_vocals.wav \
  --output-accompaniment-wav=spleeter_qi_feng_le_non_vocals.wav

# 4 threads
taskset 0xf0  ./build/bin/sherpa-onnx-offline-source-separation \
  --num-threads=4 \
  --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx \
  --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx \
  --input-wav=./qi-feng-le-zh.wav \
  --output-vocals-wav=spleeter_qi_feng_le_vocals.wav \
  --output-accompaniment-wav=spleeter_qi_feng_le_non_vocals.wav
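
The four runs can also be written as a single loop. Below is a bash sketch, assuming the Cortex-A76 cores are CPUs 4-7 as on a stock RK3588:

# Sweep 1 to 4 threads, pinning the process to the matching number of A76 cores
masks=(0x80 0xc0 0xe0 0xf0)
for n in 1 2 3 4; do
  taskset "${masks[n-1]}" ./build/bin/sherpa-onnx-offline-source-separation \
    --num-threads=$n \
    --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx \
    --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx \
    --input-wav=./qi-feng-le-zh.wav \
    --output-vocals-wav=spleeter_qi_feng_le_vocals.wav \
    --output-accompaniment-wav=spleeter_qi_feng_le_non_vocals.wav
done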

The results are given below:

num_threads              1       2       3       4
RTF on Cortex-A76 CPU    0.258   0.176   0.138   0.127

Python example

Please see the Python example at the following address:

UVR

It is from https://github.com/TRvlvr/model_repo/releases/tag/all_public_uvr_models.

Hint

For those who want to learn how to add metadata to the original ONNX models, please see the scripts at the following address:

We support the following UVR models for source separation.

Model                           File size (MB)
UVR-MDX-NET-Inst_1.onnx         63.7
UVR-MDX-NET-Inst_2.onnx         63.7
UVR-MDX-NET-Inst_3.onnx         63.7
UVR-MDX-NET-Inst_HQ_1.onnx      63.7
UVR-MDX-NET-Inst_HQ_2.onnx      63.7
UVR-MDX-NET-Inst_HQ_3.onnx      63.7
UVR-MDX-NET-Inst_HQ_4.onnx      56.3
UVR-MDX-NET-Inst_HQ_5.onnx      56.3
UVR-MDX-NET-Inst_Main.onnx      50.3
UVR-MDX-NET-Voc_FT.onnx         63.7
UVR-MDX-NET_Crowd_HQ_1.onnx     56.3
UVR_MDXNET_1_9703.onnx          28.3
UVR_MDXNET_2_9682.onnx          28.3
UVR_MDXNET_3_9662.onnx          28.3
UVR_MDXNET_9482.onnx            28.3
UVR_MDXNET_KARA.onnx            28.3
UVR_MDXNET_KARA_2.onnx          50.3
UVR_MDXNET_Main.onnx            63.7

In the following, we show how to use the model UVR_MDXNET_9482.onnx.

Download the model

Please use the following commands to download it:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/UVR_MDXNET_9482.onnx

ls -lh UVR_MDXNET_9482.onnx

You should see the following output:

$ ls -lh UVR_MDXNET_9482.onnx

-rw-r--r--  1 fangjun  staff    28M May 31 13:33 UVR_MDXNET_9482.onnx
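
The other models in the table above can be downloaded in the same way, assuming they are published under the same release tag. For example:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/UVR-MDX-NET-Voc_FT.onnx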

Download test files

We use the following two test wave files:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/qi-feng-le-zh.wav

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/audio_example.wav
ls -lh audio_example.wav qi-feng-le-zh.wav

-rw-r--r--@ 1 fangjun  staff   1.8M May 23 15:59 audio_example.wav
-rw-r--r--@ 1 fangjun  staff   4.4M May 23 22:06 qi-feng-le-zh.wav

Hint

To make things easier, we support only *.wav files. If you have other formats, e.g., *.mp3, *.mp4, or *.mov, you can use

ffmpeg -i your.mp3 -vn -acodec pcm_s16le -ar 44100 -ac 2 your.wav
ffmpeg -i your.mp4 -vn -acodec pcm_s16le -ar 44100 -ac 2 your.wav
ffmpeg -i your.mov -vn -acodec pcm_s16le -ar 44100 -ac 2 your.wav

to convert them to *.wav files.

The downloaded test files are given below.

qi-feng-le-zh.wav
audio_example.wav

Example 1/2 with qi-feng-le-zh.wav

./build/bin/sherpa-onnx-offline-source-separation \
  --num-threads=1 \
  --uvr-model=./UVR_MDXNET_9482.onnx \
  --input-wav=./qi-feng-le-zh.wav \
  --output-vocals-wav=uvr_qi_feng_le_vocals.wav \
  --output-accompaniment-wav=uvr_qi_feng_le_non_vocals.wav

Output logs are given below:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:372 ./build/bin/sherpa-onnx-offline-source-separation --num-threads=1 --uvr-model=./UVR_MDXNET_9482.onnx --input-wav=./qi-feng-le-zh.wav --output-vocals-wav=uvr_qi_feng_le_vocals.wav --output-accompaniment-wav=uvr_qi_feng_le_non_vocals.wav

OfflineSourceSeparationConfig(model=OfflineSourceSeparationModelConfig(spleeter=OfflineSourceSeparationSpleeterModelConfig(vocals="", accompaniment=""), uvr=OfflineSourceSeparationUvrModelConfig(model="./UVR_MDXNET_9482.onnx"), num_threads=1, debug=False, provider="cpu"))
Started
Done
Saved to write to 'uvr_qi_feng_le_vocals.wav' and 'uvr_qi_feng_le_non_vocals.wav'
num threads: 1
Elapsed seconds: 19.110 s
Real time factor (RTF): 19.110 / 26.102 = 0.732

Hint

It is about 10x slower than Spleeter! Note that we have selected a small model here; a model with more parameters is even slower.
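
If speed matters more than CPU usage, you can raise --num-threads; as the RK3588 table above shows for Spleeter, the speed-up is sublinear. For example:

./build/bin/sherpa-onnx-offline-source-separation \
  --num-threads=4 \
  --uvr-model=./UVR_MDXNET_9482.onnx \
  --input-wav=./qi-feng-le-zh.wav \
  --output-vocals-wav=uvr_qi_feng_le_vocals.wav \
  --output-accompaniment-wav=uvr_qi_feng_le_non_vocals.wav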

The resulting files, together with the input, are given below:

qi-feng-le-zh.wav
uvr_qi_feng_le_vocals.wav
uvr_qi_feng_le_non_vocals.wav

Example 2/2 with audio_example.wav

./build/bin/sherpa-onnx-offline-source-separation \
  --num-threads=1 \
  --uvr-model=./UVR_MDXNET_9482.onnx \
  --input-wav=./audio_example.wav \
  --output-vocals-wav=uvr_audio_example_vocals.wav \
  --output-accompaniment-wav=uvr_audio_example_non_vocals.wav

Output logs are given below:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:372 ./build/bin/sherpa-onnx-offline-source-separation --num-threads=1 --uvr-model=./UVR_MDXNET_9482.onnx --input-wav=./audio_example.wav --output-vocals-wav=uvr_audio_example_vocals.wav --output-accompaniment-wav=uvr_audio_example_non_vocals.wav

OfflineSourceSeparationConfig(model=OfflineSourceSeparationModelConfig(spleeter=OfflineSourceSeparationSpleeterModelConfig(vocals="", accompaniment=""), uvr=OfflineSourceSeparationUvrModelConfig(model="./UVR_MDXNET_9482.onnx"), num_threads=1, debug=False, provider="cpu"))
Started
Done
Saved to write to 'uvr_audio_example_vocals.wav' and 'uvr_audio_example_non_vocals.wav'
num threads: 1
Elapsed seconds: 6.420 s
Real time factor (RTF): 6.420 / 10.919 = 0.588

The resulting files, together with the input, are given below:

audio_example.wav
uvr_audio_example_vocals.wav
uvr_audio_example_non_vocals.wav

Python example

Please see the Python example at the following address: