TensorRT acceleration
This page explains how to use TensorRT to accelerate inference for K2 models.
Preparation
First, install TensorRT. We recommend using a Docker container for TRT. Run:
docker run --gpus '"device=0"' -it --rm --net host -v $PWD/:/k2 nvcr.io/nvidia/tensorrt:22.12-py3
You can also follow the official TensorRT installation instructions to build TRT on your machine.
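If you would rather not use Docker, recent TensorRT releases can also be installed from PyPI. This is an optional alternative, assuming a Linux x86_64 machine with a CUDA-capable GPU and TensorRT 8.6 or newer:
# optional: install TensorRT from PyPI instead of using the container
python3 -m pip install tensorrt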
Note
Use TensorRT version 8.5.3 or newer. If the TRT version inside the container is earlier than 8.5.3, download a newer TensorRT release and run the following command inside the Docker container to switch to it:
# inside the container
bash tools/install.sh
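Before building any engines, you can confirm which TensorRT version is active. This generic check is not part of tools/install.sh; it simply queries the Python bindings:
# inside the container
python3 -c "import tensorrt; print(tensorrt.__version__)"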
Model export
First, prepare the ONNX model by referring to the documentation here on exporting your models into ONNX format.
Assume you have put your ONNX model in the $model_dir directory.
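As an optional sanity check, you can verify that the exported encoder loads cleanly before building the TRT engine. The file name encoder.onnx and the use of the onnx Python package (pip install onnx) are assumptions here, not requirements of the build script:
# optional sanity check of the exported encoder (assumes $model_dir/encoder.onnx)
python3 -c "import onnx; m = onnx.load('$model_dir/encoder.onnx'); onnx.checker.check_model(m); print([i.name for i in m.graph.input])"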
Then run the following commands:
bash tools/build.sh $model_dir
cp $model_dir/encoder.trt model_repo_offline_fast_beam_trt/encoder/1
The generated TRT model will be saved into $model_dir/encoder.trt.
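For reference, building the engine by hand with trtexec (which ships with TensorRT) looks roughly like the sketch below. The input tensor names (x, x_lens) and the shape ranges are illustrative assumptions; the actual values used for your model are defined in tools/build.sh:
# rough sketch only -- the real tensor names and shapes come from tools/build.sh
trtexec --onnx=$model_dir/encoder.onnx \
        --minShapes=x:1x16x80,x_lens:1 \
        --optShapes=x:32x512x80,x_lens:32 \
        --maxShapes=x:128x2048x80,x_lens:128 \
        --fp16 \
        --saveEngine=$model_dir/encoder.trt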
We also provide an example model repository for the TRT model. You can follow the same procedure as described here to deploy the pipeline using Triton.
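Once encoder.trt has been copied into the model repository as shown above, the server can be started against that repository. The command below is a minimal sketch; it assumes you run it from a Triton server container and from the directory that contains model_repo_offline_fast_beam_trt:
# inside a Triton server container
tritonserver --model-repository=$PWD/model_repo_offline_fast_beam_trt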
Benchmark for Conformer TRT encoder vs ONNX
| Model | Batch size | Avg latency (ms) | QPS |
|---|---|---|---|
| ONNX | 1 | 7.44 | 134.48 |
| | 8 | 14.92 | 536.09 |
| | 16 | 22.84 | 700.67 |
| | 32 | 41.62 | 768.84 |
| | 64 | 80.48 | 795.27 |
| | 128 | 171.97 | 744.32 |
| TRT | 1 | 5.21834 | 193.93 |
| | 8 | 11.7826 | 703.49 |
| | 16 | 20.4444 | 815.79 |
| | 32 | 37.583 | 893.56 |
| | 64 | 69.8312 | 965.40 |
| | 128 | 139.702 | 964.57 |