TensorRT acceleration

This page explains how to use the TensorRT engine to accelerate inference for K2 models.

Preparation

First, install TensorRT. We recommend using a Docker container for TRT. Run:

docker run --gpus '"device=0"' -it --rm --net host -v $PWD/:/k2 nvcr.io/nvidia/tensorrt:22.12-py3

Alternatively, you can follow the official TensorRT installation instructions to build TRT directly on your machine.

Note

Use TensorRT version 8.5.3 or newer.

If your TRT version is earlier than 8.5.3, download a newer TensorRT release and run the following command inside the Docker container to use it:

# inside the container
bash tools/install.sh
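
To confirm which TensorRT version is active inside the container, you can query the Python bindings. This is only a quick check and assumes the tensorrt Python package is available in the image (it is in the nvcr.io/nvidia/tensorrt images):

# inside the container: print the active TensorRT version
python3 -c "import tensorrt; print(tensorrt.__version__)"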

Model export

First, prepare the ONNX model by referring here to export your model into ONNX format. Assuming you have put the ONNX model in the $model_dir directory, run:

bash tools/build.sh $model_dir
cp $model_dir/encoder.trt model_repo_offline_fast_beam_trt/encoder/1

The generated TRT engine will be saved to $model_dir/encoder.trt. We also provide an example model_repo for the TRT model. You can follow the same procedure as described here to deploy the pipeline with Triton.
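
If you prefer to build the engine manually instead of using tools/build.sh, the conversion can also be done with the trtexec tool that ships with TensorRT. The sketch below is only an illustration: the input name speech and the shape ranges are assumptions and must be adjusted to the actual inputs of your exported encoder ONNX model.

# hypothetical manual conversion; adjust the input name and shape ranges
# to match your encoder.onnx (batch x frames x feature_dim)
trtexec \
  --onnx=$model_dir/encoder.onnx \
  --saveEngine=$model_dir/encoder.trt \
  --minShapes=speech:1x16x80 \
  --optShapes=speech:32x512x80 \
  --maxShapes=speech:128x1024x80 \
  --fp16
# if the model has more dynamic inputs (e.g. a lengths tensor),
# list them comma-separated in the same --minShapes/--optShapes/--maxShapes flags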

Benchmark for Conformer TRT encoder vs ONNX

Model   Batch size   Avg latency (ms)   QPS
ONNX    1            7.44               134.48
ONNX    8            14.92              536.09
ONNX    16           22.84              700.67
ONNX    32           41.62              768.84
ONNX    64           80.48              795.27
ONNX    128          171.97             744.32
TRT     1            5.21834            193.93
TRT     8            11.7826            703.49
TRT     16           20.4444            815.79
TRT     32           37.583             893.56
TRT     64           69.8312            965.40
TRT     128          139.702            964.57
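
For a rough standalone latency check of the generated engine, you can let trtexec time it directly. Note this is not necessarily the setup used to produce the numbers above, which measure the serving path; the input name and shape below are assumptions and should be replaced with the ones your encoder actually uses.

# hypothetical standalone latency check for the generated engine;
# trtexec prints throughput and latency statistics after the run
trtexec --loadEngine=$model_dir/encoder.trt --shapes=speech:16x512x80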