TensorRT acceleration
This page shows how to use a TensorRT engine to accelerate inference for k2 models.
Preparation
First of all, you have to install TensorRT. We suggest using a docker container to run TRT. Just run the following command:
docker run --gpus '"device=0"' -it --rm --net host -v $PWD/:/k2 nvcr.io/nvidia/tensorrt:22.12-py3
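If you like, you can quickly check that the GPU is visible inside the container before continuing:

# inside the container: the GPU selected via --gpus should be listed
nvidia-smi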
You can also see here if you prefer to build TRT on your own machine.
Note
Please note that the TRT version must be >= 8.5.3.
If your TRT version is < 8.5.3, you can download the desired TRT version and then run the following command inside the docker container to use the TRT you just downloaded:
# inside the container
bash tools/install.sh
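Whichever TRT you end up with, it is worth confirming the version from inside the container before building any engine. A quick check, assuming the TensorRT Python bindings are installed (they ship with the NGC image):

# should print a version >= 8.5.3
python3 -c "import tensorrt; print(tensorrt.__version__)"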
Model export
First, prepare the ONNX model by following here to export your models into ONNX format. Assume you have put the exported ONNX model in the $model_dir directory.
Then, just run the following commands:
bash tools/build.sh $model_dir
cp $model_dir/encoder.trt model_repo_offline_fast_beam_trt/encoder/1
The generated TRT engine will be saved to $model_dir/encoder.trt.
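If you want to see roughly what such a build step involves, or to experiment with build options yourself, TensorRT's trtexec tool can convert an ONNX file into an engine. The sketch below is only illustrative and is not the exact command used by tools/build.sh; it assumes the exported file is $model_dir/encoder.onnx, and the input tensor name x and the batch x time x feature shapes are hypothetical placeholders that you would replace with the real names and shapes of your exported encoder:

# illustrative only: convert the exported ONNX encoder into a TRT engine
# "x" and the NxTxF shapes below are placeholders for your real input name/shapes
trtexec --onnx=$model_dir/encoder.onnx \
        --saveEngine=$model_dir/encoder.trt \
        --fp16 \
        --minShapes=x:1x16x80 \
        --optShapes=x:16x512x80 \
        --maxShapes=x:128x2000x80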
We also provide an example model_repo for the TRT model. You can follow the same procedure as described here to deploy the pipeline using Triton.
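The example model_repo already contains the config.pbtxt files, so normally you do not have to write one yourself. For orientation only, the key fields for serving a TRT engine from Triton look roughly like the sketch below; max_batch_size is an illustrative value, and default_model_filename is set because the engine file is named encoder.trt rather than Triton's default model.plan:

# sketch of model_repo_offline_fast_beam_trt/encoder/config.pbtxt (values illustrative)
name: "encoder"
platform: "tensorrt_plan"
default_model_filename: "encoder.trt"
max_batch_size: 64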
Benchmark for Conformer TRT encoder vs ONNX
| Model | Batch size | Avg latency (ms) | QPS |
|---|---|---|---|
| ONNX | 1 | 7.44 | 134.48 |
|  | 8 | 14.92 | 536.09 |
|  | 16 | 22.84 | 700.67 |
|  | 32 | 41.62 | 768.84 |
|  | 64 | 80.48 | 795.27 |
|  | 128 | 171.97 | 744.32 |
| TRT | 1 | 5.21834 | 193.93 |
|  | 8 | 11.7826 | 703.49 |
|  | 16 | 20.4444 | 815.79 |
|  | 32 | 37.583 | 893.56 |
|  | 64 | 69.8312 | 965.40 |
|  | 128 | 139.702 | 964.57 |