4 times faster image segmentation with TRTorch
At Photoroom, we build photo editing apps. One of our core features is to remove the background of any image. We do so by using deep learning models that will segment an image into different layers.
In our quest to improve the speed and the accuracy of our models, we quickly got into the habit of compiling them with Nvidia's TensorRT. As detailed in the chart below, leveraging TensorRT's optimized GPU kernels through TRTorch can yield a performance boost of up to 10x:
For very low resolutions (160px), the speedup can reach 10x; for larger resolutions, it is up to 1.6x.
This blog post details the necessary steps to optimize your PyTorch model for the fastest inference speed:
Part I: Benchmarking your original model's speed
Part II: Boosting inference speed with TRTorch
For a more complete code example, check the following notebook.
Part I: Benchmarking your model's speed
To evaluate your original model's speed, we will need to ensure we're using the right settings and environment.
1- cuDNN benchmark mode
If your model uses a fixed input size, you can speed up your inference by enabling the cuDNN benchmark mode as follows:
import torch
torch.backends.cudnn.benchmark = True
2- Beware of asynchronous executions
By default, GPU tasks run asynchronously, which means that measuring the inference time as done below will not work:
start = time.time()
prediction = my_model(dummy_input)
end = time.time()
By the time we reach the third line, we have no idea whether the model computations are over or not. More details here. The simplest way to fix this is to force the synchronisation:
start = time.time()
prediction = my_model(dummy_input)
torch.cuda.synchronize()
end = time.time()
3- Use NVIDIA NGC deep learning framework containers
NVIDIA provides out-of-the-box Docker containers equipped with PyTorch and TensorRT that yield better performance, especially in FP16 mode. In our experience, they bring gains of up to 1.5x compared to running the same code outside of the container.
4- Aggregating results over multiple runs
As detailed in this StackOverflow answer, to get the best estimate of your model's runtime you should use the minimum time over several runs instead of the average. Some runs may be slowed down by factors unrelated to the model (e.g. noise).
5- Other tricks
Do not forget to put your model in eval mode.
PyTorch recently introduced an inference_mode. Like no_grad, it can also help reduce inference time and memory usage.
If your model has lots of Conv layers directly followed by BatchNorm layers, you can fuse them beforehand. Depending on the model, this results in a 10-20% speedup. TRTorch does it automatically. A short sketch of these last two tricks follows this list.
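Here is a minimal sketch of these two tricks. The model, resolution and layer names are placeholders: fuse_modules needs the names of the Conv/BatchNorm pairs from your own architecture (here "conv1"/"bn1" from torchvision's ResNet).
import torch
import torchvision

model = torchvision.models.resnet101().cuda()
model.eval()  # disable dropout, use running BatchNorm statistics

# inference_mode (PyTorch >= 1.9) disables autograd bookkeeping entirely
dummy_input = torch.randn((1, 3, 320, 320), device="cuda")
with torch.inference_mode():
    prediction = model(dummy_input)

# Fuse Conv + BatchNorm pairs ahead of time (model must be in eval mode)
fused_model = torch.quantization.fuse_modules(model, [["conv1", "bn1"]])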
A simple benchmarking function would look like:
import time
import torch

def benchmark(model, resolution, dtype, device):
    dummy_input = torch.ones(
        (1, 3, resolution, resolution), dtype=dtype, device=device
    )
    with torch.no_grad():
        # Warm-up runs to let the cuDNN benchmark pick its kernels
        for _ in range(10):
            prediction = model(dummy_input)
        torch.cuda.synchronize()
        # Benchmark
        durations = list()
        for _ in range(100):
            start = time.time()
            prediction = model(dummy_input)
            torch.cuda.synchronize()
            end = time.time()
            durations.append(end - start)
    return min(durations)
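For example, a call for a 320px float32 input on the GPU could look like the following, where my_model stands for your own network already moved to the GPU:
fastest_run = benchmark(my_model, resolution=320, dtype=torch.float32, device="cuda")
print(f"Fastest run: {fastest_run * 1000:.1f} ms")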
Part II: Boosting inference speed with TRTorch
TRTorch is a library built on top of TensorRT, which provides an easy way to compile your PyTorch model into a TensorRT graph. The compilation happens ahead of time, meaning that the optimisations are performed before the first inference run.
The compiled model can then be used through torch.jit as you would for any scripted/traced model in PyTorch. Also, your compiled model can be used directly from C++, as it is independent of Python.
To compile your PyTorch model, you first need to script/trace it. The corresponding TorchScript module is then fed into TRTorch's compiler along with compilation settings such as:
Input shapes for each input
Operation precision (FP32, FP16, INT8)
1- Setting up TRTorch locally
The easiest way to set up TRTorch locally is to use one of the provided Docker images. To do so, you can execute the following commands:
git clone https://github.com/NVIDIA/TRTorch.git
cd TRTorch
sudo docker build -f docker/Dockerfile.21.03 -t trtorch:dev .
sudo docker run \
--gpus all \
-it \
--shm-size=40gb \
--env="DISPLAY" \
-v /home:/home \
--volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
--name=trtorch \
--ipc=host \
--net=host trtorch:dev
The image comes with all the necessary packages installed (torch, torchvision, trtorch, jupyter, etc.).
2- Scripting with TorchScript
TorchScript saves your model's operation graph in a format that can be executed outside of Python.
Note that if your forward pass uses if/else statements, tracing won't work, as it only records the particular flow of operations triggered by the input you provide. For tracing to work, branching in your code must not depend on the input; if it does, you can script the model instead, as sketched after the tracing example below.
import torch
import torchvision

net = torchvision.models.resnet101()
net.eval()
dummy_input = torch.randn((1, 3, 320, 320))
traced_model = torch.jit.trace(net, dummy_input)
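For models that do contain input-dependent branching, torch.jit.script analyses the Python code itself rather than recording one execution, so the control flow is preserved. A minimal sketch, reusing the net defined above:
# Scripting keeps data-dependent if/else branches in the resulting graph
scripted_model = torch.jit.script(net)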
3- Compiling a TorchScript module with TRTorch
import trtorch

trtorch_settings = {
    "inputs": [
        trtorch.Input(
            min_shape=[1, 3, 320, 320],
            opt_shape=[1, 3, 320, 320],
            max_shape=[1, 3, 320, 320],
            dtype=torch.float,
        )
    ],
    "enabled_precisions": {torch.float},
    "debug": True,
}
trtorch_model = trtorch.compile(traced_model, trtorch_settings)
torch.jit.save(trtorch_model, "./my_trtorch_model.ts")
Once your model is compiled, you can use it through torch.jit.load:
my_compiled_model = torch.jit.load("./my_trtorch_model.ts")
dummy_input = torch.randn((1, 3, 320, 320)).cuda()
dummy_prediction = my_compiled_model(dummy_input)
TRTorch also supports FP16, for which you only need to specify:
trtorch_settings = {
    "inputs": [
        trtorch.Input(
            min_shape=[1, 3, 320, 320],
            opt_shape=[1, 3, 320, 320],
            max_shape=[1, 3, 320, 320],
            dtype=torch.half,
        )
    ],
    "enabled_precisions": {torch.half},
    "debug": True,
}
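A minimal sketch of the rest of the FP16 flow, assuming the traced_model from the previous section and that, since the Input dtype is torch.half, the compiled module expects half-precision tensors on the GPU:
# Compile with the FP16 settings defined above
trtorch_fp16_model = trtorch.compile(traced_model, trtorch_settings)

# Feed a half-precision CUDA tensor matching the declared Input dtype
dummy_input = torch.randn((1, 3, 320, 320)).half().cuda()
dummy_prediction = trtorch_fp16_model(dummy_input)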
Using FP16 allows for lower inference time and a lower memory footprint. For some models, this can enable a higher input resolution with the same memory impact.
We ran the benchmarking code from Part I on both the FP32 and FP16 TRTorch models, and obtained the following results:
In FP16, TRTorch provides a significant speedup over PyTorch. Surprisingly, however, we find that TRTorch in FP32 is slower than PyTorch. We therefore strongly recommend FP16 over FP32.
Conclusion
TRTorch is a very efficient way to reduce the inference time of a PyTorch model. Because of the exceptional performance it provides, it is used extensively at Photoroom and across the industry. In upcoming articles, we will cover how to convert custom layers and explore other quantization options such as INT8.