4 times faster image segmentation with TRTorch
At Photoroom, we build photo editing apps. One of our core features is to remove the background of any image. We do so by using deep learning models that will segment an image into different layers.
In our quest to improve the speed and the accuracy of our models, we quickly got into the habit of compiling them with Nvidia's TensorRT. As detailed in the chart below, leveraging TensorRT's optimized GPU kernels through TRTorch can yield a performance boost of up to 10x:
For very low resolutions (160px), the speedup can reach 10x; for larger resolutions, it is up to 1.6x.
This blog post details the necessary steps to optimize your PyTorch model for the fastest inference speed:
Part I: Benchmarking your original model's speed
Part II: Boosting inference speed with TRTorch
For a more complete code example, check the following notebook.
Part I: Benchmarking your model's speed
To evaluate your original model's speed, we will need to ensure we're using the right settings and environment.
1- cuDNN benchmark mode
If your model uses a fixed input size, you can speed up your inference by enabling the cuDNN benchmark mode as follows:
import torch
torch.backends.cudnn.benchmark = True
2- Beware of asynchronous executions
By default, GPU tasks run asynchronously, which means that measuring the inference time as done below will not work:
start = time.time()
prediction = my_model(dummy_input)
end = time.time()
By the time we reach the third line, we have no idea whether the model computations are over or not. More details here. The simplest way to fix this is to force the synchronisation:
start = time.time()
prediction = my_model(dummy_input)
torch.cuda.synchronize()
end = time.time()
3- Use NVIDIA NGC deep learning framework containers
NVIDIA provides out-of-the-box Docker containers equipped with PyTorch and TensorRT that yield better performance, especially in FP16 mode. In our experience, they bring gains of up to 1.5x compared to running the same code outside of the container.
4- Aggregating results over multiple runs
As detailed in this StackOverflow answer, to get the best estimate of your model's runtime you should use the minimum time over several runs instead of the average. Some runs may be slowed down by factors unrelated to the model (e.g. noise).
5- Other tricks
Do not forget to put your model in eval mode.
PyTorch recently introduced an inference_mode. Like no_grad, it can also help reduce inference time and memory usage.
If your model has lots of Conv layers directly followed by BatchNorm layers, you can fuse them beforehand. Depending on the model, this results in a 10-20% speedup. TRTorch does it automatically. A short sketch of these last two tricks follows this list.
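Here is a minimal sketch of these two tricks. The model, resolution and layer names are placeholders: fuse_modules needs the names of the Conv/BatchNorm pairs from your own architecture (here "conv1"/"bn1" from torchvision's ResNet).
import torch
import torchvision

model = torchvision.models.resnet101().cuda()
model.eval()  # disable dropout, use running BatchNorm statistics

# inference_mode (PyTorch >= 1.9) disables autograd bookkeeping entirely
dummy_input = torch.randn((1, 3, 320, 320), device="cuda")
with torch.inference_mode():
    prediction = model(dummy_input)

# Fuse Conv + BatchNorm pairs ahead of time (model must be in eval mode)
fused_model = torch.quantization.fuse_modules(model, [["conv1", "bn1"]])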
A simple benchmarking function would look like:
import time
import torch

def benchmark(model, resolution, dtype, device):
    dummy_input = torch.ones(
        (1, 3, resolution, resolution), dtype=dtype, device=device
    )
    with torch.no_grad():
        # Warm-up runs to let the cuDNN benchmark pick its kernels
        for _ in range(10):
            prediction = model(dummy_input)
        torch.cuda.synchronize()
        # Benchmark
        durations = list()
        for _ in range(100):
            start = time.time()
            prediction = model(dummy_input)
            torch.cuda.synchronize()
            end = time.time()
            durations.append(end - start)
    return min(durations)
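For example, a call for a 320px float32 input on the GPU could look like the following, where my_model stands for your own network already moved to the GPU:
fastest_run = benchmark(my_model, resolution=320, dtype=torch.float32, device="cuda")
print(f"Fastest run: {fastest_run * 1000:.1f} ms")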
Part II: Boosting inference speed with TRTorch
TRTorch is a library built on top of TensorRT, which provides an easy way to compile your PyTorch model into a TensorRT graph. The compilation happens ahead of time, meaning that the optimisations are performed before the first inference run.
The compiled model can then be used through torch.jit as you would for any scripted/traced model in PyTorch. Also, your compiled model can be used directly from C++, as it is independent of Python.
To compile your PyTorch model, you first need to script/trace it. The corresponding TorchScript module is then fed into TRTorch's compiler along with compilation settings such as:
Input shapes for each input
Operation precision (FP32, FP16, INT8)
1- Setting up TRTorch locally
The easiest way to set up TRTorch locally is to use one of the provided Docker images. To do so, you can execute the following commands:
git clone https://github.com/NVIDIA/TRTorch.git
cd TRTorch
sudo docker build -f docker/Dockerfile.21.03 -t trtorch:dev .
sudo docker run \
--gpus all \
-it \
--shm-size=40gb \
--env="DISPLAY" \
-v /home:/home \
--volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
--name=trtorch \
--ipc=host \
--net=host trtorch:dev
The image comes with all the necessary packages installed (torch, torchvision, trtorch, jupyter, etc.).
2- Scripting with TorchScript
TorchScript saves your model's operation graph in a format that can be executed outside of Python.
Note that if your forward pass uses if/else statements, tracing won't work, as it only records the particular flow of operations triggered by the input you provide. For tracing to work, branching in your code must not depend on the input; if it does, you can script the model instead, as sketched after the tracing example below.
import torch
import torchvision

net = torchvision.models.resnet101()
net.eval()
dummy_input = torch.randn((1, 3, 320, 320))
traced_model = torch.jit.trace(net, dummy_input)
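For models that do contain input-dependent branching, torch.jit.script analyses the Python code itself rather than recording one execution, so the control flow is preserved. A minimal sketch, reusing the net defined above:
# Scripting keeps data-dependent if/else branches in the resulting graph
scripted_model = torch.jit.script(net)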
3- Compiling a TorchScript module with TRTorch
import trtorch

trtorch_settings = {
    "inputs": [
        trtorch.Input(
            min_shape=[1, 3, 320, 320],
            opt_shape=[1, 3, 320, 320],
            max_shape=[1, 3, 320, 320],
            dtype=torch.float,
        )
    ],
    "enabled_precisions": {torch.float},
    "debug": True,
}
trtorch_model = trtorch.compile(traced_model, trtorch_settings)
torch.jit.save(trtorch_model, "./my_trtorch_model.ts")
Once your model is compiled, you can use it through torch.jit.load:
my_compiled_model = torch.jit.load("./my_trtorch_model.ts")
dummy_input = torch.randn((1, 3, 320, 320)).cuda()
dummy_prediction = my_compiled_model(dummy_input)
TRTorch also supports FP16, for which you only need to specify:
trtorch_settings = {
    "inputs": [
        trtorch.Input(
            min_shape=[1, 3, 320, 320],
            opt_shape=[1, 3, 320, 320],
            max_shape=[1, 3, 320, 320],
            dtype=torch.half,
        )
    ],
    "enabled_precisions": {torch.half},
    "debug": True,
}
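A minimal sketch of the rest of the FP16 flow, assuming the traced_model from the previous section and that, since the Input dtype is torch.half, the compiled module expects half-precision tensors on the GPU:
# Compile with the FP16 settings defined above
trtorch_fp16_model = trtorch.compile(traced_model, trtorch_settings)

# Feed a half-precision CUDA tensor matching the declared Input dtype
dummy_input = torch.randn((1, 3, 320, 320)).half().cuda()
dummy_prediction = trtorch_fp16_model(dummy_input)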
Using FP16 allows for lower inference time and a lower memory footprint. For some models, this can enable a higher input resolution with the same memory impact.
We ran the benchmarking code from Part I on both the FP32 and FP16 TRTorch models, and obtained the following results:
In FP16, TRTorch provides a significant speedup over PyTorch. Surprisingly, however, we find that TRTorch in FP32 is slower than PyTorch. We therefore strongly recommend FP16 over FP32.
Conclusion
TRTorch is a very efficient way to reduce the inference time of a PyTorch model. Because of the exceptional performance it provides, it is used extensively at Photoroom and across the industry. In upcoming articles, we will cover how to convert custom layers and explore other quantization options such as INT8.