NVIDIA TensorRT is an AI inference library built to optimize machine learning models for deployment on NVIDIA GPUs. TensorRT targets dedicated hardware in modern architectures, such as NVIDIA Blackwell Tensor Cores, to accelerate common operations found in advanced machine learning models. It can also modify AI models to run more efficiently on specific hardware by using optimization techniques such as layer fusion and automatic kernel tactic selection.
Popular frameworks such as PyTorch provide an intuitive and consistent interface for working with AI models; however, they can’t always achieve peak performance. Torch-TensorRT bridges this gap. It’s a powerful compiler for PyTorch models, delivering TensorRT-level performance on NVIDIA GPUs while maintaining PyTorch-level usability. It enables you to double performance over native PyTorch without requiring changes to your existing PyTorch code.
In this blog post, we’ll show how Torch-TensorRT makes optimization straightforward, unlocking significant acceleration with minimal code changes. With just a single line of code, FLUX.1-dev, a 12-billion-parameter rectified flow transformer, runs 1.5x faster than native PyTorch FP16. Applying a simple FP8 quantization procedure on top of that increases the speedup to 2.4x.
We’ll also show how to use Torch-TensorRT to support advanced diffusers workflows, such as low-rank adaptation (LoRA) through on-the-fly model refit.
Model acceleration
Hugging Face Diffusers is a library that gives developers easy access to a wide variety of advanced models. It also supports many advanced use cases, such as fine-tuning and LoRA, for customizing off-the-shelf models.
Typically, model optimization is a trade-off between ease of use and performance. Many optimization workflows require exporting models out of PyTorch into a third-party format. This labor-intensive process makes it difficult to build complex workflows around AI models, such as supporting multiple GPU models or modifying weights at runtime, as when using LoRA.
Torch-TensorRT optimizes critical components of the Diffusers pipeline without intermediate steps and with very little code. If you change a part of the pipeline, add a ControlNet, or load a LoRA, no additional work is required: new weights are refitted in real time, whereas existing workflows may require you to manually re-export and re-optimize your model outside the workflow.
As an example, we use FLUX.1-dev to demonstrate the effectiveness of Torch-TensorRT acceleration and an easily accessible integration workflow. FLUX.1-dev can be pulled from Hugging Face and run as follows:
Code: Load the FLUX.1-dev pipeline from Hugging Face
import torch
import torch_tensorrt
from diffusers import FluxPipeline

DEVICE = "cuda:0"
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16,
)
pipe.to(DEVICE).to(torch.float16)
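The unmodified pipeline can already generate images, which gives us the native PyTorch FP16 baseline used later in this post. The prompt and batch size below are illustrative placeholders that the remaining snippets reuse:

prompt = "a photo of an astronaut riding a horse on mars"  # illustrative prompt
batch_size = 2  # matches the batch size used in the benchmarks below

images = pipe(
    prompt,
    output_type="pil",
    num_inference_steps=20,
    num_images_per_prompt=batch_size,
).images  # native PyTorch FP16 baseline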
NVIDIA HGX B200 GPUs can run models out of the box with enough performance for many low-latency applications; however, with a simple extra step, latency can be significantly reduced, thereby further enhancing the user experience.
One-line optimization with a mutable Torch-TensorRT module
Torch-TensorRT enhances model performance over standard PyTorch by optimizing the model and generating a specific TensorRT engine for the GPU the application will be deployed on. TensorRT achieves this through techniques such as layer fusion and kernel auto-tuning, which maximize throughput and minimize latency.
For static computational graphs, TensorRT provides strong acceleration and enables a wide variety of deployment scenarios. However, integrating it with applications involving dynamic weights, graphs, or third-party APIs like diffusers can require additional development work. Torch-TensorRT makes these dynamic use cases simpler with the Mutable Torch-TensorRT Module (MTTM).
MTTM is designed as a transparent wrapper for PyTorch modules, with the added behavior of optimizing the forward function on the fly using TensorRT. It maintains all the functionality of the source PyTorch model, enabling seamless integration of Torch-TensorRT acceleration into complex systems like Hugging Face pipelines. As such, workflows that modify the pipeline at runtime, such as inserting a LoRA adapter, work without any additional code changes.
As graph or weight changes are detected, the module automatically adjusts by refitting or recompiling the forward function. Moreover, unlike existing just-in-time (JIT) workflows, the MTTM is serializable, enabling a hybrid approach between ahead-of-time (AOT) and JIT compilation. Developers can ship a precompiled MTTM, and if conditions change at runtime, refitting or recompilation happens on the fly.
pipe.transformer = torch_tensorrt.MutableTorchTensorRTModule(
    pipe.transformer,
    strict=False,
    allow_complex_guards_as_runtime_asserts=True,
    enabled_precisions={torch.float16},
    truncate_double=True,
    immutable_weights=False,
    offload_module_to_cpu=True,
)
images = pipe(
    prompt,
    output_type="pil",
    num_inference_steps=20,
    num_images_per_prompt=batch_size,
).images  # Compilation starts the first time the module is called
On first execution, MTTM automatically captures the input patterns and replicates all the behaviors of the original module. As a result, users aren’t required to supply fake inputs or manage attributes (such as device or configuration) between the PyTorch and TensorRT modules. After compilation, the optimized module can be serialized and saved to disk with:
torch_tensorrt.MutableTorchTensorRTModule.save(pipe.transformer, "mutable_module.pt2")  # Save the module to disk
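The saved module can be restored in a later session so the expensive compilation step doesn’t have to be repeated. A minimal sketch, assuming the matching load helper on MutableTorchTensorRTModule:

# Restore the precompiled module and drop it back into the pipeline
pipe.transformer = torch_tensorrt.MutableTorchTensorRTModule.load("mutable_module.pt2")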
Transparently supporting LoRAs
One of the most popular advanced workflows for image generation applications is using LoRAs to customize the outputs of models like FLUX. When users want to make targeted adjustments to the FLUX.1-dev model weights—for example, to generate images in different artistic styles—they typically load distinct LoRA modules. However, switching LoRAs can be challenging for model optimizers as weight updates generally require recompilation, a time-consuming process that can’t be performed within the same runtime session.
Torch-TensorRT overcomes this limitation through weight refitting, enabling LoRA switches within the same runtime without recompilation. Weight refitting significantly shortens the turnaround time when weights change, which improves the responsiveness of generative AI applications.
Using MTTM, the refitting process is handled seamlessly under the hood. Users can simply apply Hugging Face’s load_lora_weights API to load the desired LoRA into the pipeline, and MTTM automatically detects the weight changes. On subsequent execution, MTTM performs the necessary refitting internally, requiring no additional user action.
# Standard Hugging Face LoRA loading procedure
# `path` points to a LoRA checkpoint on the Hugging Face Hub or local disk
pipe.load_lora_weights(path, adapter_name="lora1")
pipe.set_adapters(["lora1"], adapter_weights=[1])
pipe.fuse_lora()
pipe.unload_lora_weights()

images = pipe(
    prompt,
    output_type="pil",
    num_inference_steps=20,
    num_images_per_prompt=batch_size,
).images  # Refitting happens here
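Switching to a different style later in the same session follows the same pattern; the second adapter name and path below are illustrative placeholders, and depending on your diffusers version you may need to call unfuse_lora() first to undo the previously fused adapter:

# Swap in a different LoRA; MTTM detects the new weights in the transformer
pipe.load_lora_weights(other_path, adapter_name="lora2")  # other_path is a placeholder
pipe.set_adapters(["lora2"], adapter_weights=[1])
pipe.fuse_lora()
pipe.unload_lora_weights()

images = pipe(
    prompt,
    output_type="pil",
    num_inference_steps=20,
    num_images_per_prompt=batch_size,
).images  # Refit happens here, still without recompilation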
Quantization
To further optimize the FLUX.1-dev model to run on smaller GPUs, we apply quantization techniques to improve inference performance while reducing both model size and GPU memory consumption by converting weights and activations from standard 16-bit floating-point representations to lower-precision formats such as 8-bit floating point (fp8).
Model quantization is performed with NVIDIA TensorRT Model Optimizer (nvidia-modelopt), a comprehensive library offering advanced model optimization techniques, including quantization, pruning, distillation, and speculative decoding. This tool efficiently compresses deep learning models for downstream deployment on frameworks such as TensorRT-LLM or TensorRT, maximizing inference speed.
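Under the hood, Model Optimizer exposes a generic mtq.quantize entry point that takes the module, a quantization config, and a calibration forward loop; the helper used in the next snippet wraps similar steps for Diffusers pipelines. A minimal sketch, with illustrative calibration prompts and step count:

import modelopt.torch.quantization as mtq

def forward_loop(transformer):
    # The module being quantized is passed in, but we calibrate by running
    # the full pipeline so activation ranges are captured before FP8 quantization
    for calib_prompt in ["a photo of a cat", "a watercolor mountain landscape"]:
        pipe(calib_prompt, num_inference_steps=4)

mtq.quantize(pipe.transformer, mtq.FP8_DEFAULT_CONFIG, forward_loop)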
After the model is quantized to the target precision, we can use the same Torch-TensorRT compilation procedure and add the target quantization precision to the enabled precisions:
import modelopt.torch.quantization as mtq  # NVIDIA TensorRT Model Optimizer

quantized_transformer = mtq.diffusers.quantize_diffusers_module(pipe, "transformer", mtq.FP8_DEFAULT_CONFIG)
pipe.transformer = torch_tensorrt.MutableTorchTensorRTModule(
    quantized_transformer,
    strict=False,
    allow_complex_guards_as_runtime_asserts=True,
    enabled_precisions={torch.float8_e4m3fn, torch.float16},  # include the FP8 quantization precision
    truncate_double=True,
    immutable_weights=False,
    offload_module_to_cpu=True,
)
images = pipe(
    prompt,
    output_type="pil",
    num_inference_steps=20,
    num_images_per_prompt=batch_size,
).images
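The latency figures reported next can be reproduced with a simple wall-clock measurement around the pipeline call; the exact benchmarking harness isn’t part of this post, so treat this as a sketch:

import time

torch.cuda.synchronize()
start = time.perf_counter()
images = pipe(
    prompt,
    output_type="pil",
    num_inference_steps=20,
    num_images_per_prompt=batch_size,
).images
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
# 20 denoising steps x batch_size images per call
print(f"total: {elapsed:.2f} s, per step per image: {elapsed / (20 * batch_size) * 1000:.0f} ms")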
Running FLUX.1-dev with PyTorch on one B200 GPU with 20 denoising steps and a batch size of two takes around 6.56 seconds on average, or about 164 ms per step per image.
With only one line of code, MTTM reduces the average time to generate a batch of two images to 4.28 seconds, with per-step latency down to 107 ms in FP16 precision. This is a 1.5x speedup over running FLUX.1-dev with PyTorch in FP16.
With further quantization to FP8, the optimized TensorRT engine achieves up to a 2.4x speedup on the B200 GPU compared to the original Hugging Face implementation running in FP16: the average time to generate a batch of two images drops further to 2.72 seconds, or 68 ms per step.
Moreover, FP8 makes it possible to run FLUX.1-dev on consumer hardware such as the GeForce RTX 5090, which has 32 GB of memory. At a batch size of 1, the RTX 5090 can successfully compile and run the model at an average latency of 260 ms per step.
By using Torch-TensorRT APIs like the MutableTorchTensorRTModule, developers can create generative AI applications with high-throughput, low-latency performance. By combining hybrid JIT/AOT compilation with simple quantization techniques, users can significantly reduce inference time and GPU memory footprint while supporting dynamic workflows such as LoRA, with minimal changes to their existing PyTorch code.
In the future, FP4 precision will also be added to the supported precision list to further reduce the model’s memory footprint and improve inference speed. While this blog post features FLUX, the workflow generalizes to many diffusion models supported by Hugging Face Diffusers, such as Stable Diffusion and Kandinsky. Learn more and view the FLUX.1-dev demo on GitHub.