Chapter 53
TensorRT & ONNX
🔄 Train in PyTorch, Deploy Everywhere
ONNX is a framework-neutral model format. TensorRT is NVIDIA's high-performance inference optimizer and runtime for its GPUs. Together they make a golden pair for production inference.
PyTorch → ONNX Export
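import torch
import torchvision

# Stand-in model (assumption: any trained torch.nn.Module can be used here)
model = torchvision.models.resnet50(weights=None)
model.eval()  # disable dropout and use running batch-norm statistics before export
dummy = torch.randn(1, 3, 224, 224)  # example input used to trace the graph
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["logits"],
    # mark dimension 0 as dynamic so any batch size is accepted at runtime
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)
Inference with ONNX Runtime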
import numpy as np
import onnxruntime as ort

# Providers are tried in order: CUDA first, CPU as the fallback
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
x = np.random.randn(1, 3, 224, 224).astype("float32")
out = sess.run(None, {"input": x})  # None = return every model output
ONNX Optimization
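# Dynamic quantization: weights are stored as INT8, activations are quantized at runtime.
# The ONNX Runtime docs recommend this mode mainly for transformer/RNN models;
# CNNs usually do better with static (calibrated) quantization.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)
TensorRT: Build Engine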
# Quick conversion with trtexec
# (TensorRT 8.4+ replaces the deprecated --workspace=4096 with --memPoolSize, sized in MiB)
trtexec --onnx=model.onnx --saveEngine=model.engine \
  --fp16 --memPoolSize=workspace:4096 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224
# Python runtime
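import tensorrt as trt
import pycuda.autoinit  # noqa: F401 (creates a CUDA context on import)
import pycuda.driver as cuda  # needed for device buffers and copies at inference time

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
# Load the serialized engine built by trtexec
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
ctx = engine.create_execution_context()
Precision Options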
- FP32: the baseline.
- FP16: roughly 2x faster than FP32, with nearly the same accuracy.
- INT8: roughly 4x faster than FP32; needs a calibration dataset.
- INT4 (LLMs): weight-only quantization with AWQ or GPTQ for extreme compression.
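To make the INT8 entry above more concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The helper names `quantize_int8` and `dequantize` and the sample weights are hypothetical, chosen only for illustration; real toolchains such as TensorRT or ONNX Runtime use more elaborate schemes (per-channel scales, calibration algorithms). The key idea is that each float is stored as an integer in [-127, 127] plus one shared scale. For weights the scale can be read directly from the tensor, but for activations the range has to be estimated from representative inputs, which is why a calibration dataset is needed.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [qi * scale for qi in q]


# Made-up weights, for illustration only
weights = [0.02, -1.27, 0.63, 0.5, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("scale        :", scale)
print("int8 values  :", q)
print("max abs error:", max_err)
```

The rounding error per value is bounded by half the scale, so a tensor with a few large outliers gets a large scale and loses precision on its small values. That is one reason accuracy can drop after quantization.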
Benchmark Example
# ResNet-50 (A100, batch=8), typical figures
PyTorch FP32 : 12 ms
ONNX FP16 : 6 ms
TensorRT FP16: 3 ms
TensorRT INT8: 1.5 ms
When to Use Which
- ONNX Runtime: CPU and cross-platform targets, edge devices.
- TensorRT: NVIDIA GPUs, lowest latency.
- OpenVINO: Intel CPUs and integrated GPUs.
- CoreML / TFLite: Apple devices / mobile.
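ONNX Runtime exposes several of these backends as execution providers, so one way to act on the list above is a preference-ordered fallback. The sketch below is an assumption-laden illustration: `pick_providers` and the ordering in `PREFERENCE` are hypothetical, while the provider name strings are the ones ONNX Runtime uses. In a real program the `available` list would come from `onnxruntime.get_available_providers()`; here it is passed in by hand so the logic runs without any GPU or extra package.

```python
# Hypothetical preference order: fastest specialised backend first, CPU last
PREFERENCE = [
    "TensorrtExecutionProvider",   # NVIDIA GPU via TensorRT
    "CUDAExecutionProvider",       # NVIDIA GPU via CUDA
    "OpenVINOExecutionProvider",   # Intel CPU / iGPU
    "CoreMLExecutionProvider",     # Apple devices
    "CPUExecutionProvider",        # portable fallback
]


def pick_providers(available, preference=PREFERENCE):
    """Return the preferred providers that are actually available, in order."""
    return [p for p in preference if p in available]


# Example: a machine with an NVIDIA GPU but no TensorRT build of ONNX Runtime
available = ["CPUExecutionProvider", "CUDAExecutionProvider"]
providers = pick_providers(available)
print(providers)
```

The resulting list can be passed as the `providers` argument of `ort.InferenceSession`, which tries the entries in order.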
⚠️ Accuracy Check
Accuracy can drop somewhat after quantization. After every export, measure the accuracy difference on a validation set (target: less than a 1% drop).
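The check described above can be sketched as a small comparison of top-1 accuracy between the baseline and the quantized model. Everything here is illustrative: the function names, the toy logits, and the labels are made up, and the 1% budget is interpreted as an absolute difference in accuracy, which is an assumption. In practice the two sets of logits would come from running the original and the exported model over the same validation set.

```python
def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = sum(
        1
        for row, label in zip(logits, labels)
        if max(range(len(row)), key=row.__getitem__) == label
    )
    return correct / len(labels)


def accuracy_drop(baseline_logits, candidate_logits, labels):
    """Positive result means the candidate model is less accurate."""
    return top1_accuracy(baseline_logits, labels) - top1_accuracy(candidate_logits, labels)


MAX_DROP = 0.01  # assumed budget: 1 percentage point of top-1 accuracy

# Toy validation data (4 samples, 3 classes), for illustration only
labels = [0, 1, 2, 1]
fp32_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 0.9], [0.3, 0.8, 0.5]]
int8_logits = [[1.9, 0.2, 0.3], [0.2, 1.4, 0.2], [0.1, 0.3, 0.8], [0.3, 0.6, 0.7]]

drop = accuracy_drop(fp32_logits, int8_logits, labels)
within_budget = drop <= MAX_DROP
print(f"accuracy drop: {drop:.2%}, within budget: {within_budget}")
```

On this toy data the quantized logits flip one of the four predictions, so the check fails. With a real validation set, a failure like this would suggest trying better calibration data or keeping sensitive layers in higher precision.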
Summary
✨ What We Learned in This Chapter
- ONNX is a portable model format.
- TensorRT gives the fastest inference on NVIDIA GPUs.
- FP16/INT8 quantization gives large speedups.