NVIDIA TensorRT-LLM deserialization flaws expose distributed inference control paths

Severity: critical

CVE

CVE-2025-33255, CVE-2026-24142

CWE

CWE-502

Affected Surface

NVIDIA TensorRT-LLM before 1.2
PyPI package tensorrt-llm before 1.2
TRT-LLM MPI server deployments
Distributed TensorRT-LLM inference and RLHF workflows that deserialize IPC or weight handles
Self-hosted model-serving clusters where MPI/control-plane traffic can be influenced by untrusted tenants

NVIDIA published two TensorRT-LLM deserialization CVEs on 20 May 2026. Both affect NVIDIA/TensorRT-LLM before 1.2; the corresponding Python package is tensorrt-llm.

CVE-2025-33255: unsafe deserialization in the TRT-LLM MPI server path.
CVE-2026-24142: unsafe deserialization of serialized handles.

NVD currently scores both as CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H with a 9.8 critical base score. NVIDIA’s CNA vectors are lower and local-privilege-oriented: 7.5 for CVE-2025-33255 and 6.3 for CVE-2026-24142. That split matters for triage: the NVD vector describes a network-reachable attacker with no privileges, while the CNA vectors describe an attacker who already has local access. Which one fits a given deployment depends on whether an attacker can influence the MPI/control-plane data, serialized IPC handles, or orchestration paths that move model state across ranks and workers.

The vulnerable component is not a web framework endpoint. It is the distributed inference control plane that sits behind LLM serving and RLHF workflows. In a single-tenant lab machine, that may require local access. In a self-hosted model platform, shared GPU cluster, notebook environment, or multi-tenant inference service, the same primitive can sit close to tenant-controlled model, job, or worker inputs.

Affected package and project

Affected:

project: NVIDIA/TensorRT-LLM
package: tensorrt-llm
vulnerable range: versions before 1.2
fixed baseline: 1.2 or later, with 1.2.1 or newer preferred where available

Inventory should include container images and internal wheels, not just direct requirements.txt entries. TensorRT-LLM deployments are commonly packaged into GPU-serving images, so an application may be exposed even when the application repository does not directly name tensorrt-llm.
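
One way to cover images and unpacked filesystems is to look for the package's installed metadata directly instead of reading requirements files. The sketch below assumes a standard wheel install, where pip leaves a directory such as tensorrt_llm-1.1.0.dist-info in site-packages. The function name find_tensorrt_llm_installs is hypothetical, and vendored or source-tree copies without a dist-info directory will not be found.

```python
import os
import re

# Matches wheel metadata directories such as "tensorrt_llm-1.1.0.dist-info".
# The separator class tolerates "-", "_" or "." in the normalized project name.
DIST_INFO = re.compile(
    r"^tensorrt[-_.]llm-(?P<version>.+)\.dist-info$", re.IGNORECASE
)


def find_tensorrt_llm_installs(root):
    """Return sorted (path, version) pairs for matching dist-info dirs under root."""
    hits = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            match = DIST_INFO.match(name)
            if match:
                hits.append((os.path.join(dirpath, name), match.group("version")))
    return sorted(hits)
```

Pointing root at an exported image filesystem or a mounted layer lists every installed copy and its version, including copies in virtual environments that the application repository never names.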

Why pickle matters here

Python pickle is not a data format in the JSON or protobuf sense. It is a bytecode format: an opcode stream for a small stack-based virtual machine that reconstructs Python objects, and reconstruction can import modules and call attacker-selected callables. A minimal unsafe pattern looks like this:

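import pickle

# untrusted_bytes stands for any bytes that crossed a trust boundary
# (socket, queue, file, IPC message).
obj = pickle.loads(untrusted_bytes)  # runs whatever reducers the sender encoded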

An attacker-controlled pickle can define a reducer that maps deserialization to process execution:

class RunsCode:
    def __reduce__(self):
        import os
        return (os.system, ("id >&2",))

That illustrative gadget is not TensorRT-LLM-specific; it is the bug class. Any code path that treats a pickle blob as ordinary structured data has to prove that the blob is trusted and that all callable reconstruction targets are safe. TensorRT-LLM’s fixes show that the vulnerable paths involved exactly this kind of trust boundary.
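
The same mechanism can be observed without running a shell command. The following sketch swaps os.system for the harmless built-in sorted and uses the standard pickletools module to list the opcodes in the payload. The class name CallsSorted is made up for illustration.

```python
import pickle
import pickletools


class CallsSorted:
    """Harmless stand-in for a gadget: the reducer names a callable and its arguments."""

    def __reduce__(self):
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(CallsSorted())

# Loading does not rebuild a CallsSorted instance; it calls sorted([3, 1, 2])
# and returns whatever that call returns.
result = pickle.loads(payload)
print(type(result).__name__, result)

# The payload is a sequence of opcodes. One of them resolves a global by name
# (GLOBAL or STACK_GLOBAL, depending on the protocol) and REDUCE calls it.
opcodes = [op.name for op, _arg, _pos in pickletools.genops(payload)]
print(opcodes)
```

The receiver never needs the CallsSorted class: the payload only carries the name of the callable and its arguments, which is why the sender, not the receiver, decides what runs.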

Serialized weight handles

The clearest public code trail is in tensorrt_llm/llmapi/rlhf_utils.py. Before the fix, update_weights() accepted serialized IPC handles indexed by GPU device UUID and decoded a base64 string directly into pickle.loads():

serialized_handles = ipc_handles[device_uuid]
if isinstance(serialized_handles, str):
    all_handles = pickle.loads(base64.b64decode(serialized_handles))

That is dangerous because ipc_handles is a control-plane object. In distributed inference and RLHF flows, it represents cross-process handles for model weights, not user-facing business data. But if an attacker can influence that object through a compromised worker, malicious job, poisoned orchestration state, or exposed management API, the deserialize step becomes a code-execution boundary.

NVIDIA replaced the direct pickle call with a restricted unpickler in tensorrt_llm.serialization:

decoded_data = base64.b64decode(serialized_handles)
all_handles = serialization.loads(
    decoded_data,
    approved_imports=approved_imports,
    approved_module_patterns=[r"^torch.*"],
)

if not isinstance(all_handles, list):
    raise ValueError(
        f"Deserialized data must be a list, got {type(all_handles).__name__} instead"
    )

The first fix moved from unrestricted object loading to an allowlist model. Later hardening tightened the torch.* module-pattern approval into explicit class and dtype allowlists for the tensor-handle objects TensorRT-LLM actually needs. That second step matters: broad module regexes reduce exposure but still leave a large import surface. For deserialization, the safest allowlist is as small as the expected object graph permits.
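
A minimal sketch of the exact-name allowlist idea, using only the standard library, might look like the following. It is not the tensorrt_llm.serialization implementation: AllowlistUnpickler, restricted_loads, and the single OrderedDict entry are assumptions chosen for illustration, following the find_class override described in the Python pickle documentation.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist: exact (module, name) pairs only, no patterns.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the payload asks for, before it is used.
        if (module, name) not in ALLOWED_GLOBALS:
            raise pickle.UnpicklingError(f"global {module}.{name} is not allowed")
        return super().find_class(module, name)


def restricted_loads(data):
    """Deserialize with the allowlist, then validate the top-level type."""
    obj = AllowlistUnpickler(io.BytesIO(data)).load()
    if not isinstance(obj, list):
        raise ValueError(f"expected a list, got {type(obj).__name__}")
    return obj


class Gadget:
    """Stand-in attacker object whose reducer asks for builtins.eval."""

    def __reduce__(self):
        return (eval, ("1 + 1",))


print(restricted_loads(pickle.dumps([OrderedDict(a=1)])))
try:
    restricted_loads(pickle.dumps([Gadget()]))
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```

Plain containers, strings, and numbers never reach find_class, so the allowlist only needs the classes the expected object graph actually contains. Allowed classes still run their own reconstruction logic, which is one reason to keep the set small.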

MPI server risk

CVE-2025-33255 covers the MPI server side of TRT-LLM. MPI is often treated as trusted internal plumbing, but in modern model-serving systems it can sit behind schedulers, worker launchers, dynamic scale-out, notebook jobs, and service meshes. The key question is not “Is this Internet-exposed?” but “Can an untrusted principal cause bytes to reach the MPI server’s deserialization path?”

High-risk patterns include:

shared GPU clusters where different tenants can submit TensorRT-LLM jobs;
inference platforms that let customers upload or select model artifacts, adapters, or job definitions;
Kubernetes deployments where compromised pods can reach MPI or worker control ports;
CI or benchmark runners that execute untrusted model-serving experiments;
internal APIs that accept serialized handles, checkpoint metadata, or worker state and forward them into TRT-LLM.

If those boundaries exist, a deserialization bug in control-plane code is not just a local hardening issue. It can become lateral movement from an ordinary model job into the serving worker account, and from there into model weights, credentials, GPUs, or adjacent services.

Remediation

Upgrade TensorRT-LLM to 1.2 or later. Prefer 1.2.1 or the newest stable release available for your CUDA, driver, TensorRT, and GPU platform. Rebuild serving containers after the package update; do not rely on changing only the application layer if tensorrt-llm is baked into a base image.

Check Python environments:

python -m pip show tensorrt-llm
python - <<'PY'
import importlib.metadata
print(importlib.metadata.version("tensorrt-llm"))
PY
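
To turn the printed version into a pass/fail result, a small comparison against the fixed baseline can help. The sketch below uses only the standard library and a simplified reading of version strings, not a full PEP 440 parser. The names is_vulnerable and installed_version are hypothetical, and treating 1.2 pre-releases (for example 1.2.0rc1) as vulnerable is a conservative assumption, not something the advisory states.

```python
import re
from importlib import metadata

FIXED_RELEASE = (1, 2)  # fixed baseline named in the advisory
# Assumption: pre-release builds of the fixed release are treated as unpatched.
PRE_RELEASE = re.compile(r"^[-_.]?(a|b|rc|alpha|beta|pre|preview|dev)", re.IGNORECASE)


def is_vulnerable(version):
    """Return True if the version string sorts before the fixed baseline."""
    match = re.match(r"v?(\d+(?:\.\d+)*)(.*)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = [int(part) for part in match.group(1).split(".")]
    while release and release[-1] == 0:  # 1.2.0 and 1.2 compare equal
        release.pop()
    release = tuple(release)
    if release < FIXED_RELEASE:
        return True
    return release == FIXED_RELEASE and bool(PRE_RELEASE.match(match.group(2)))


def installed_version(dist_name="tensorrt-llm"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

A CI step could call installed_version() inside each serving image and fail the build when is_vulnerable() returns True for the result.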

For containers:

docker run --rm your-image python -m pip show tensorrt-llm

For self-maintained forks or vendor-patched builds, inspect the deserialization call sites. The vulnerable shape is unrestricted pickle loading from values that can cross process, worker, model, or tenant boundaries:

rg -n "pickle\.loads?\b" tensorrt_llm
rg -n "serialization\.loads" tensorrt_llm/llmapi

The fixed pattern should use a restricted unpickler, narrow allowlists, and type validation after deserialization. Do not replace the fix with ad hoc string checks around pickle input; pickle payloads are bytecode, not text.
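
Text search misses aliased imports such as "import pickle as pk" or "from pickle import loads". As a complement, the sketch below walks the Python syntax tree with the standard ast module and reports call sites that resolve to pickle.load, pickle.loads, or pickle.Unpickler. The function name find_pickle_calls is hypothetical, and the check is name-based: it cannot see dynamic lookups and it says nothing about whether the input at a call site is trusted, so each hit still needs manual review.

```python
import ast

PICKLE_ENTRY_POINTS = {"load", "loads", "Unpickler"}


def find_pickle_calls(source, filename="<source>"):
    """Return sorted (line, "pickle.<name>") pairs for direct pickle calls."""
    tree = ast.parse(source, filename=filename)

    module_names = set()  # local names bound to the pickle module
    direct_names = {}     # local name -> imported pickle function
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == "pickle":
                    module_names.add(alias.asname or alias.name)
        elif (isinstance(node, ast.ImportFrom)
              and node.module == "pickle" and node.level == 0):
            for alias in node.names:
                if alias.name in PICKLE_ENTRY_POINTS:
                    direct_names[alias.asname or alias.name] = alias.name

    hits = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id in module_names
                and func.attr in PICKLE_ENTRY_POINTS):
            hits.append((node.lineno, f"pickle.{func.attr}"))
        elif isinstance(func, ast.Name) and func.id in direct_names:
            hits.append((node.lineno, f"pickle.{direct_names[func.id]}"))
    return sorted(hits)
```

Running this over each .py file in a fork gives a list of line numbers to compare against the patched upstream call sites.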

Hardening

Treat TensorRT-LLM control traffic as privileged:

restrict MPI and worker control ports to the serving nodes that require them;
isolate tenant jobs with Kubernetes NetworkPolicy, security groups, or equivalent controls;
avoid sharing a TRT-LLM worker account across trust boundaries;
prevent ordinary model-upload or benchmark paths from passing raw serialized Python objects into serving workers;
log worker restarts, unexpected MPI session creation, and deserialization failures;
rotate credentials present in serving containers if there is evidence that an untrusted job reached the vulnerable paths before patching.
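
Network restrictions are easy to assume and harder to verify. A rough check, sketched below, is to attempt TCP connections to worker control endpoints from a place that should not reach them, such as a tenant pod. The helper names can_connect and reachable_targets are hypothetical, and the hosts and ports to probe are deployment-specific values that must be supplied by the operator. A successful connect only shows network-level reachability, not that the vulnerable path is exposed.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def reachable_targets(targets, timeout=2.0):
    """Return the subset of (host, port) pairs that accept a connection."""
    return [(host, port) for host, port in targets if can_connect(host, port, timeout)]
```

Run from an untrusted network position, an empty result from reachable_targets() is the expected outcome; any entry it returns points at a policy gap worth investigating.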

The broader lesson is that AI serving frameworks increasingly contain their own distributed systems: schedulers, IPC, MPI, tensor handles, plugin loading, and model-artifact pipelines. These are application-security surfaces. They deserve the same trust-boundary review as HTTP handlers and queue consumers, especially when they deserialize Python objects.
