An MCP server passes every unit test in your repo. The JSON Schema validates, the tools return what they advertise, the integration tests exercise the happy path.
Then Anthropic ships Sonnet 4.8. Or your customer switches from Claude to GPT. Or somebody bumps the model version in a config file you don't own. The agent that drove your tests last week suddenly picks the wrong tool, or the right tool with arguments that look right but aren't, or skips the catalog entirely and tries to do the job from memory.
None of that is a bug in your server. None of it is a bug in the model. It's a regression in the contract between them, and the contract is something you can't write down because half of it lives inside the model's weights. Your MCP server has a hard dependency on every model that might call it, and that dependency keeps shipping new versions.
The model is the consumer
An HTTP API is called by code that a human wrote and reviewed. The OpenAPI spec is the contract, the calling pattern is fixed at deploy time, and if the spec changes, the build breaks loudly in front of a person who can read the stack trace.
An MCP server is called by a language model. The model interprets a system prompt, a conversation history, and a tool catalog it has never seen before. It decides which tool to pick, what arguments to fill in, when to retry, when to give up, and when to lie about whether the call succeeded. None of those decisions live in your server's code. They live in the weights of whatever model is talking to your server today.
Put differently: an MCP server is the public half of a contract whose private half is the model's prompt-time reasoning. You can unit-test the public half all day. The interesting failures live where the two halves meet, and the private half changes every time a frontier lab pushes a release.
What model drift actually changes
When a new model ships, four things shift, and none of them shift in a way your current test suite can see.
Tool selection probability. The same prompt that picked search_docs on the old model picks search_web on the new one, because the new model's description embeddings sit slightly differently. Your most-called tool changes overnight, and your dashboards show the same number of total calls.
Argument shape preferences. The new model fills in optional fields the old model left blank. Or it stops filling in a field your server quietly required. Or it starts wrapping query strings in JSON arrays because that's what its training data preferred. Your validators still pass. Your results get worse.
Retry and error-handling style. The old model retried twice and gave up. The new model retries six times with mutated arguments, or it gives up after one failure and apologizes to the user, or it falls back to a different tool you didn't expect it to know about.
Emergent behaviors. The new model plans before it acts. Or it refuses prompts the old model accepted. Or it produces JSON in a slightly different style that breaks your downstream parser. These don't appear in release notes. They appear in production.
None of these are bugs the model lab introduced. They're consequences of training a new model on slightly different data with slightly different objectives. They are also the things that will break your MCP server next week.
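As a hypothetical illustration of the first two shifts, here is what two recordings of the same prompt against the same server might look like, one per model version. Every name and value below is made up for illustration, not taken from a real run.

```python
# Hypothetical recordings of one prompt against one MCP server.
# All tool names, arguments, and counts are illustrative.
old_model_call = {
    "tool": "search_docs",                      # tool selection on the old model
    "arguments": {"query": "rotate api key"},   # bare string argument
    "retries": 1,
}
new_model_call = {
    "tool": "search_web",                       # same prompt, different tool
    "arguments": {
        "query": ["rotate api key"],            # query now wrapped in an array
        "max_results": 10,                      # optional field now filled in
    },
    "retries": 4,
}
# Both calls validate against the schema; only the behavior changed.
```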
LLM as jury
The only test that catches any of this is one that has an LLM make the call.
The pattern is straightforward. You hand a model your MCP server's catalog and a realistic prompt, then let it drive. You record what it did: which tool it picked, what arguments it used, how it handled errors, how it interpreted the response. That recording is the artifact. You re-run the same test against the same server with a different model, or the same model on a different day, and you diff the recordings. The diff is your regression signal.
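A minimal sketch of that record-and-diff loop, assuming a `drive` hook that hands the model the catalog, lets it act, and returns the tool calls it made as dicts with "tool", "arguments", and "error" keys. That hook and its output shape are assumptions of this sketch, not any particular SDK's API.

```python
import json
from pathlib import Path
from typing import Callable

BASELINES = Path("baselines")

def record_and_diff(case_id: str, model: str, prompt: str,
                    drive: Callable[[str, str], list[dict]]) -> list[str]:
    """Record what `model` did for `prompt`, diff against the stored baseline,
    and return a human-readable list of behavioral drift."""
    recording = drive(model, prompt)            # assumed harness hook, see lead-in
    path = BASELINES / f"{case_id}.json"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(recording, indent=2, sort_keys=True))
        return []                               # first run establishes the baseline
    baseline = json.loads(path.read_text())
    drift = []
    for i, (old, new) in enumerate(zip(baseline, recording)):
        if old["tool"] != new["tool"]:
            drift.append(f"call {i}: tool {old['tool']} -> {new['tool']}")
        elif old["arguments"] != new["arguments"]:
            drift.append(f"call {i}: arguments for {old['tool']} changed")
        elif bool(old["error"]) != bool(new["error"]):
            drift.append(f"call {i}: error handling changed on {old['tool']}")
    if len(baseline) != len(recording):
        drift.append(f"call count {len(baseline)} -> {len(recording)}")
    return drift
```

The baseline file is the recording; the returned list is the diff you read when a model ships.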
That's an LLM jury. The model is on the panel, not on the witness stand. The MCP server is the defendant. The verdict is whether the agent's behavior changed, and the test runner's job is to tell you which way.
Two variants are useful.

Fix the jury, rotate the defendant. Pick a stable model (Sonnet 4.6, say) and use it to grade every change you make to your MCP server. This is a server-regression test: any diff means your edit changed how the model behaves, even if the schema looks the same.

Fix the defendant, rotate the jury. Pin your MCP server and run the same tests against Sonnet 4.6, Sonnet 4.7, Opus 4.7, GPT-5, and Gemini. The diffs are your model-portability map. They tell you which model releases will require an MCP server change before you flip the customer-facing flag.
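Both variants are a test matrix over the same cases. A sketch with pytest, reusing the record_and_diff helper above; the model identifiers, the test prompts, and the drive_agent stand-in are all assumptions of this sketch.

```python
import pytest

PINNED_JURY = "sonnet-4.6"          # fixed juror for server-regression runs
JURY_PANEL = ["sonnet-4.6", "sonnet-4.7", "opus-4.7", "gpt-5", "gemini"]

CASES = [
    ("lookup-docs", "Find the docs page that explains API key rotation."),
    ("failed-call", "Fetch the usage report for a workspace that does not exist."),
]

def drive_agent(model: str, prompt: str) -> list[dict]:
    """Stand-in for the harness that hands `model` the MCP catalog and lets it
    call tools; wire this to your own agent runner."""
    raise NotImplementedError

# Variant 1: fix the jury, rotate the defendant. Run on every server change.
@pytest.mark.parametrize("case_id,prompt", CASES)
def test_server_regression(case_id, prompt):
    drift = record_and_diff(f"{PINNED_JURY}/{case_id}", PINNED_JURY, prompt,
                            drive=drive_agent)
    assert not drift, "\n".join(drift)

# Variant 2: fix the defendant, rotate the jury. Run on every model release;
# the failures are the model-portability map, not necessarily bugs.
@pytest.mark.parametrize("model", JURY_PANEL)
@pytest.mark.parametrize("case_id,prompt", CASES)
def test_model_portability(model, case_id, prompt):
    drift = record_and_diff(f"{model}/{case_id}", model, prompt,
                            drive=drive_agent)
    assert not drift, "\n".join(drift)
```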
Why this is finally tractable
The LLM-jury pattern has been the right idea for a while, and it has been impractical for almost as long. Models were too expensive, too slow, and too non-deterministic to use as a CI primitive. None of those are true anymore.
Frontier-tier models are now cheap enough that running a hundred test cases against one costs less than a CI minute. Latency is low enough that a full jury sweep fits inside a normal CI step. And the non-determinism that used to make LLM testing impossible is now manageable, because we have better tooling for temperature, seeds, and snapshot-style assertions that allow for variance in surface form but pin the underlying decision. A 2026 LLM jury is a deterministic-enough fixture to put in front of a deterministic server.
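One sketch of the "pin the decision, allow the surface form to vary" idea: run the jury at temperature 0 where the API allows it, then normalize each recorded call down to the fields that encode the decision before comparing. The normalization rules below are illustrative, not canonical.

```python
def normalize_call(call: dict) -> dict:
    """Reduce a recorded tool call to its decision-relevant core.

    Surface-level variance (key order, whitespace, a string vs. a one-element
    list) is collapsed; the tool choice and argument substance are kept."""
    def canon(value):
        if isinstance(value, str):
            return " ".join(value.split()).lower()       # whitespace and case noise
        if isinstance(value, list) and len(value) == 1:
            return canon(value[0])                       # ["x"] vs "x"
        if isinstance(value, dict):
            return {k: canon(v) for k, v in sorted(value.items())
                    if v not in (None, "")}              # drop empty optionals
        return value

    return {
        "tool": call["tool"],
        "arguments": canon(call.get("arguments", {})),
        "errored": bool(call.get("error")),
    }

def same_decision(old: dict, new: dict) -> bool:
    """True when two recordings encode the same underlying decision."""
    return normalize_call(old) == normalize_call(new)
```

Two recordings that differ only in phrasing pass; a different tool or a materially different argument fails.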
What this gives you
The deliverable from an LLM jury isn't a green CI check. It's a model-drift test plan. When the next model release ships, you run the suite, you read the diff, and you know exactly what changed in how agents will call your MCP server. Tool selection shifts get flagged. New argument shapes get flagged. Retry-pattern changes get flagged. The output of a jury run is the artifact your team uses to decide whether the model upgrade is a same-day flip or a quarter-long migration.
Three more things fall out of the same suite. A multi-model panel: same server, three jurors, see who agrees and who doesn't. A latency budget per call, because slow tools are tools agents abandon, and the jury is the only thing that measures latency in agent-realistic conditions. And catalog snapshots: the list of tools the jury saw last week, and an alert when the catalog changes shape under it.
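A sketch of the catalog snapshot and latency budget checks, assuming the jury harness also reports the tool list it was shown and a wall-clock duration per call; the field names and the budget value are assumptions, not fixed conventions.

```python
import json
from pathlib import Path

CATALOG_SNAPSHOT = Path("baselines/catalog.json")
LATENCY_BUDGET_S = 2.0               # illustrative per-call budget

def check_catalog(tools_seen: list[dict]) -> list[str]:
    """Compare the tool catalog the jury saw against last run's snapshot."""
    current = sorted(t["name"] for t in tools_seen)
    if not CATALOG_SNAPSHOT.exists():
        CATALOG_SNAPSHOT.parent.mkdir(parents=True, exist_ok=True)
        CATALOG_SNAPSHOT.write_text(json.dumps(current, indent=2))
        return []
    previous = json.loads(CATALOG_SNAPSHOT.read_text())
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    return ([f"tool added: {t}" for t in added] +
            [f"tool removed: {t}" for t in removed])

def check_latency(recording: list[dict]) -> list[str]:
    """Flag calls that blew the per-call budget during a jury run."""
    return [
        f"{call['tool']} took {call['duration_s']:.2f}s (budget {LATENCY_BUDGET_S}s)"
        for call in recording
        if call.get("duration_s", 0.0) > LATENCY_BUDGET_S
    ]
```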
Where this lives in the stack
The three pieces of an enterprise AI stack form a feedback loop. The catalog (clictl) ships specs for tools and MCP servers. The gateway (SBproxy) governs the calls those tools make. The test runner (mcptest) puts an LLM jury in front of every server in the catalog and tells you when their behavior under real models drifts. Each one assumes the others. Catalog without tests rots into a graveyard of broken integrations. Gateway without a catalog routes traffic it can't reason about. Tests without a catalog or a gateway are unit tests with extra steps.
The agent is part of the contract. So is the model. Test both, or your MCP server's hardest dependency is one you can't see until production sees it first.