Tool Discovery for 4,800 MCP Tools — Without a Vector Database

Jun 11, 2026

Axel Dunkel

On our production ToolMesh instance, asking discover_tools for everything NetBox-related returned about 320 KB of TypeScript declarations. That is roughly 80,000 tokens — for many models half the context window, spent before the agent has called a single tool. Asking for all tools would have produced output on the order of 2 MB.

Nobody designed it to do that. It grew into it.

How a discovery call gets to 320 KB

ToolMesh is a self-hosted MCP gateway that speaks Code Mode: backends are exposed as flat JavaScript functions (toolmesh.netbox_list_devices(...)), and a discover_tools meta-tool returns TypeScript signatures so the model can write correct calls. With a handful of backends, returning every matching signature is exactly right.

Our instance has outgrown that. It runs 49 backends with 4,767 tools: NetBox alone contributes 608, two MikroTik switches 297 each, two OPNsense firewalls 292 each, GitHub 203, Cloudflare 200. Each full signature averages ~530 bytes, in part because NetBox list endpoints carry a dozen or more filter parameters.
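As a rough sanity check, these figures reproduce the sizes quoted in the introduction. The calculation applies the catalog-wide average of ~530 bytes to every tool, and the ratio of 4 bytes per token is an assumed rule of thumb, not a measured value:

```latex
\begin{align*}
608 \times 530\ \text{B} &= 322{,}240\ \text{B} \approx 320\ \text{KB} && \text{(all NetBox signatures)} \\
322{,}240\ \text{B} \div 4\ \text{B/token} &\approx 80{,}600\ \text{tokens} \\
4{,}767 \times 530\ \text{B} &= 2{,}526{,}510\ \text{B} \approx 2.5\ \text{MB} && \text{(entire catalog)}
\end{align*}
```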

Three effects compound at this scale:

Full declarations are the only output mode. Every match renders with its complete parameter list.
The search regex matches names and descriptions. A pattern like device fans out across NetBox, UniFi, MikroTik, and Shelly at once.
Multi-instance backends share prefixes. Searching mikrotik returns both switches — 594 declarations.

So a perfectly reasonable agent query — “show me the NetBox tools” — detonates the context window.

The fix everyone expects

The reflexive 2026 answer is: embed the tool descriptions, put them in a vector database, do semantic retrieval. We run that pattern elsewhere and it earns its keep there.

For this problem we deliberately said no, for three reasons.

The corpus is tiny and static. ~4,800 short, well-named, well-described documents that only change on a config reload. That is an in-memory index rebuilt in milliseconds — not a persistence problem. Adding PostgreSQL/pgvector to a single-binary, self-hosted Go gateway to search five thousand strings is infrastructure out of all proportion. Even the lean path — an embedded ONNX model — drags in a native runtime library, which is real packaging pain for software other people deploy themselves.

The vocabulary is already aligned. Tool names and descriptions are written for LLM consumption. Lexical ranking has excellent material to work with.

Most importantly: the agent is the semantic layer. An LLM that sees a compact, ranked result list with an honest summary line reformulates its own query in one turn. It does not need a perfect search. It needs a cheap iteration loop and feedback about what it is looking at.

What we shipped instead

Three changes, all in ToolMesh v0.3.1+, zero new dependencies.

1. Output that degrades before it detonates

discover_tools now scales its detail level to the number of matches: up to 25 matches return full TypeScript signatures, up to 250 return one-line summaries, up to 2,000 return names only, and anything beyond that returns per-backend counts. A detail parameter overrides the automatic choice, and a hard 50 KB cap backstops every tier.
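A minimal sketch of the tier selection, using the thresholds stated above. The function names, the `Detail` type, and the line-based truncation in `capOutput` are assumptions made for illustration, not the ToolMesh source; the post does not say how the cap truncates.

```go
package main

import (
	"fmt"
	"strings"
)

// Detail mirrors the values of the detail parameter quoted in the post.
type Detail string

const (
	Full     Detail = "full"
	Summary  Detail = "summary"
	Names    Detail = "names"
	Overview Detail = "overview"
)

// maxBytes is the hard cap from the post, interpreted here as 50 * 1024 bytes.
const maxBytes = 50 * 1024

// chooseDetail picks a tier from the match count unless the caller overrides it.
func chooseDetail(matches int, override Detail) Detail {
	if override != "" {
		return override
	}
	switch {
	case matches <= 25:
		return Full
	case matches <= 250:
		return Summary
	case matches <= 2000:
		return Names
	default:
		return Overview
	}
}

// capOutput (hypothetical helper) keeps whole lines until the next one would
// push the output past limit bytes, and reports how many lines were kept.
func capOutput(lines []string, limit int) (string, int) {
	var sb strings.Builder
	kept := 0
	for _, l := range lines {
		if sb.Len()+len(l)+1 > limit {
			break
		}
		sb.WriteString(l)
		sb.WriteByte('\n')
		kept++
	}
	return sb.String(), kept
}

func main() {
	for _, n := range []int{12, 180, 608, 4767} {
		fmt.Printf("%d matches -> %s\n", n, chooseDetail(n, ""))
	}
	out, kept := capOutput([]string{"rest:netbox", "rest:github"}, maxBytes)
	fmt.Printf("kept %d lines, %d bytes\n", kept, len(out))
}
```

Under these thresholds the 608 NetBox matches from the introduction fall into the names-only tier, and an explicit `detail` override is still subject to the byte cap.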

The former 2 MB worst case — no filter at all — now answers in about 1.5 KB:

// ToolMesh tools — per-backend overview
rest:netbox — 608 tools
rest:mikrotik-sw-lii-labor-perlman — 297 tools
rest:mikrotik-sw-mikrotik-10g — 297 tools
rest:opnsense_backup — 292 tools
...

// ── 4767 of 4767 tools matched — detail: overview (auto).
// Refine: narrower pattern, query:"<free text>" for ranked results,
//         detail:"full"|"summary"|"names"|"overview", limit:N.

That footer is the most important line in the feature. Every response states how much of the catalog matched, how much is shown, and how to refine. The agent never mistakes a truncated view for the whole world — and it self-corrects in the next call instead of guessing.

2. Ranked free-text search — BM25 in ~200 lines of Go

Regex patterns are great for exact lookups (^netbox_list), bad for exploration. The new query parameter does ranked free-text search:

discover_tools({ query: "create dns record for a zone" })

returns the top 25 by relevance: Cloudflare, INWX, and Linode DNS tools interleaved by score, with no NetBox noise. Under the hood is a plain BM25 index over tool names (boosted), descriptions, and parameter names: about 200 lines of dependency-free Go, rebuilt per call in microseconds. Because parameter names are indexed, a query like rack_id finds netbox_list_devices even though “rack” appears nowhere in the tool name.
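To make the ranking concrete, here is a compact BM25 sketch over the same three fields. It is not the ToolMesh implementation: the `Tool` type, the name boost of 3, the tokenizer, and the parameters k1 = 1.2 and b = 0.75 are assumptions chosen for illustration, and the three catalog entries are made up.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
	"unicode"
)

// Tool is a hypothetical, simplified catalog entry.
type Tool struct {
	Name, Description string
	Params            []string
}

// Index holds weighted term frequencies and corpus statistics for BM25.
type Index struct {
	names  []string
	tf     []map[string]float64 // per tool: term -> weighted frequency
	length []float64            // per tool: weighted token count
	df     map[string]int       // term -> number of tools containing it
	avgLen float64
}

// tokenize lowercases and splits on every non-alphanumeric rune, so
// "netbox_list_devices" becomes "netbox", "list", "devices".
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Build indexes names (boosted), descriptions, and parameter names.
func Build(tools []Tool) *Index {
	const nameBoost = 3 // assumed weight: a name token counts three times
	ix := &Index{df: map[string]int{}}
	for _, t := range tools {
		tf, n := map[string]float64{}, 0.0
		add := func(text string, w float64) {
			for _, tok := range tokenize(text) {
				tf[tok] += w
				n += w
			}
		}
		add(t.Name, nameBoost)
		add(t.Description, 1)
		add(strings.Join(t.Params, " "), 1)
		for tok := range tf {
			ix.df[tok]++
		}
		ix.names = append(ix.names, t.Name)
		ix.tf = append(ix.tf, tf)
		ix.length = append(ix.length, n)
		ix.avgLen += n / float64(len(tools))
	}
	return ix
}

// Search returns up to limit tool names, best BM25 score first.
func (ix *Index) Search(query string, limit int) []string {
	const k1, b = 1.2, 0.75 // conventional BM25 defaults
	n := float64(len(ix.names))
	scores := make([]float64, len(ix.names))
	var order []int // indices of tools with a non-zero score
	for i := range ix.names {
		for _, q := range tokenize(query) {
			f, df := ix.tf[i][q], float64(ix.df[q])
			if f == 0 {
				continue
			}
			idf := math.Log(1 + (n-df+0.5)/(df+0.5))
			scores[i] += idf * f * (k1 + 1) / (f + k1*(1-b+b*ix.length[i]/ix.avgLen))
		}
		if scores[i] > 0 {
			order = append(order, i)
		}
	}
	sort.SliceStable(order, func(x, y int) bool { return scores[order[x]] > scores[order[y]] })
	if limit > 0 && len(order) > limit {
		order = order[:limit]
	}
	names := make([]string, len(order))
	for j, i := range order {
		names[j] = ix.names[i]
	}
	return names
}

func main() {
	ix := Build([]Tool{
		{"cloudflare_create_dns_record", "Create a DNS record in a zone", []string{"zone_id", "type", "name"}},
		{"netbox_list_devices", "List devices", []string{"rack_id", "site_id"}},
		{"github_create_issue", "Create an issue in a repository", []string{"repo", "title"}},
	})
	fmt.Println(ix.Search("create dns record for a zone", 25))
	fmt.Println(ix.Search("rack_id", 5))
}
```

In this toy catalog the DNS tool ranks first for the free-text query, and `rack_id` surfaces `netbox_list_devices` purely through its parameter names. The sketch builds the index once and reuses it, which is a simplification.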

BM25 is thirty-year-old information retrieval, built on probabilistic ranking work from the 1970s. That is not a weakness; at this corpus size it is the entire point.

3. Discovery inside the sandbox

The biggest structural win: agents no longer need a separate discovery round-trip at all. Inside execute_code, two local helpers are available:

// find → inspect → call, in ONE round trip
const hits = toolmesh.discover("move task to kanban bucket", 5);
const schema = toolmesh.describe(hits[0].name);
return await toolmesh[hits[0].name]({
  project_id: 2, view_id: 8, task_id: 200, bucket_id: 5
});

toolmesh.discover() and toolmesh.describe() run against the in-memory index — no backend call, no context cost beyond what the code explicitly returns, exempt from the per-execution call budget, and filtered by the caller’s authorization like every other surface. Discovery becomes something the code does, not something the context pays for.

What live testing caught within twenty minutes

We deployed the branch to production and drove it from a real agent session. Two bugs surfaced almost immediately — both invisible to unit tests:

A stale client schema turned limit into a string. The MCP client had cached the old discover_tools schema, so it serialized the unknown parameter limit: 5 as the string "5" — which the server silently ignored. Clients you do not control will hold stale schemas; agent-facing parameters now accept numeric strings.

Summaries truncated at “e.g.” The one-line summarizer cut at the first period-plus-space, so a description came out as “Set device key properties (e.g.” and nothing more. A period now ends a sentence only when followed by an uppercase letter.
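Both fixes are small enough to sketch together. The helper names `intParam` and `firstSentence` are hypothetical, the sample description is made up, and the sentence rule is read here as "period, then whitespace, then an uppercase letter", which is an assumption about the exact check.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// intParam reads an integer argument that may arrive as a JSON number or as
// a numeric string such as "5". The bool is false if the key is missing or
// the value cannot be used as an integer.
func intParam(args map[string]any, key string) (int, bool) {
	switch v := args[key].(type) {
	case float64: // encoding/json decodes JSON numbers into float64
		if v != float64(int(v)) {
			return 0, false // reject fractional values such as 2.5
		}
		return int(v), true
	case string:
		n, err := strconv.Atoi(strings.TrimSpace(v))
		if err != nil {
			return 0, false
		}
		return n, true
	default:
		return 0, false
	}
}

// firstSentence returns text up to and including the first period that is
// followed by whitespace and an uppercase letter. A period at the very end
// of the text also closes the sentence; otherwise the text is returned whole.
func firstSentence(text string) string {
	r := []rune(text)
	for i, c := range r {
		if c != '.' {
			continue
		}
		if i == len(r)-1 {
			return string(r[:i+1])
		}
		j := i + 1
		for j < len(r) && unicode.IsSpace(r[j]) {
			j++
		}
		if j > i+1 && j < len(r) && unicode.IsUpper(r[j]) {
			return string(r[:i+1])
		}
	}
	return text
}

func main() {
	var args map[string]any
	if err := json.Unmarshal([]byte(`{"limit": "5"}`), &args); err != nil {
		panic(err)
	}
	n, ok := intParam(args, "limit")
	fmt.Println(n, ok)
	fmt.Println(firstSentence("Set device key properties (e.g. name, serial). Returns the device."))
}
```

The uppercase rule is a heuristic: an abbreviation followed by a capitalized word, such as "e.g. NetBox", would still end the summary early.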

Neither is glamorous. Both are the difference between a feature that demos well and one that survives contact with real clients.

Prior art, and what changes at scale

Progressive disclosure for tools is not our invention. Anthropic’s code execution with MCP post describes loading tool definitions on demand; Claude Code defers tool schemas until needed; Cloudflare’s Code Mode made the write-code-against-tools pattern mainstream. The pattern is settling into consensus.

What production scale adds is the parts that rarely show up in demos: ranking (because 764 candidates match “dns” somewhere), an always-on feedback line (because agents refine well but only when told what they saw), hard output caps (because some client will always ask for everything), and in-sandbox discovery (because the cheapest context is the one never spent).

What’s next

The same index has two more consumers coming: a capability index for nested Code Mode backends — MCP servers that themselves only expose search/execute, whose capabilities ToolMesh will probe and index at registration — and synthetic tool descriptions generated from it at the MCP layer. And if real-world vocabulary misses ever justify it, an optional embedding channel can slot in behind the same query API — measured first, added second.

The implementation is open source (PR #80, Apache 2.0). The tool definitions it searches come from DADL — one YAML file per API, 25+ ready-made in the registry.

If you run an MCP setup that has outgrown its tool list, we would genuinely like to hear what discovery looks like at your scale, via GitHub Discussions or email.