KoldOps
June 8, 2026 · KoldOps

The 6-Question Context Engine Audit (And What Your Score Means)

Six questions that diagnose whether your AI system has a real context engine or a collection of half-built components. Score in 15 minutes. What each band of 0 to 6 means and what to fix first.

The context engine audit is a 6-question diagnostic. Each question maps to one of the six components of a real context engine. Score 0 or 1 per question. Total honestly. Most production AI systems score 1 or 2 of 6, which means the team has shipped one or two components and a marketing layer that calls the result a "context engine."

This page is the deep treatment of each question. What a "yes" looks like in practice. What a "no" looks like. The common shapes of every failing answer. The action to take at each score band.

For the broad framing on what a context engine is and why most teams build one by accident, see "Every AI Team Builds a Context Engine. Most Don't Realize It."

Question 1. Can your AI agent read every relevant system of record without manual export?

Pick any document or data element the agent might need to answer a real production question. A customer record, a quality manual, a contract clause, a piece of operational data. Ask: can the agent retrieve it on demand, at request time, without a human exporting it to a file first?

Yes looks like: a set of connectors that pull from the systems of record on a schedule or on demand. The agent's retrieval surface includes the latest data without manual intervention. New connectors are added in days, not months.

No looks like: "we exported a snapshot to S3 last month." Or: "the agent only knows about the documents we manually uploaded to the vector DB." Or: "we have a Notion export running quarterly." The agent sees stale data, partial data, or no data depending on which system holds the answer.

Common shapes of no: connectors exist for one or two systems and were never extended; the team treats data ingestion as a one-time project; the systems-of-record team and the AI team do not coordinate, so the agent gets a third-party crawl that misses half the content.
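One way to make question 1 checkable rather than anecdotal is to record when each connector last synced and compare that against how stale each source is allowed to get. The sketch below is a minimal illustration under assumed names: `ConnectorStatus`, `stale_connectors`, and the example systems are hypothetical, not part of any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ConnectorStatus:
    system: str                    # hypothetical system-of-record name
    last_sync: Optional[datetime]  # None means no connector exists yet
    max_age: timedelta             # how stale this source is allowed to get


def stale_connectors(statuses, now=None):
    """Return the systems whose data is missing or older than allowed."""
    now = now or datetime.now(timezone.utc)
    failing = []
    for status in statuses:
        if status.last_sync is None or now - status.last_sync > status.max_age:
            failing.append(status.system)
    return failing
```

An empty result is a reasonable working definition of "yes" for this question. A monthly S3 snapshot or a quarterly export would show up here as a source whose age exceeds its allowed window.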

Question 2. Does the agent use hybrid retrieval, not just vector similarity?

Trace one query the agent receives. Look at the retrieval call. Is it pulling from a vector index alone, or is it combining BM25 (lexical) with vector similarity and, ideally, a graph layer that captures document relationships?

Yes looks like: hybrid retrieval with at least two search modalities (BM25 plus vector at minimum), a reranker on top, and source attribution attached to every retrieved chunk. The retrieval is observable and tunable.

No looks like: a single vector-search call returning the top-K nearest chunks, with no rerank, no lexical fallback, no graph traversal. The agent receives the documents the embedding model happened to think were similar, regardless of whether they are actually relevant.

Common shapes of no: the team picked one vector DB vendor early, defaulted to its API, and never integrated a second retrieval mode; BM25 was considered "old-school" and skipped; the reranker was scoped for "v2" that never shipped.
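Combining two retrieval modes does not require much machinery. A common fusion technique is reciprocal rank fusion (RRF), which merges ranked lists without needing their scores to be comparable. The sketch below assumes the BM25 and vector searches have already run and each returned an ordered list of document IDs; it is one possible fusion step, not necessarily the one any given stack uses, and it does not cover the reranker or source attribution.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both lexical and vector search rise.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Calling `reciprocal_rank_fusion([bm25_ids, vector_ids])` gives a single ordering that a reranker can then refine. The constant `k=60` is a conventional default and is worth tuning against real queries.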

Question 3. Is there a versioned, reviewed canonical record the retrieval queries against?

This is the substrate question. Pick a piece of content the retrieval just returned. Look at its source. Does that source have a version history with named authors? Did the most recent change pass through a review gate before becoming canonical?

Yes looks like: markdown files in git, with pull-request workflow, named reviewers, and merge gates. Every change is attributable. Every change is reviewable. The history is a coherent timeline, not a sequence of opaque writes.

No looks like: the source is a wiki anyone can edit without review; or a PDF nobody knows the provenance of; or a SaaS document whose history is hidden behind a paid tier.

Common shapes of no: the team adopted a vector DB, pointed it at the company wiki, and never asked whether the wiki itself was a substrate; the documents being indexed are exported from a system that overwrites silently; the "current version" of an important document depends on which folder you look in.

For a deeper treatment of this specific component, run the substrate audit. The substrate is the most common failure point inside a context engine.
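The version-history half of question 3 can be spot-checked mechanically when the source lives in git. The sketch below shells out to `git log` for one file and checks that every change has a named author. The function names are hypothetical, and the check is deliberately partial: whether a change passed a review gate is recorded by the hosting platform's pull-request data, which `git log` alone does not show.

```python
import subprocess

# Commit hash, author name, author date (ISO 8601), separated by the
# ASCII unit separator (0x1f) so names containing "|" or "," are safe.
LOG_FORMAT = "%H%x1f%an%x1f%aI"


def parse_git_log(output):
    """Turn `git log --format=<LOG_FORMAT>` output into a list of dicts."""
    history = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        commit, author, date = line.split("\x1f", 2)
        history.append({"commit": commit, "author": author, "date": date})
    return history


def file_history(path, repo="."):
    """Fetch the change history of one file, following renames."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--follow", f"--format={LOG_FORMAT}", "--", path],
        capture_output=True, text=True, check=True,
    )
    return parse_git_log(result.stdout)


def has_attributable_history(history):
    """True if there is at least one commit and every commit names an author."""
    return bool(history) and all(entry["author"].strip() for entry in history)
```

A source that fails this check (no history, or history without authors) is a "no" on question 3 before the review-gate question even comes up.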

Question 4. Does the agent remember prior turns of a conversation across sessions?

Pick a recent conversation an agent had with a user. Start a new session. Ask a question that requires context from the prior session. Does the agent know what was discussed, what was decided, what was tried?

Yes looks like: a session-state store indexed by user and project, with the relevant turn-by-turn history loaded into the context window of each new call. The agent picks up where it left off. It does not re-ask questions the user has already answered.

No looks like: every new session starts cold. The user re-explains the project, re-uploads the same documents, re-states the constraints. The agent treats each conversation as if it were the first.

Common shapes of no: the team thought "memory" was a feature of the LLM and assumed the vendor's memory tool was enough; the agent has per-session context but loses it at the session boundary; conversation history is stored but never retrieved into the active context window.
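A session-state store indexed by user and project can be small. The sketch below uses SQLite from the Python standard library purely for illustration; the `SessionStore` class and its schema are assumptions, and a production system would more likely sit on Postgres or Redis. The point it shows is the last failing shape above: history has to be both stored and read back into the next call.

```python
import sqlite3


class SessionStore:
    """Turn history keyed by user and project, surviving session boundaries."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS turns ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " user_id TEXT NOT NULL, project_id TEXT NOT NULL,"
            " session_id TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL)"
        )
        self.db.execute(
            "CREATE INDEX IF NOT EXISTS idx_turns_owner"
            " ON turns (user_id, project_id, id)"
        )

    def append(self, user_id, project_id, session_id, role, content):
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO turns (user_id, project_id, session_id, role, content)"
                " VALUES (?, ?, ?, ?, ?)",
                (user_id, project_id, session_id, role, content),
            )

    def recent_turns(self, user_id, project_id, limit=20):
        """Most recent turns across all sessions, returned oldest first."""
        rows = self.db.execute(
            "SELECT role, content FROM turns"
            " WHERE user_id = ? AND project_id = ?"
            " ORDER BY id DESC LIMIT ?",
            (user_id, project_id, limit),
        ).fetchall()
        return [{"role": role, "content": content} for role, content in reversed(rows)]
```

At the start of each new session, the application would call `recent_turns(user_id, project_id)` and place the result in the context window before the user's first message. Summarizing older history instead of loading it verbatim is a separate design choice not shown here.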

Question 5. Is the agent's access to tools and context exposed over an open protocol?

In 2026 the protocol is almost always MCP (Model Context Protocol). Pick the agent. Look at how it accesses tools, context, and the substrate. Is the access mediated by a published, stable, vendor-agnostic protocol? Or is it a tangle of bespoke HTTP endpoints and SDK wrappers specific to one agent vendor?

Yes looks like: the agent (whether Claude, GPT, Llama, or a future model) connects to the context engine over MCP or an equivalent open protocol. Switching agents requires reconfiguring the agent's connection, not rebuilding the context engine.

No looks like: the integration was built specifically for the agent vendor in use. A second agent vendor requires duplicate integration work. Internal tools have custom HTTP shims unique to the application.

Common shapes of no: the team built against the Anthropic SDK directly, before MCP was mature, and has not migrated; the team uses LangChain or another framework that abstracts the protocol but adds its own lock-in; the agent's tool definitions are inline strings in application code instead of a queryable MCP server.
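To make the contrast with inline tool strings concrete, the sketch below shows the message shapes involved. MCP is built on JSON-RPC 2.0, and a client discovers tools with `tools/list` and invokes them with `tools/call`. This is a toy in-process dispatcher, not a working MCP server: it skips initialization, transport, and capability negotiation, which a real deployment would get from an MCP SDK. The `lookup_policy` tool and its handler are made up for illustration.

```python
# Hypothetical tool registry. The field names (name, description, inputSchema)
# follow the MCP tool definition shape; the tool itself is invented.
TOOLS = {
    "lookup_policy": {
        "description": "Return the canonical text of a policy by ID.",
        "inputSchema": {
            "type": "object",
            "properties": {"policy_id": {"type": "string"}},
            "required": ["policy_id"],
        },
        "handler": lambda args: f"Policy {args['policy_id']}: (text from substrate)",
    }
}


def handle(request):
    """Answer a JSON-RPC 2.0 request for tools/list or tools/call."""
    method, req_id = request.get("method"), request.get("id")
    if method == "tools/list":
        # Tool definitions are data the agent can query, not strings in app code.
        tools = [
            {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
            for name, t in TOOLS.items()
        ]
        result = {"tools": tools}
    elif method == "tools/call":
        params = request.get("params", {})
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return {"jsonrpc": "2.0", "id": req_id,
                    "error": {"code": -32602, "message": "Unknown tool"}}
        text = tool["handler"](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": req_id, "result": result}
```

Because the agent learns what tools exist by asking, swapping the agent means pointing a different client at the same server rather than rewriting the tool definitions.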

Question 6. Is there a process that detects when operations have drifted from the recorded substrate?

The substrate says X. The operations are doing Y. Who notices? When? How fast?

Yes looks like: a drift detector (manual periodic audit, automated comparison, or a dedicated drift-detection tool) that flags divergence between what the substrate records and what the operation actually does. The flag goes to a named owner who responds within a defined window.

No looks like: nobody checks until an external event (an audit, a customer complaint, a regulatory finding, an incident) reveals the gap. The substrate and reality silently desync until somebody is forced to look.

Common shapes of no: the team built the substrate and assumed it would stay current, with no mechanism beyond the hope that people would maintain it; drift detection was scoped for a "phase 2" that has not happened; the operations team and the substrate team report to different leaders and do not share metrics.
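The automated-comparison form of a drift detector can start as a plain diff between what the substrate records and what operations report, routed to a named owner. The sketch below assumes both sides have already been reduced to key-value pairs; the keys, the `DriftFinding` type, and the owner mapping are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DriftFinding:
    key: str
    recorded: object  # what the substrate says
    observed: object  # what operations actually do
    owner: str        # who receives the flag


def detect_drift(recorded, observed, owners, default_owner="unassigned"):
    """Flag every key where the observed value differs from the recorded one."""
    findings = []
    for key in sorted(recorded):
        if key not in observed:
            continue  # nothing measured; a stricter audit might flag this too
        if recorded[key] != observed[key]:
            findings.append(
                DriftFinding(key, recorded[key], observed[key],
                             owners.get(key, default_owner))
            )
    return findings
```

Run on a schedule, this covers the "who notices, when, how fast" part of the question. Any finding whose owner comes back as `"unassigned"` is itself a gap, since the flag has nowhere to go.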

Scoring bands

| Score | What it means | What to do |
| --- | --- | --- |
| 5 or 6 | A real context engine. The components are wired and operating. AI deployed on top of it can ship. | Maintain. Quarterly audit. Add capacity to the weakest component. Watch for the component you scored 0 on, if any. |
| 3 or 4 | Partial context engine. The shipped components produce useful agent behavior on the workloads they cover. The missing components limit what the agent can do. | Identify the missing components. Fill in priority order. The substrate (question 3) is usually the highest-impact gap because the others depend on it. |
| 0 to 2 | No context engine. The "context engine" is actually a vector DB plus a crawler, marketed internally as more. The agent's output is what that combination produces. | Build the substrate first (question 3). Wire one connector to one system of record. Stand up MCP. The other components are easier once the substrate is real. |
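The scoring itself is simple enough to write down, which helps keep the total honest. The sketch below maps six yes/no answers to a total, a band from the table above, and the list of gaps. The short question keys are labels invented for this example.

```python
QUESTIONS = (
    "connectors",        # question 1
    "hybrid_retrieval",  # question 2
    "substrate",         # question 3
    "session_state",     # question 4
    "open_protocol",     # question 5
    "drift_detection",   # question 6
)


def score_audit(answers):
    """Return (total, band, gaps) for a dict of question key -> bool."""
    missing = [q for q in QUESTIONS if q not in answers]
    if missing:
        raise ValueError(f"unanswered questions: {missing}")
    total = sum(1 for q in QUESTIONS if answers[q])
    if total >= 5:
        band = "real context engine"
    elif total >= 3:
        band = "partial context engine"
    else:
        band = "no context engine"
    gaps = [q for q in QUESTIONS if not answers[q]]
    return total, band, gaps
```

Each answer is strictly a 0 or a 1 here; there is no half credit, so a vector-only retrieval layer counts as a gap on question 2.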

The most common failure mode

Across teams that ship AI features into production, the typical 2-of-6 score has the same shape. Question 1 (connectors) scores 1 because connectors exist for at least the obvious systems. Question 2 (hybrid retrieval) is where the second point comes from, and it is a generous one: a vector DB is wired in, but vector-only retrieval was picked early and never extended, so on strict 0-or-1 scoring it is a 0 and the honest total is 1. Question 3 (substrate) scores 0 because the team treated the wiki as the substrate without auditing it; question 4 (session state) scores 0 because the team relied on the vendor's memory tool; questions 5 and 6 (protocol, drift detection) score 0 because they were scoped for "v2."

Nothing wrong with that as a starting point. Everything wrong with calling it a context engine and stopping there.

How this audit interacts with the substrate audit

The substrate audit is the deep diagnostic for question 3 alone. The substrate is the load-bearing component inside the context engine; if the substrate score is 0 or 1, the context engine score is bounded above by it, regardless of how well the other components are built. Run the substrate audit if you scored 0 on question 3 and want to understand what is missing.

The context engine audit is the broader, system-level diagnostic. Run it when you are evaluating the whole AI integration, not just the data layer.

Frequently asked

What if my team only has one component shipped (a vector DB)?

Then you have one component shipped. That is not a context engine; it is one of the six pieces of one. Calling it a context engine internally creates a planning trap because the team thinks the work is done. Score honestly. Plan for the missing five.

Does using LangChain or LlamaIndex count as "context engine, off the shelf"?

Those are frameworks. They give you tools to build a context engine. They do not, by themselves, satisfy any of the 6 questions. A LangChain application that scores 4 of 6 is a 4-of-6 context engine. A LangChain application that scores 1 of 6 is a 1-of-6 context engine. The framework is not the score.

Can I score 6 of 6 without ever buying a vendor product?

Yes. The 6 components are all available as open-source or commodity infrastructure. Connectors can be written in Python. Hybrid retrieval is lexical search (BM25 in Lucene or Tantivy, or Postgres full-text search, whose built-in ranking is not BM25) plus vector (pgvector, LanceDB, FAISS) plus optional graph (FalkorDB, Neo4j Community). Substrate is markdown plus git. Session state is Postgres or Redis. The MCP specification and its SDKs are open source. Drift detection can be a cron job that compares substrate to operational metrics. The assembly is the work; the components are not gated.

What if my AI is internal-tooling only and the scores do not seem to matter?

The scores matter the day the internal AI returns a confidently wrong answer that someone acts on. Internal-tooling AI is the same architecture as customer-facing AI; it just has lower observability when it fails. If the agent's output is consequential, the score matters.

What's next

If you scored 0 to 2, start with the substrate. Question 3 is the load-bearing component. Run the substrate audit against your current data layer. Build the substrate. The other components are easier once the substrate is real.

If you scored 3 or 4, identify the missing components and fix them in priority order. The right order is usually: substrate, then session state, then drift detection, then protocol, then hybrid retrieval, then connectors. Connectors look like the obvious starting point but are usually the easiest to fix later.

If you scored 5 or 6, maintain. The components rot. A quarterly audit keeps them current.

For a second opinion on the score before acting on it, the Business System Review is the fixed-scope engagement. We score the components, identify the priority order, and hand back a written report you can act on without hiring us for anything else.