Vector DBs Aren't Storage. They're Indexes.
A vector database is an index. The storage is somewhere else, usually a Postgres table or an S3 bucket nobody talks about. The confusion costs every AI team a quarter of architecture time and most of their disaster-recovery posture. The fix is to stop calling indexes storage and to start naming the category that is actually missing: storage for AI.
The category confusion
Pinecone, Weaviate, Chroma, Qdrant, Milvus, pgvector. Listed in any architecture diagram for a serious AI system. Described, often, as "the storage layer." They are not. They are indexes built over embeddings, optimized for approximate nearest-neighbor search. The embeddings come from somewhere. The documents the embeddings were generated from live somewhere. The metadata that joins them lives somewhere. None of those somewheres is the vector DB.
If you delete your vector DB tomorrow, you re-embed and re-index from the source documents. If you delete your source documents, no vector DB recovers you. That asymmetry is the definition of storage versus index.
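The asymmetry can be sketched in a few lines. This is a minimal illustration, not a real pipeline: a toy inverted index stands in for the ANN index, and `SOURCE_DOCS` and `build_index` are hypothetical names invented for the example.

```python
from collections import defaultdict

# Hypothetical primary store: document id -> text. In a real system this is
# the Postgres table, bucket, or markdown folder holding the source documents.
SOURCE_DOCS = {
    "doc-1": "vector databases are indexes",
    "doc-2": "storage holds the source documents",
}


def build_index(docs):
    """Derive a term -> set-of-doc-ids lookup from the primary documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


index = build_index(SOURCE_DOCS)

# Losing the index is recoverable: derive it again from the source.
index = None
rebuilt = build_index(SOURCE_DOCS)
print(sorted(rebuilt["documents"]))  # ['doc-2']

# Losing the source is not: the index maps terms to ids, and the original
# text cannot be reconstructed from that mapping.
```

The same shape holds when `build_index` is an embedding job followed by an ANN build: it is a function of the source, and the source is not a function of it.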
What storage actually requires
Five properties. A real storage layer for AI workloads satisfies all 5. Most vector databases satisfy at most 1.
- Versioned. Every write produces a diff. Authors and timestamps are first-class. You can read the system as it was on any prior date.
- Reviewable. Writes pass through a gate, the same way code passes through a pull request. Append-only is not enough. A canonical version exists at every moment.
- Retrieval-native. BM25, vector similarity, and graph traversal are wired in, not bolted on with a different vendor each.
- Replayable. Point-in-time queries answer "what did this system know on March 14." Audit defensibility depends on it.
- LLM-native. The storage format is the prompt format. Markdown the model can read and write, not an opaque binary nobody can diff.
Test your vector DB against the 5. Most score 0. Some score 1, for "retrieval-native," and only for the vector portion of it. None of them version writes by author. None of them gate writes through review. None of them replay state at a prior timestamp. The format is opaque binary. They are indexes.
What index actually means
An index is a derived data structure that makes lookups fast against a primary source. A B-tree on a Postgres column. An inverted index in Lucene. An HNSW graph over embeddings in Qdrant or Weaviate. The defining property: the index can be rebuilt from the primary source. If the index is lost, the data is not lost. If the primary source is lost, the index becomes a useless artifact.
This is not a criticism. Indexes are essential. Postgres without B-trees is unusable at scale. Search without an inverted index is grep on a corpus. Embedding retrieval without an HNSW index is a linear scan over a million vectors. Indexes do real work. They are just not the layer where the data lives.
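For a sense of what the index replaces, here is a rough sketch of the linear scan: exact nearest-neighbor search that compares the query against every stored vector. The function names and the 2-d vectors are made up for illustration; an HNSW index exists to avoid touching every vector like this.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def linear_scan(query, vectors, k=1):
    """Exact top-k search: compare the query to every stored vector, O(n)."""
    scored = [(cosine(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [vec_id for _, vec_id in scored[:k]]


# Hypothetical 2-d embeddings; real ones have hundreds of dimensions.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(linear_scan([0.9, 0.1], vectors, k=2))  # ['a', 'c']
```

The scan is correct and needs no index at all. It just costs one similarity computation per stored vector on every query, which is the work the index trades away for approximate answers.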
Where the actual storage hides
Look at the architecture diagram of any production system that uses a vector DB. Trace the data flow backward from the embedding. The embedding came from a document. The document was chunked. The chunks were sourced from a Notion workspace, a GitHub repository, a Drive folder, a Confluence wiki, an internal CMS, or an export from an ERP.
That is where the storage is. The vector DB holds derived representations of those documents. The documents themselves live in:
- Postgres or MySQL, behind some application's schema
- S3, R2, or GCS buckets, as files
- SaaS APIs (Notion, Drive, Confluence) that crawl jobs hit on a cron
- A markdown folder on someone's laptop, synced to a private repo
The hidden storage layer in most AI architectures is whichever of the above happens to be hosting the source documents. It was never designed for AI workloads. It was designed for human reading, or for a different application's CRUD pattern, or for nothing in particular. The vector DB papers over the mismatch by indexing whatever the crawler can reach. The mismatch remains.
The category that is actually missing
An AI workload needs a persistence layer with the 5 properties above. The market does not currently sell one. Vendors sell indexes (Pinecone, Weaviate), memory layers (Mem0, Letta), or general-purpose object stores (S3) and assemble them with glue.
Call the missing category storage for AI. The shape it has to take is determined by the 5 properties, not by any specific vendor's roadmap:
- Markdown documents on disk for human and LLM readability
- Git for version control, author attribution, and review gates
- Hybrid retrieval (BM25 plus vector plus graph) wired in, not a separate vendor
- A point-in-time query interface that can replay the substrate's state at any prior timestamp
- An open protocol (MCP) for tool interconnect, so any agent can read the storage fluently
None of these are exotic. Apart from MCP, which is recent, each technology has existed for at least a decade. The novelty is assembling them, with discipline, as one storage layer rather than as five separate concerns. That assembly is what the AI substrate is.
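As one illustration of "wired in," the three retrievers can be fused inside the storage layer with a single ranking function. The sketch below uses reciprocal rank fusion, which is an assumption chosen for the example rather than something this piece prescribes; the file names and hit lists are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of three retrievers over the same markdown corpus.
bm25_hits = ["adr-12.md", "adr-07.md", "notes.md"]
vector_hits = ["adr-07.md", "adr-12.md", "faq.md"]
graph_hits = ["adr-07.md", "faq.md"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits, graph_hits])
print(fused)  # ['adr-07.md', 'adr-12.md', 'faq.md', 'notes.md']
```

Rank-based fusion sidesteps the problem that BM25 scores, cosine similarities, and graph distances are not on comparable scales. Only the positions matter.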
When to keep using a vector database
The argument is not that vector databases are bad. They are excellent at the job they actually do. Use one when:
- You need approximate nearest-neighbor search over more than 10 million vectors with sub-100ms latency
- You have a managed-service preference and the cost works for your workload
- The thing being retrieved is genuinely best-described as a vector (image embeddings, audio fingerprints)
Use the storage-for-AI layer underneath it, regardless. The vector DB is the index. The storage is the substrate. Conflating them is a billing surprise waiting to happen on the day a vendor changes pricing, and a disaster-recovery hole on the day someone deletes the wrong project.
Frequently asked
Is pgvector storage or an index?
pgvector is a Postgres extension that adds a vector column type and approximate nearest-neighbor index types (HNSW and IVFFlat). The storage is the Postgres table the embeddings sit in, plus whatever table or external file holds the source documents. pgvector is closer to "storage and index in one place" than the standalone vector DBs only because Postgres is already a real storage layer. The retrieval portion is still an index.
What about object storage like S3 or R2?
S3 is excellent storage for objects. It is not storage for AI. Its object versioning keeps prior copies of an object, but it does not produce diffs or treat authorship as first-class. It does not gate writes through review. It is not retrieval-native (you bring your own search layer). It is not replayable in the point-in-time-query sense without significant custom work. Object storage is a useful primitive inside a storage-for-AI layer, not a substitute for one.
Why is "retrieval-native" a storage property and not an indexing property?
Because the storage layer's job is to answer the agent's question, not to hand the agent a raw blob and require it to bring its own search vendor. If retrieval is bolted on, the storage and the index drift out of sync, latency compounds across hops, and replayability becomes impossible because the index's state at time T is no longer recoverable from the storage's state at time T. Retrieval has to be a first-class storage concern.
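A toy sketch of the "state at time T" argument, under loud assumptions: `VersionedStore` is a hypothetical in-memory stand-in for what Git or a versioned table would provide, timestamps are plain integers, and the index is a simple term lookup rather than a vector index.

```python
from bisect import bisect_right


class VersionedStore:
    """Toy append-only store: every write is kept with its timestamp and author."""

    def __init__(self):
        self._history = {}  # path -> list of (timestamp, author, text)

    def write(self, path, text, author, timestamp):
        # Assumes writes arrive in increasing timestamp order per path.
        self._history.setdefault(path, []).append((timestamp, author, text))

    def snapshot(self, as_of):
        """Return {path: text} as the store looked at time `as_of`."""
        state = {}
        for path, versions in self._history.items():
            stamps = [v[0] for v in versions]
            pos = bisect_right(stamps, as_of)
            if pos:
                state[path] = versions[pos - 1][2]
        return state


def build_index(docs):
    """Derive a term -> sorted paths lookup from one snapshot."""
    index = {}
    for path, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(path)
    return {term: sorted(paths) for term, paths in index.items()}


store = VersionedStore()
store.write("pricing.md", "tier one costs ten", "ana", timestamp=1)
store.write("pricing.md", "tier one costs twelve", "ben", timestamp=5)

# Because the index is a pure function of a snapshot, the index at time T
# is recoverable from the storage at time T.
print(build_index(store.snapshot(as_of=3)).get("ten"))  # ['pricing.md']
print(build_index(store.snapshot(as_of=5)).get("ten"))  # None
```

When the index lives with a separate vendor and is updated on its own schedule, nothing guarantees that property. The index at time T reflects whatever the last sync happened to capture.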
What does "LLM-native" mean as a storage requirement?
The storage format is the same format the model reads at inference time. Markdown is LLM-native. Parquet files of token IDs are not. An opaque binary blob you have to deserialize through a vendor's SDK is not. The test: can a fresh LLM agent open a file from the storage layer and use it directly, without a translation step? If yes, the layer is LLM-native. If no, it is not.
What's next
Run the 5-property test against your current architecture this week. Map each layer to versioned, reviewable, retrieval-native, replayable, LLM-native. Most teams find their stack scores 1 out of 5 across the whole pipeline, with the 1 being "retrieval-native" and only for the vector portion. That gap is the work.
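One way to run the test is a plain scorecard. The sketch below is a hypothetical self-assessment: the layer names and the properties assigned to them are placeholders to replace with your own stack.

```python
PROPERTIES = ("versioned", "reviewable", "retrieval-native", "replayable", "llm-native")

# Hypothetical self-assessment of one stack; replace with your own layers.
stack = {
    "vector db": {"retrieval-native"},
    "s3 bucket": set(),
    "postgres": set(),
}


def score(stack):
    """Return (properties covered by at least one layer, properties missing)."""
    covered = set().union(*stack.values())
    have = [p for p in PROPERTIES if p in covered]
    missing = [p for p in PROPERTIES if p not in covered]
    return have, missing


have, missing = score(stack)
print(f"{len(have)} of {len(PROPERTIES)}: {have}")
print("gap:", missing)
```

Scoring across the whole pipeline is the generous reading. A property counts if any layer provides it, even though the argument here is that one layer should provide all 5.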
The full philosophy lives in the pillar: Decision-State, Airlocked to Code-State: Defining the AI Substrate.