KoldOps
June 7, 2026 · KoldOps

Storage for AI vs Object Storage (S3, R2, GCS): The Comparison Nobody Else Writes

S3, R2, and GCS hold bytes. They do not know what is in them. A storage for AI layer holds the canonical record an AI agent reads from. Different jobs. Often used together. Here is when and how.

Object storage (Amazon S3, Cloudflare R2, Google Cloud Storage) holds bytes. It does not know what is in them. A storage for AI layer holds the canonical record an AI agent reads from: the documents that are ranked, retrieved, and reasoned against on every call. They solve different problems at different layers. Most production AI systems use both, and each falls short on its own in the cases that matter: object storage alone cannot say who approved a change or why, and a storage for AI layer alone is a poor home for large binaries.

This is the comparison nobody else writes, because most teams quietly assume "we already have S3, so we have AI storage." They do not. The mistake is invisible until disaster recovery, audit defense, or a vendor-pricing change reveals the gap.

Short answer

Use object storage to hold blobs: drawings, photos, audio, video, model weights, log files, and the original PDFs that source documents were extracted from. Object storage is a flat namespace of opaque bytes accessed by key. It scales to billions of objects. It costs cents per gigabyte per month. It does not interpret what is inside the bytes.

Use a storage for AI layer to hold the canonical decision record the AI agent reads on every call: markdown documents in git, with named-reviewer gates, retrievable through hybrid search, and queryable as of a point in time. The two layers serve different roles. A production system typically has both, with the storage for AI layer referencing object storage for binary attachments.

What object storage is for

Object storage is the most useful storage primitive of the last two decades. S3 launched in 2006. R2, GCS, Azure Blob Storage, MinIO, and a dozen others followed. The contract: a flat namespace of objects identified by key, accessed over HTTP, replicated for durability, and billed per gigabyte stored and per request.

What object storage is excellent at:

  • Holding bytes at very large scale. Billions of objects, petabyte-sized buckets, and a design target of eleven 9s of durability (99.999999999%) on the major providers. Losing bytes to a provider fault is very unlikely.
  • Storing binary content cheaply. Photos, videos, audio, CAD drawings, PDF originals, machine-learning model weights. Object storage is the lowest cost-per-gigabyte option at scale.
  • Serving as the backend for higher layers. Most modern data warehouses (Snowflake, BigQuery, Databricks) ultimately store table files in object storage. Most modern AI workloads stream training data from it.
  • Disaster recovery for blobs. Cross-region replication, lifecycle policies, immutability locks. The infrastructure exists and works.

What object storage is not designed for, despite frequent attempts to use it that way:

  • Versioned, reviewable writes with named approvers. (Bucket versioning is optional, opaque, and does not capture authorship or rationale.)
  • Retrieval. (You bring your own retrieval layer.)
  • Querying the contents of objects. (Athena, BigQuery, S3 Select exist, but they are separate compute layers, not storage features.)
  • LLM-native format awareness. (S3 has no idea whether an object is markdown, PDF, JSON, or random bytes; it neither cares nor parses.)

Object storage is a primitive. It does one thing well. Building a substrate (the canonical layer an agent reasons against) on it requires every other layer to be added explicitly.
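The contract described above can be sketched as a toy in-memory store. This is an illustration only, not any provider's API: the class name `FlatObjectStore` and its methods are hypothetical, and the behavior shown assumes bucket versioning is switched off.

```python
from typing import Dict, List


class FlatObjectStore:
    """Toy in-memory stand-in for an object store: keys map to opaque bytes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        # A write replaces whatever was at the key; no author, message, or review.
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        # The store returns bytes exactly as written and never parses them.
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> List[str]:
        # "Folders" are only a naming convention: listing is a prefix match on keys.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("drawings/part-42-rev-c.pdf", b"%PDF-1.7 ...")
store.put("policies/qc-threshold.md", b"# QC threshold\n")
store.put("policies/qc-threshold.md", b"# QC threshold (revised)\n")  # silent overwrite

print(store.list_keys("policies/"))
print(store.get("policies/qc-threshold.md"))
```

The second write to the policy key simply replaces the first. Nothing in the store records who made the change, why, or whether anyone reviewed it, which is the gap the rest of this article is about.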

What a storage for AI layer is for

A storage for AI layer holds the canonical record an AI agent reads from and writes to. Five required properties: versioned (git-style), reviewable (PR-style gates), retrieval-native (BM25 + vector + graph), replayable (point-in-time queries), LLM-native (markdown the model can read directly). Full treatment in "Vector DBs Aren't Storage. They're Indexes."

The category is emerging. The pattern that satisfies the five properties is markdown files on disk, version-controlled with git, queried through hybrid retrieval, exposed over an open protocol like MCP. The storage layer holds what the AI needs to reason against. The bytes that the markdown front-matter references (the actual PDF, the actual CAD drawing, the actual photo) typically live in object storage.
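"Hybrid retrieval" means several retrievers each return a ranked list and something has to merge them. The article does not say how the lists are combined, so the sketch below assumes reciprocal rank fusion, one common choice. The function name, the document paths, and the three result lists are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is a commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from three retrievers over the same markdown repo.
bm25_hits = ["policies/qc-threshold.md", "decisions/2026-03-supplier.md"]
vector_hits = ["decisions/2026-03-supplier.md", "policies/qc-threshold.md", "notes/audit-prep.md"]
graph_hits = ["policies/qc-threshold.md"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits, graph_hits])
print(fused)
```

Rank-based fusion avoids having to normalize BM25 scores against vector similarities, which live on different scales. A document that appears near the top of several lists outranks one that tops a single list.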

Direct comparison

| Dimension | Storage for AI | Object Storage (S3 / R2 / GCS) |
|---|---|---|
| Primary role | Canonical record an agent reasons against | Flat namespace of opaque bytes |
| Content awareness | Yes (markdown, structured) | No (opaque bytes) |
| Versioning | Git-style, with author and message per change | Optional bucket versioning; no author or rationale |
| Write review | Required, PR-style gates | Not applicable; writes are PutObject |
| Retrieval | First-class: BM25, vector, graph | Not provided; bring your own (Athena, S3 Select, external index) |
| Replayability | Point-in-time query (git checkout) | Per-object version IDs only; no consistent timeline |
| Cost model | Storage cost (cheap), git hosting (cheap or free) | Storage + request charges + egress (egress is the usual surprise on S3 and GCS; R2 does not charge for egress) |
| Typical scale | Thousands to hundreds of thousands of documents | Billions of objects, petabyte buckets |
| Best for | Decision records, policies, knowledge, documentation | Photos, video, CAD, PDF originals, model weights, log files |
| Maturity | Emerging category | Mature since 2006 |

When you need object storage

Object storage is the right answer for any of these workloads, on its own:

  • Storing the binary source documents (PDFs, images, audio, video) that an AI substrate references but does not retrieve as text. Object storage is the cheapest, most durable home for bytes.
  • Holding machine-learning model weights, training corpora, and intermediate artifacts.
  • Serving as the backend for a data warehouse (Snowflake, BigQuery, Databricks) that the AI agent queries separately.
  • Long-term archive of operational logs, telemetry, or compliance records that are accessed rarely but must be retained.

None of these requires the substrate itself to live on object storage. Object storage is the byte layer underneath whatever interprets the bytes.

When you need a storage for AI layer

A storage for AI layer is the right answer when:

  • You need a defensible audit trail for AI answers. "Why did the agent say X" must trace to a specific document version with named author and date.
  • The AI needs to answer policy or decision questions. "What is our QC threshold for this part family" requires the current policy document, retrieved reliably.
  • You need version-controlled writes for compliance (AS9100, ISO 13485, ITAR, HIPAA, SOC 2). Auditors expect to see who changed what and when.
  • You want disaster recovery that is one git clone away from a full restore.
  • You want to escape vendor lock-in on the substrate. Markdown in a repo you own. Tomorrow's tool reads from the same files.

For most production AI systems, the answer is "both." The substrate sits on top of (or beside) object storage. They serve different roles.

The hybrid pattern: substrate plus object storage

The reference architecture for a production AI system that has both layers, in practice:

  1. The canonical decision documents live in markdown, in git, with named-reviewer gates. This is the storage for AI layer.
  2. The markdown documents have YAML front-matter that references the binary attachments. For example: attachments: [s3://bucket/drawings/part-42-rev-c.pdf, s3://bucket/photos/inspection-2026-05-21.jpg].
  3. The binary attachments live in object storage, addressed by stable keys. The substrate's git history version-controls the references; object storage holds the bytes the references point at.
  4. The retrieval stack indexes the markdown contents (BM25 plus vector plus graph). When the agent answers a question, it pulls the relevant markdown and surfaces the attachments as references.
  5. If the agent needs to inspect a binary (read the drawing, transcribe the audio, OCR the PDF), it fetches the binary from object storage on demand, processes it with a tool, and returns the result.

This pattern keeps the substrate small (markdown is text; the large binaries stay out of the index) and queryable (text-only retrieval is fast and predictable), while still letting the agent reach binary content when it actually needs it. The substrate is the reasoning layer; object storage is the byte layer.
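Steps 2 and 3 of the list above can be sketched as follows: read the front-matter of a markdown document and turn each attachment reference into a bucket and key. The parser handles only the inline-list form used in the example and is a deliberate simplification (a real system would more likely use a YAML library). The function names, the `title` field, and the document body are hypothetical.

```python
from typing import Dict, List, Tuple, Union
from urllib.parse import urlparse

FrontMatter = Dict[str, Union[str, List[str]]]


def split_front_matter(text: str) -> Tuple[FrontMatter, str]:
    """Split a markdown document into (front-matter fields, body).

    Handles only `key: value` and inline `key: [a, b]` lines, a small
    subset of YAML chosen to keep this sketch dependency-free.
    """
    lines = text.splitlines()
    if not lines or lines[0] != "---":
        return {}, text
    try:
        end = lines.index("---", 1)  # closing delimiter of the front-matter
    except ValueError:
        return {}, text
    fields: FrontMatter = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            continue
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            fields[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            fields[key.strip()] = value
    return fields, "\n".join(lines[end + 1:])


def parse_s3_uri(uri: str) -> Tuple[str, str]:
    """Return (bucket, key) for an s3://bucket/key reference."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


DOC = """---
title: Part 42 inspection decision
attachments: [s3://bucket/drawings/part-42-rev-c.pdf, s3://bucket/photos/inspection-2026-05-21.jpg]
---
Rev C drawing accepted after inspection.
"""

fields, body = split_front_matter(DOC)
refs = [parse_s3_uri(u) for u in fields["attachments"]]
print(refs)
```

Only the markdown body would go to the retrieval index. The (bucket, key) pairs are what the agent would hand to an object-storage client in step 5, and only when it actually needs the binary.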

The migration most teams need

A common starting architecture: "we put everything in S3, and we have a vector DB that crawls it." The pattern fails the moment the AI needs to answer a question that requires versioning, attribution, or review history.

The migration to a hybrid architecture:

  1. Audit the S3 bucket. Separate the binary content (PDFs, images, video, CAD) from the text content (extracted text, transcripts, summaries, written documentation).
  2. The binary content stays in S3 with the same keys. The bucket policy and lifecycle rules do not change.
  3. The text content moves to markdown files in a new git repository. The markdown front-matter references the S3 keys for the original binaries.
  4. The vector DB and retrieval stack point at the markdown repository instead of crawling the S3 bucket directly. The retrieval surface is faster and more accurate because the inputs are clean text instead of raw extraction output.
  5. The git repository becomes the substrate. The S3 bucket becomes the byte layer beneath it. Both are owned, both are portable, both survive a vendor change.

Total migration time for a substrate of a few thousand documents: roughly 2 to 6 weeks. The application layer and the vector DB software stay in place; only the source that retrieval is pointed at changes, so the index is rebuilt from the markdown repository.
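The audit in step 1 and the stub generation in step 3 might look like the sketch below. It assumes keys can be classified by file extension, which will not hold for every bucket. The extension sets, function names, and sample keys are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Assumed extension lists; adjust to what the bucket actually contains.
BINARY_EXTS = {".pdf", ".jpg", ".jpeg", ".png", ".mp4", ".wav", ".dwg", ".step"}
TEXT_EXTS = {".txt", ".md", ".json", ".vtt"}


def classify_keys(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Sort bucket keys into binary (stays in S3), text (moves to git), or unknown."""
    plan: Dict[str, List[str]] = {"binary": [], "text": [], "unknown": []}
    for key in keys:
        ext = PurePosixPath(key).suffix.lower()
        if ext in BINARY_EXTS:
            plan["binary"].append(key)
        elif ext in TEXT_EXTS:
            plan["text"].append(key)
        else:
            plan["unknown"].append(key)  # needs a human decision
    return plan


def markdown_stub(title: str, body: str, bucket: str, source_keys: List[str]) -> str:
    """Render a markdown file whose front-matter points back at the original binaries."""
    refs = ", ".join(f"s3://{bucket}/{k}" for k in source_keys)
    return f"---\ntitle: {title}\nattachments: [{refs}]\n---\n{body}\n"


keys = ["drawings/part-42-rev-c.pdf", "extracted/part-42-rev-c.txt", "tmp/scratch"]
plan = classify_keys(keys)
stub = markdown_stub("Part 42 drawing notes", "Extracted text goes here.", "bucket", plan["binary"])
print(plan)
print(stub)
```

The "unknown" bucket matters in practice: keys without a recognizable extension are where most of the manual audit time is likely to go.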

Frequently asked

Does S3 versioning solve the versioning property?

No. S3 bucket versioning produces a new version ID on every write, retains the prior bytes, and lets you list versions. It does not capture authorship, rationale, or review approval. It does not let you compare two versions as a diff. It does not let you walk the history as a coherent timeline across the bucket. Git does all of these. Bucket versioning is useful for accidental-deletion recovery. It is not git.
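The difference can be illustrated with a toy model of a commit timeline. This is not how git is implemented; it only shows what "replay as of a date, with author and rationale" means. The `Commit` class, the author names, and the policy text are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Commit:
    when: datetime
    author: str
    message: str
    changes: Dict[str, str]  # path -> new content written by this commit


def snapshot_as_of(history: List[Commit], when: datetime) -> Dict[str, Tuple[str, Commit]]:
    """Replay commits up to `when`; map each path to (content, commit that last wrote it)."""
    state: Dict[str, Tuple[str, Commit]] = {}
    for commit in sorted(history, key=lambda c: c.when):
        if commit.when > when:
            break  # everything after this point is in the future of the query
        for path, content in commit.changes.items():
            state[path] = (content, commit)
    return state


history = [
    Commit(datetime(2026, 3, 1), "a.rivera", "Set QC threshold to 0.5 mm",
           {"policies/qc.md": "threshold: 0.5 mm"}),
    Commit(datetime(2026, 5, 21), "j.chen", "Tighten QC threshold after audit",
           {"policies/qc.md": "threshold: 0.3 mm"}),
]

content, commit = snapshot_as_of(history, datetime(2026, 4, 15))["policies/qc.md"]
print(content, "|", commit.author, "|", commit.message)
```

One query returns the whole repository as it stood on that date, plus who wrote each file and why. In a real repository the rough equivalent is `git rev-list -1 --before=<date> <branch>` followed by `git show <commit>:<path>`. With bucket versioning you would instead list versions object by object and reconcile their timestamps yourself.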

What about S3 + an external Postgres for metadata?

Better than S3 alone. Postgres provides the metadata layer object storage lacks. It does not, by itself, provide review gates, point-in-time replay, or LLM-native format. It is a useful component inside a substrate, not a substitute for one. Most teams that adopt this pattern eventually adopt git on top, at which point Postgres often becomes redundant for the substrate use case (it remains useful for operational metadata like access logs).

Can a storage for AI layer live on S3?

The substrate's bytes can be backed by object storage. Git itself expects a filesystem, but the repository can be persisted to object storage underneath it: hosting providers commonly keep large-file (Git LFS) objects and backups there, and a repository can be exported with git bundle and stored as a single object. The user-facing layer is still git's content-addressable, branchable, reviewable model. The substrate is git on top of object storage, not object storage by itself.

What about Amazon S3 + LanceDB + a custom pipeline?

That is a reasonable foundation for one of the six components inside a context engine. It is not, by itself, a storage for AI layer because it lacks the review gates, the LLM-native format, and the point-in-time queries. The team that goes this route typically reinvents git, badly, over the following 18 months. Using git directly is faster.

What about MinIO for on-premises object storage?

MinIO is excellent on-premises object storage. The same argument applies: it is the byte layer, not the substrate. An on-premises substrate uses git plus MinIO (or just git plus a filesystem) the same way a cloud substrate uses git plus S3.

What's next

If your current AI architecture is "S3 plus a vector DB," map your bucket against the five properties of storage for AI. Score each one: versioned, reviewable, retrieval-native, replayable, LLM-native. The score will likely be 0 or 1 out of 5, and the work plan follows.
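The scoring exercise is simple enough to write down. The sketch below assumes a yes/no answer per property. The function names and the sample assessment, which gives an "S3 plus a vector DB" setup credit for retrieval only, are hypothetical.

```python
from typing import Dict, List

PROPERTIES = ("versioned", "reviewable", "retrieval-native", "replayable", "LLM-native")


def score(assessment: Dict[str, bool]) -> int:
    """Count how many of the five properties the current architecture satisfies."""
    missing = set(PROPERTIES) - set(assessment)
    if missing:
        raise ValueError(f"unassessed properties: {sorted(missing)}")
    return sum(assessment[p] for p in PROPERTIES)


def work_plan(assessment: Dict[str, bool]) -> List[str]:
    """List the unmet properties, in the order the article names them."""
    return [p for p in PROPERTIES if not assessment[p]]


# Hypothetical assessment of an "S3 plus a vector DB" architecture.
s3_plus_vector_db = {
    "versioned": False,
    "reviewable": False,
    "retrieval-native": True,
    "replayable": False,
    "LLM-native": False,
}

print(score(s3_plus_vector_db), "of", len(PROPERTIES))
print(work_plan(s3_plus_vector_db))
```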

For the conceptual treatment of why this distinction matters, see "Vector DBs Aren't Storage. They're Indexes." For the comparison of storage for AI against vector databases specifically, see "Storage for AI vs Vector Databases." For the substrate philosophy, see "Decision-State, Airlocked to Code-State."