sagi.org

Buckets and objects are not enough

Amazon S3 recently turned 20 years old. I’ve been using it since 2008, and it remains my favorite form of cloud storage: cheap, scalable, durable, and fast enough for an enormous range of use cases.

At this point, many companies use it as the de facto storage fabric for almost everything: tabular data, unstructured logs, ML training datasets, media files, backups, exports, and plenty of things that were never designed all that carefully to begin with.

S3 organizes data into buckets, each containing a virtually unlimited number of objects. In practice, many teams end up with large buckets that contain data of many different kinds: outputs of multiple systems, ad hoc exports, historical leftovers, and objects uploaded manually by humans or scripts nobody quite remembers anymore.

Within those buckets, there are often groups of objects that clearly belong together. A logical table stored as Parquet. Logs written by a particular system. Media files associated with a specific project. A backup set. A training corpus. I’ll call these datasets.

The problem is that S3 does not have a first-class way to express that these objects belong together.

Prefixes are doing more work than they were meant to

The usual answer is naming conventions. Most teams organize related objects by prefix, meaning they share the same leading path. The / character is often used as a delimiter, creating something that looks like a folder hierarchy.

This is only a convention, of course. AWS has always been clear that S3 has no folders, and that / is not special. You can store a billion objects in a bucket with no meaningful naming scheme at all, and as far as S3 itself is concerned, that works just fine.

That is technically true. In practice, almost nobody works that way.

People organize by prefix anyway. Not because S3 needs the hierarchy, but because the humans reading the bucket do. It is how people make sense of a billion objects.
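To see how thin the convention is, here is a small sketch, in plain Python with made-up keys, of what S3's ListObjectsV2 does when you pass a Delimiter: it groups matching keys into CommonPrefixes, and that grouping is the entire "folder" illusion.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Mimic S3 ListObjectsV2 semantics: split matching keys into
    'folders' (CommonPrefixes) and directly contained objects (Contents)."""
    common_prefixes, contents = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter after the prefix
            # is reported as a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return sorted(common_prefixes), contents

keys = [
    "lake/analytics/sessions_v1/year=2025/month=11/part-00000.parquet",
    "lake/analytics/sessions_v1/year=2025/month=12/part-00000.parquet",
    "lake/raw/events.log",
    "README.txt",
]
print(list_with_delimiter(keys, prefix="lake/"))
# → (['lake/analytics/', 'lake/raw/'], [])
```

Nothing in storage changes between the two calls; only the listing parameters do. The hierarchy exists purely in how the keys are sliced.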

But not all prefixes play the same role.

Suppose you have a path like:

s3://acme/lake/analytics/sessions_v1/

and suppose this is a partitioned Parquet table. It might contain tens of thousands of objects under paths like:

year=2025/month=11/day=02/client=android/part-00000.parquet

The most specific prefix level is then:

s3://acme/lake/analytics/sessions_v1/year=2025/month=11/day=02/client=android/

That is probably too specific for most management purposes. It is a partition. Useful as an implementation detail, but usually not the thing someone wants to own, govern, archive, or reason about independently.

If you go too high, to something like:

s3://acme/lake/

or:

s3://acme/lake/analytics/

you start grouping unrelated datasets together.

So in this example, the meaningful boundary is:

s3://acme/lake/analytics/sessions_v1/

That is the dataset.

The prefixes above it, like /lake/ and /lake/analytics/, are organizational prefixes. They help group related datasets and form a natural hierarchy.

The prefixes below it are implementation detail. They matter to the engine reading the data, but from a storage-management perspective, they are usually not the right unit of meaning.
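One reason the dataset boundary is recoverable at all is that partition levels tend to look different from organizational ones. As a rough illustration, and not any official S3 behavior, a Hive-style key=value segment is a strong hint that everything from that point down is partition layout:

```python
import re

# Heuristic: Hive-style partition segments look like "key=value".
PARTITION_SEGMENT = re.compile(r"^[\w.\-]+=[^/]*$")

def split_dataset_path(key):
    """Split an object key into (dataset_root, partition_path, filename),
    treating the first key=value segment as the start of the partition layout."""
    parts = key.split("/")
    filename = parts.pop()  # the last segment is the object itself
    for i, segment in enumerate(parts):
        if PARTITION_SEGMENT.match(segment):
            return "/".join(parts[:i]) + "/", "/".join(parts[i:]) + "/", filename
    return "/".join(parts) + "/", "", filename

key = ("lake/analytics/sessions_v1/"
       "year=2025/month=11/day=02/client=android/part-00000.parquet")
print(split_dataset_path(key))
# → ('lake/analytics/sessions_v1/', 'year=2025/month=11/day=02/client=android/', 'part-00000.parquet')
```

This only handles one common layout, of course; date-based paths, backup sets, and media trees need their own signals. The point is that the boundary is often inferable from the keys alone.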

What breaks without a dataset abstraction

If S3 does not know what a dataset is, a lot of basic storage-management problems become harder than they should be.

If you look at a bucket, you cannot natively get a list of the datasets it contains, along with their size, cost, growth, ownership, or usage. You can get prefix-level size and growth via Storage Lens or S3 Metadata, but those tools still do not tell you which prefix is the actual dataset, which is an organizational container, and which is implementation detail.
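To be concrete about what those tools do give you: starting from an inventory-style listing of (key, size) pairs, rolling sizes up to every prefix depth takes only a few lines (a sketch with illustrative names below). What no rollup can tell you is which of the resulting rows is a dataset.

```python
from collections import defaultdict

def rollup_sizes(objects):
    """Aggregate object sizes at every prefix depth.
    `objects` is an iterable of (key, size_bytes) pairs,
    e.g. parsed from an S3 Inventory report."""
    totals = defaultdict(int)
    for key, size in objects:
        parts = key.split("/")[:-1]  # drop the object name, keep prefix segments
        for depth in range(1, len(parts) + 1):
            totals["/".join(parts[:depth]) + "/"] += size
    return dict(totals)

inventory = [
    ("lake/analytics/sessions_v1/year=2025/part-00000.parquet", 400),
    ("lake/analytics/sessions_v1/year=2025/part-00001.parquet", 600),
    ("lake/raw/events.log", 50),
]
sizes = rollup_sizes(inventory)
print(sizes["lake/"])                        # 1050
print(sizes["lake/analytics/sessions_v1/"])  # 1000
```

Every prefix gets a number, including the organizational ones and the partitions. The table is complete and still answers none of the questions that matter.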

You also cannot archive, restore, or delete a dataset as a unit.

You can do some of these things to a prefix, but that only gets you so far, because S3 does not tell you which prefix is the meaningful boundary of a dataset in the first place. Given a particular object, it does not tell you what level is the right one to reason about. Sometimes that level is obvious. Often it is not.

A prefix may encode some semantic meaning, but usually not enough. Take the earlier example of:

/lake/analytics/sessions_v1/

What is it, exactly? Who created it? Is it still in use? How does it differ from v2 or v3? Is it a replacement, a backfill, an experiment, or a dead version nobody cleaned up?

S3 has nowhere natural to store that kind of metadata at the dataset level.

You can attach tags to a bucket, but a bucket often contains many unrelated datasets. You can also attach tags to individual objects, but that is not the same abstraction. Duplicating shared metadata across every object in a dataset is expensive, cumbersome, and easy to get wrong. Even if you do it, keeping it consistent over time becomes its own problem.

So in practice, the metadata that matters tends to live somewhere else: in naming conventions, in code, in spreadsheets, in internal systems, or simply in people’s heads. Over time, that knowledge fragments and decays.

Why adjacent tools fall short

Catalogs are one partial answer, especially in the tabular world. But they usually require each dataset to be registered explicitly. In practice, teams only do this for the subset of datasets that are actively used and worth referring to by name rather than by S3 path.

That leaves a large part of the estate uncovered.

And even for the datasets they do cover, catalogs are usually disconnected from the physical storage layer. They typically do not tell you what a dataset costs, how large it is, or how it is growing. They are rarely the place where you safely archive, tier, or delete the underlying storage. The catalog is a parallel naming system you keep in sync with storage. It describes datasets but does not participate in managing them.

The hardest part is often not enriching a known dataset. It is discovering that the dataset exists at all. Which is why a registry layered above storage cannot really close this gap. It assumes the dataset already exists and just needs a name. Most data in object storage was never declared.

S3 Tables, AWS’s managed Iceberg offering, gets closer for tabular data. A table is a first-class storage primitive you can list, query, and manage as a unit. But it requires explicit table creation, lives in a separate bucket type, and leaves the rest of the estate — logs, media, exports, backups, ad hoc data — untouched.

Security tools touch a different slice of this problem. They look for sensitive material and verify that it sits behind the right controls and access policies. But that still does not tell you why the data is there, what broader dataset it belongs to, whether it is still needed, or whether keeping it around is the right decision. A dataset can be perfectly secured and still be unnecessary, orphaned, or poorly understood.

Each of these tools approaches the gap from a different angle. None of them close it, because S3 itself does not have the dataset as a primitive.

Which is another way of saying: the system has objects, but the organization thinks in datasets.

The gap is structural

Netflix, managing exabytes of data in S3, presented at re:Invent on how they layer custom ETL on top of Storage Lens to roll up storage data at every prefix depth. Pinterest, with nearly an exabyte of data across billions of objects, built an in-house pipeline on top of S3 Inventory and access logs to track storage footprint, access patterns, and ownership across all their datasets.

Companies that big can absorb the cost of building their own layers on top of S3. Most cannot. The shape of the pain is the same; the difference is whether you have the engineering capacity to paper over it. The gap is structural, not organizational.

The same gap shows up in Google Cloud Storage and Azure Blob Storage. Different platforms, same missing abstraction.

Cost is usually a symptom

Organizations should be able to account for the data they have. Not just the important datasets, and not just the ones that are actively queried, but eventually all of it. Right now, the tooling is not good enough to make that practical. Even teams that deeply value metadata and governance struggle, because the tools are not operating at the right abstraction.

FinOps tools address part of this problem, but they mostly approach it through the lens of cost reduction. Most stop at the bucket level, some operate at the object level (for example, Intelligent-Tiering), and a select few have basic prefix-level capabilities. Almost none have a notion of the dataset as a first-class unit, or a place to attach the broader metadata that would make a prefix understandable in context.

My experience with these tools is that people usually use them to detect the worst offenders: look at some reports, find a few prefixes that cost too much, try to figure out whether they are still needed, and then delete or archive what seems safe. Celebrate the 10% reduction. Six months later, when the budget is blown again, the exercise repeats.

Cost is usually a symptom of an underlying management and governance problem. The core issue is that you have data sitting around that nobody can confidently account for. You may not know what it is, who created it, whether it is still being accessed, or when it might be needed again.

As Corey Quinn put it on Screaming in the Cloud:

"You had a data scientist copy away five petabytes to do a quick experiment for two weeks. She left the company in 2012 and, oops, a doozy. We probably should have cleaned that up before."

And even when it does come onto your radar, that still does not mean you can safely delete it. If nobody understands what it is, the safer move is often to move it into Glacier and move on.

But that is really just sweeping the problem under the rug. It may reduce the immediate cost pain, but it makes it more likely that the same unknown data will still be around five years from now. If you do not know what it is today, you surely won’t know then.

What seems missing

What seems missing is a layer that discovers the meaningful datasets already inside a bucket and treats them as first-class entities. It would let you attach metadata — some of it inferred automatically — and operate on storage at that level. Not a parallel registry layered on top of storage, but something connected to it: grounded in the physical layout rather than detached from it.

The structure needed to distinguish a real dataset from an organizational prefix is mostly already in the data — partition formats, file layouts, naming conventions, access patterns. It just has not been turned into a layer anyone can use.
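As a toy illustration of what such discovery might look like, here is one possible heuristic over a flat key listing: cut each key at the first partition-looking segment (Hive-style key=value, or a date), and treat the surviving prefixes as candidate dataset roots. A real system would need far more signals than this, and everything below is an assumption, not a description of any existing tool.

```python
import re
from collections import Counter

# Segments that look like partition layout: "key=value" or a date-ish token.
PARTITION = re.compile(r"^[\w.\-]+=[^/]*$|^\d{4}(-\d{2}){0,2}$")

def candidate_dataset_roots(keys, min_objects=2):
    """Guess dataset roots from a flat key listing: for each key, take the
    prefix up to the first partition-style segment (or the object's parent
    directory if there is none), then keep prefixes backed by enough objects."""
    roots = Counter()
    for key in keys:
        parts = key.split("/")[:-1]  # prefix segments only
        cut = len(parts)
        for i, segment in enumerate(parts):
            if PARTITION.match(segment):
                cut = i
                break
        if cut:
            roots["/".join(parts[:cut]) + "/"] += 1
    return {root for root, n in roots.items() if n >= min_objects}

keys = [
    "lake/analytics/sessions_v1/year=2025/month=11/part-00000.parquet",
    "lake/analytics/sessions_v1/year=2025/month=12/part-00000.parquet",
    "lake/analytics/sessions_v1/year=2025/month=12/part-00001.parquet",
    "backups/db/2025-11-02/dump.sql.gz",
    "backups/db/2025-11-03/dump.sql.gz",
    "scratch/tmp.csv",
]
print(sorted(candidate_dataset_roots(keys)))
# → ['backups/db/', 'lake/analytics/sessions_v1/']
```

Two regexes and a counter already separate the partitioned table and the dated backup set from a stray scratch file. Access patterns, file formats, and writer identity would sharpen the boundaries further.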

And whatever boundaries that layer identifies need to be durable — across ingestion runs, format changes, reorganizations. A boundary that gets re-derived from scratch every week is not an entity. It is a guess.

That layer has to work on existing buckets, with existing data, without requiring every dataset to be registered manually or every writer to be changed in advance. Discovery has to come first. Registration, if it exists at all, should come later.

I’ve been thinking about this problem for a while, and I’m now building something around it. More on that when there’s more to say. In the meantime, if you’ve dealt with this kind of mess in your own storage estate — or if anything here resonates and you’d want to talk — I’d be glad to hear from you at [email protected].


