by ebingdom 1589 days ago
I'm confused about prefixes and sharding:

> The files are stored on a physical drive somewhere and indexed someplace else by the entire string app/events/ - called the prefix. The / character is really just a rendered delimiter. You can actually specify whatever you want to be the delimiter for list/scan apis.

> Anyway, under the hood, these prefixes are used to shard and partition data in S3 buckets across whatever wires and metal boxes in physical data centers. This is important because prefix design impacts performance in large scale high volume read and write applications.

If the delimiter is not set at bucket creation time, but rather can be specified whenever you do a list query, how can the prefix be used to influence where objects are physically stored? Doesn't the prefix depend on what delimiter you use? How can the sharding logic know what the prefix is if it doesn't know the delimiter in advance?

For example, if I have a path like `app/events/login-123123.json`, how does S3 know the prefix is `app/events/` without knowing that I'm going to use `/` as the delimiter?

4 comments

The prefix isn't delimited, it's an arbitrary length based on access patterns.

A fictitious example which is close to reality:

In parallel, you write a million objects each to:

   tomato/red/...
   tomato/green/...
   tomatoes/colors/...
The shortest prefixes that evenly divide writes are thus

   tomato/r
   tomato/g
   tomatoes
If you had an existing access pattern of evenly writing to

   tomatoes/colors/...
   bananas/...
The shortest prefixes would be

   t
   b
So suddenly writing 3 million objects that begin with a t would cause an uneven load or hotspot on the backing shards. The system realizes your new access pattern and determines new prefixes and moves data around to accommodate what it thinks your needs are.
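To make the tomato example concrete, here is a minimal sketch that finds, for each group of keys, the shortest leading substring no other group shares. It is a toy model only: the function name is made up for illustration, it assumes equal write volume per group, and it says nothing about how S3 actually picks partition boundaries. Note that a strict "shortest" search yields `tomatoe` for the third group, one character shorter than the `tomatoes` given above.

```python
def shortest_distinguishing_prefixes(groups):
    """Map each key group to the shortest leading substring that no other group starts with."""
    result = {}
    for group in groups:
        others = [g for g in groups if g != group]
        for n in range(1, len(group) + 1):
            candidate = group[:n]
            if not any(other.startswith(candidate) for other in others):
                result[group] = candidate
                break
        else:
            # No shorter prefix separates this group, so fall back to the whole string.
            result[group] = group
    return result


print(shortest_distinguishing_prefixes(["tomato/red/", "tomato/green/", "tomatoes/colors/"]))
print(shortest_distinguishing_prefixes(["tomatoes/colors/", "bananas/"]))
```

The second call shows the earlier access pattern, where a single character is enough to split the load.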

--

The delimiter is just a wildcard option. The system is just a key value store, essentially. Specifying a delimiter tells the system to transform delimiters at the end of a list query like

   my/path/
into a pattern match like

   my/path/[^/]+/?
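A rough in-memory model of that list behaviour, assuming the bucket is nothing more than a flat set of key strings. The function `list_objects` and its return shape are invented for this sketch and are not the real S3 API; the roll-up of keys into common prefixes follows the documented ListObjects semantics.

```python
def list_objects(keys, prefix="", delimiter=None):
    """Toy listing over a flat key set: returns (contents, common_prefixes)."""
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue  # the prefix is only a filter on the start of the key
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll everything up to the next delimiter into one "folder" entry.
            rolled = prefix + rest.split(delimiter, 1)[0] + delimiter
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
        else:
            contents.append(key)
    return contents, common_prefixes


keys = ["my/path/a.json", "my/path/sub/b.json", "my/path/sub/c.json", "my/other/d.json"]
print(list_objects(keys, prefix="my/path/", delimiter="/"))
print(list_objects(keys, prefix="my/path/"))
```

The stored keys never change between the two calls; only the presentation of the result does, and any other delimiter character would work the same way.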
Thank you! This is the first explanation that I think fully explains what I was confused about. So essentially the prefix is just the first N bytes of the object's name, where N is a per-bucket number that S3 automatically decides and adjusts for you. And it has nothing to do with delimiters.

I find the S3 documentation and API to be really confusing about this. For example, when listing objects, you get to specify a "prefix". But this seems to be not directly related to the automatically-determined prefix length based on your access patterns. And [1] says things like "There are no limits to the number of prefixes in a bucket.", which makes no sense to me given that the prefix length is something that S3 decides under the hood for you. Like, how do you even know how many prefixes your bucket has?

[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi...

The sharding key is an implementation detail, so you're not supposed to care about it too much.
That's true now. Used to be the case that they'd recommend random or high-entropy parts of the keys go at the beginning to avoid overloading a shard as you described above.

From [0]:

> This S3 request rate performance increase removes any previous guidance to randomize object prefixes to achieve faster performance. That means you can now use logical or sequential naming patterns in S3 object naming without any performance implications. This improvement is now available in all AWS Regions. For more information, visit the Amazon S3 Developer Guide.

[0]: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3...

Indeed, and unfortunately my mind will forever work this way.
It is related, in the sense that both “prefixes” are a substring match anchored at the start of the object name. They’re just not the same mechanism.
> So suddenly writing 3 million objects that begin with a t would cause an uneven load or hotspot on the backing shards.

makes sense

> The system realizes your new access pattern and determines new prefixes and moves data around to accommodate what it thinks your needs are.

What does "determines new prefixes" mean? Obviously AWS isn't going to come up with new prefixes and change object names.

So does AWS maintain prefix-surrogates (prefix sub-string(0,?) references) and those are what actually gets shuffled around to handle the new unbalanced workload? Sort of like resharding?

Moreover, since it's really prefix-surrogates being used, the recommendation of randomizing prefixes can be replaced with randomizing prefix-surrogates and delegated to AWS, removing the prior responsibility from the customer. Hence the 2018 announcement https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3...

There’s no delimiter. There is only the appearance of a delimiter, to appease folks who think S3 is a filesystem, and fool them into thinking they’re looking at folders.

The object name is the entire label, and every character is equally significant for storage. When listing objects, a prefix filters the list. That’s all. However, S3 also uses substrings to partition the bucket for scale. Since they’re anchored at the start, they’re also called prefixes.

In my view, it’s best to think of S3’s object indexing as a radix tree.

This article, as if you couldn’t guess from the content, is written from a position of scant knowledge of S3, so it's not surprising that it misrepresents the details.

So if I have a bunch of objects whose names are hashes like 2df6ad6ca44d06566cffde51155e82ad0947c736 that I expect to access randomly, is there any performance benefit to introducing artificial delimiters like 2d/f6/ad6ca44d06566cffde51155e82ad0947c736? I've seen this used in some places.
To AWS S3, '/' isn't a delimiter, it's a character that's part of the filename.

So for instance "/foo/bar.txt" and "/foo//bar.txt" are different files in S3, even though they'd be the same file in a filesystem.

This gets pretty fun if you want to mirror an S3 structure on-disk, because the above suddenly causes a collision.
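A small sketch of how one might detect that collision before mirroring, assuming a POSIX-style filesystem where repeated slashes collapse. The helper name `find_disk_collisions` is hypothetical; `posixpath.normpath` is used here only as an approximation of what the filesystem would do.

```python
import posixpath
from collections import defaultdict


def find_disk_collisions(keys):
    """Group object keys that would normalize to the same on-disk path."""
    by_path = defaultdict(list)
    for key in keys:
        # normpath collapses "//" into "/", the way a POSIX filesystem resolves the path.
        by_path[posixpath.normpath(key)].append(key)
    return {path: ks for path, ks in by_path.items() if len(ks) > 1}


print(find_disk_collisions(["/foo/bar.txt", "/foo//bar.txt", "/foo/baz.txt"]))
# {'/foo/bar.txt': ['/foo/bar.txt', '/foo//bar.txt']}
```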

No difference other than readability. And Amazon may partition your objects on another prefix anyway, like "2d/f6/ad6c"
I don't know what impact that partitioning pattern has on s3, but it has some obvious benefits if your app needs to revert to write to a normal filesystem instead (like for testing).
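For reference, the naming scheme from the question can be sketched as a pair of helpers. The names `shard_key` and `unshard_key` and the default split widths are made up for illustration; per the replies above, this is about readability and filesystem friendliness, not S3 performance.

```python
def shard_key(name, widths=(2, 2), sep="/"):
    """Split the leading characters of a hash-like name into path segments."""
    parts, pos = [], 0
    for width in widths:
        parts.append(name[pos:pos + width])
        pos += width
    parts.append(name[pos:])  # the remainder stays in one piece
    return sep.join(parts)


def unshard_key(key, sep="/"):
    """Recover the original name; assumes the name itself never contains sep."""
    return key.replace(sep, "")


digest = "2df6ad6ca44d06566cffde51155e82ad0947c736"
print(shard_key(digest))
# 2d/f6/ad6ca44d06566cffde51155e82ad0947c736
```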
>There’s no delimiter.

What's the delimiter parameter for then?

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObje...

To provide a consistent API response as part of the ListObjects call. It has nothing to do with the storage on disk.
To help you fool yourself. It affects how object list results are presented in the api response.
"To help you fool yourself" seems like a euphemism for "to fool you". It's gotta be tough to go from "scant knowledge of S3" to genuine knowledge if the documentation is doing this to you.

If the docs are misrepresenting the details, who can blame the author of the post?

The documentation is very clear on the purpose of the delimiter parameter.

The OP does not read the docs, makes bad assumptions repeatedly throughout, and then reaps the consequences.

They can’t present a directory abstraction for list operations without a delimiter. E.g. CommonPrefixes.
This is where GCP's GCS (Google Cloud Storage) shines.

You don't need to mess with prefixing all your files. They auto level the cluster for you [1].

[1] https://cloud.google.com/storage/docs/request-rate#redistrib...

AWS does the optimizations over time based on access patterns for the data. Should have made that clearer in the article.

The problem becomes unusual burst load - usually from infrequent analytics jobs. The indexing can't respond fast enough.

Thanks for the clarification. But now I'm confused about the limits:

> 3,500 PUT/COPY/POST/DELETE requests per second per prefix

> 5,500 GET/HEAD requests per second per prefix

Most of those APIs don't even take a delimiter. So for these limits, does the prefix get inferred based on whatever delimiter you've used for previous list requests? What if you've used multiple delimiters in the past?

Basically what I'm trying to determine is whether these limits actually mean something concrete (that I can use for capacity planning etc.), or whether their behavior depends on heuristics that S3 uses under the hood.

I'm fine with S3 optimizing things under the hood based on my access patterns, but not if it means I can't reason about these limits as an outsider.

S3 does a lot of under-the-hood optimisation. For example, create a brand new bucket, leave it cold for a while, and start throwing 100 PUT requests a second at it. This is way less than the advertised 3500, but they'll have scaled the allocated resources down so much you'll get some TooManyRequests errors.
Those are what I would assume for performance when the system is stable. The concerns come from bursty behaviour — for example, if you put something new into production you might have a period of time while S3 is adjusting behind the scenes where you'll get transient errors from some operations before it stabilizes (these have almost always been resolved by retry in my experience). This is reportedly something your AWS TAM can help with if you know in advance that you're going to need to handle a ton of traffic and have an idea of what the prefix distribution will be like — apparently the S3 support team can optimize the partitioning for you in preparation.
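The "resolved by retry" part can be sketched as a generic helper with exponential backoff and jitter. Everything here is an assumption for illustration: `TransientError` is a hypothetical stand-in for whatever throttling error your client raises, and `with_retries` is not part of any AWS SDK (the official SDKs ship their own retry logic).

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a throttling response from the storage service."""


def with_retries(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Upper bound doubles each attempt; the random draw spreads out retries.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

The `sleep` parameter exists only so the delay can be swapped out in tests; in normal use the default `time.sleep` applies.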
Delimiter isn’t used for writes, only list operations.

S3 simply looks at the common string prefixes in your object names and uses that to internally shard objects, so you can achieve a multiple of those request limits.

aaa122348

aaa484585

bbb484858

bbb474827

Would have same performance as:

aaa/122348

aaa/484585

bbb/484858

bbb/474827
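As a back-of-the-envelope check, the sketch below counts distinct leading substrings in both key sets and multiplies by the per-prefix limits quoted earlier in the thread. It assumes, optimistically, that S3 has already split the bucket at exactly that prefix length, which is not something you can observe or control from outside; the function name and the fixed `prefix_len` are illustrative only.

```python
PUT_LIMIT_PER_PREFIX = 3500  # PUT/COPY/POST/DELETE per second, as quoted above
GET_LIMIT_PER_PREFIX = 5500  # GET/HEAD per second, as quoted above


def aggregate_limits(keys, prefix_len):
    """Estimate best-case request limits if each distinct leading substring is its own partition."""
    n = len({key[:prefix_len] for key in keys})
    return {
        "prefixes": n,
        "writes_per_sec": n * PUT_LIMIT_PER_PREFIX,
        "reads_per_sec": n * GET_LIMIT_PER_PREFIX,
    }


flat = ["aaa122348", "aaa484585", "bbb484858", "bbb474827"]
slashed = ["aaa/122348", "aaa/484585", "bbb/484858", "bbb/474827"]
print(aggregate_limits(flat, 3))
print(aggregate_limits(slashed, 3))
```

Both calls give the same numbers, since the slash is just another character in the key.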