I'm confused about prefixes and sharding:

> The files are stored on a physical drive somewhere and indexed someplace else by the entire string app/events/ - called the prefix. The / character is really just a rendered delimiter. You can actually specify whatever you want to be the delimiter for list/scan apis.

> Anyway, under the hood, these prefixes are used to shard and partition data in S3 buckets across whatever wires and metal boxes in physical data centers. This is important because prefix design impacts performance in large scale high volume read and write applications.

If the delimiter is not set at bucket creation time, but rather can be specified whenever you do a list query, how can the prefix be used to influence where objects are physically stored? Doesn't the prefix depend on what delimiter you use? How can the sharding logic know what the prefix is if it doesn't know the delimiter in advance?

For example, if I have a path like `app/events/login-123123.json`, how does S3 know the prefix is `app/events/` without knowing that I'm going to use `/` as the delimiter?
A fictitious example which is close to reality:
In parallel, you write a million objects each to:
The shortest prefixes that evenly divide writes are thus:

If you had an existing access pattern of evenly writing to:

The shortest prefixes would be:

So suddenly writing 3 million objects that begin with a `t` would cause an uneven load, or hotspot, on the backing shards. The system notices your new access pattern, determines new prefixes, and moves data around to accommodate what it thinks your needs are.
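As a rough illustration of the hotspot idea, here is a toy model in Python. It is only a sketch under loud assumptions: the split points, the key names (`a/`, `k/`, `s/`, `t1/` and so on), and the functions `shard_for` and `load_per_shard` are all invented for this example, the counts are scaled down to thousands, and nothing here is S3's actual partitioning algorithm. It only shows that when keys are divided by leading characters, a burst of keys sharing a first letter lands on one shard until the split points are refined.

```python
from bisect import bisect_right
from collections import Counter


def shard_for(key, boundaries):
    """Index of the key range holding `key` (hypothetical toy model).

    `boundaries` is a sorted list of split points. Shard i holds keys k
    with boundaries[i-1] <= k < boundaries[i].
    """
    return bisect_right(boundaries, key)


def load_per_shard(keys, boundaries):
    """Count how many keys fall into each key range."""
    counts = Counter(shard_for(k, boundaries) for k in keys)
    return [counts.get(i, 0) for i in range(len(boundaries) + 1)]


# Existing, evenly spread writes (invented key names).
old_keys = [f"{p}/{i}" for p in ("a", "k", "s") for i in range(1000)]
# A sudden burst of keys that all begin with "t".
burst_keys = [f"{p}/{i}" for p in ("t1", "t2", "t3") for i in range(1000)]

coarse = ["h", "p"]                        # short split points
refined = ["h", "p", "t", "t2/", "t3/"]    # longer split points after a re-split

before = load_per_shard(old_keys, coarse)
hotspot = load_per_shard(old_keys + burst_keys, coarse)
rebalanced = load_per_shard(old_keys + burst_keys, refined)

print(before)      # even load
print(hotspot)     # the last range absorbs the whole burst
print(rebalanced)  # even again once the split points get longer
```

Note that no delimiter appears anywhere in this model: the split points are plain strings compared against the full key.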
The delimiter is just a wildcard option. The system is essentially just a key-value store. Specifying a delimiter tells the system to transform a delimiter at the end of a list query into a pattern match.
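To see why the delimiter cannot influence storage, it may help to model a list query over a flat set of keys. The sketch below is a hypothetical in-memory stand-in, not the S3 API: `list_keys` and every key other than `app/events/login-123123.json` are made up for illustration. It mimics the documented list behavior, where keys containing the delimiter after the requested prefix are rolled up into a single "common prefix" entry.

```python
def list_keys(keys, prefix="", delimiter=None):
    """Toy model of a prefix/delimiter list query over a flat key space."""
    contents = []         # keys returned as individual objects
    common_prefixes = []  # rolled-up entries (what a console shows as folders)
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll up everything through the first delimiter after the prefix.
            rolled = prefix + rest.split(delimiter, 1)[0] + delimiter
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
        else:
            contents.append(key)
    return contents, common_prefixes


# The stored keys are just strings; nothing about them records a delimiter.
KEYS = [
    "app/events/login-123123.json",
    "app/events/logout-456.json",  # hypothetical
    "app/config.json",             # hypothetical
    "readme.txt",                  # hypothetical
]

# Same stored data, three different query-time views.
print(list_keys(KEYS, prefix="app/", delimiter="/"))
# (['app/config.json'], ['app/events/'])
print(list_keys(KEYS, prefix="app/events/", delimiter="-"))
# ([], ['app/events/login-', 'app/events/logout-'])
print(list_keys(KEYS, prefix="app/events/"))
# (['app/events/login-123123.json', 'app/events/logout-456.json'], [])
```

The stored keys never change between the three calls; only the grouping in the response does. In this model the delimiter is purely a property of the query, so whatever decides physical placement has only the raw key string to work with.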