Is S3 actually hierarchical? My mental model has always been that the S3 object namespace within a bucket is flat, and that treating '/' as special is only a convenient fiction presented by the tooling, which is consistent with the claim in this article.
Followup question: Is there something special about the PRE notations in the example output below? I can list objects by any textual prefix, but I can't tell if the PRE (what we think of as folders) is more efficient than just the substring prefix.
Full bucket listing, then two text-prefix listings, then an (empty) folder listing:
sokoloff@ Downloads % aws s3 ls s3://foo-asdf
PRE bar-folder/
PRE baz-folder/
2022-02-17 09:25:38 0 bar-file-1.txt
2022-02-17 09:25:42 0 bar-file-2.txt
2022-02-17 09:25:57 0 baz-file-1.txt
2022-02-17 09:25:49 0 baz-file-2.txt
sokoloff@ Downloads % aws s3 ls s3://foo-asdf/ba
PRE bar-folder/
PRE baz-folder/
2022-02-17 09:25:38 0 bar-file-1.txt
2022-02-17 09:25:42 0 bar-file-2.txt
2022-02-17 09:25:57 0 baz-file-1.txt
2022-02-17 09:25:49 0 baz-file-2.txt
sokoloff@ Downloads % aws s3 ls s3://foo-asdf/bar
PRE bar-folder/
2022-02-17 09:25:38 0 bar-file-1.txt
2022-02-17 09:25:42 0 bar-file-2.txt
sokoloff@ Downloads % aws s3 ls s3://foo-asdf/bar-folder
PRE bar-folder/
I don't understand the answer to that question either. Other AWS docs say you can choose whatever you want for a delimiter; there's nothing special about `/`. So how does that apply to what they say about performance and "prefixes"?
> For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second.
Related to your question, even if we just stick to `/` because it seems safer, does that mean that "foo/bar/baz/1/" and "foo/bar/baz/2/" are two prefixes for the point of these request speed limits? Or does the "prefix" stop at the first "/" and files with these keypaths are both in the same "prefix" "foo/"?
Note there was (according to docs) a change a couple years ago that I think some people haven't caught on to:
> For example, previously Amazon S3 performance guidelines recommended randomizing prefix naming with hashed characters to optimize performance for frequent data retrievals. You no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes.
The ListObjects API omits all objects that share a prefix ending in the delimiter, and instead puts that prefix into the CommonPrefixes element, which is what the PRE lines reflect. (So with a delimiter of '/', it basically hides objects in "subfolders", but lists any subfolders that match your partial text in the CommonPrefixes element.)
By default `aws s3 ls` does not show any objects under a common prefix; it simply shows a PRE line for each one. The CLI does not let you specify a delimiter, it always uses '/'. To actually list all objects you need to use `--recursive`.
The output there would suggest that bucket really did have object names that began with `bar-folder/`, and that last line did not list them out because you did not include the trailing slash. Without the trailing slash it was just listing objects and CommonPrefixes that match the string you specified after the last delimiter in your url. Since only that one common prefix matched, only it was printed.
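To make the grouping described above concrete, here is a rough local simulation in Python. It is only a sketch of the prefix-plus-delimiter behaviour, not the S3 API itself: `list_with_delimiter` is a made-up helper, and the key list (including `bar-folder/inner.txt`) is a hypothetical bucket, not necessarily what is in `foo-asdf`.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Simulate a delimiter-based listing over a flat list of keys.

    Returns (objects, common_prefixes). Keys whose remainder after
    `prefix` contains the delimiter are rolled up into a common prefix
    instead of being listed individually.
    """
    objects, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # With no delimiter, nothing is rolled up.
        idx = rest.find(delimiter) if delimiter else -1
        if idx == -1:
            objects.append(key)
        else:
            rolled_up = prefix + rest[:idx + len(delimiter)]
            if rolled_up not in common_prefixes:
                common_prefixes.append(rolled_up)
    return objects, common_prefixes


# Hypothetical flat key space; the "folders" are just keys containing '/'.
keys = [
    "bar-file-1.txt", "bar-file-2.txt",
    "baz-file-1.txt", "baz-file-2.txt",
    "bar-folder/", "bar-folder/inner.txt", "baz-folder/",
]

for prefix in ("", "bar", "bar-folder", "bar-folder/"):
    objs, pres = list_with_delimiter(keys, prefix)
    print(repr(prefix), "PRE:", pres, "objects:", objs)
```

In this simulation the prefix `bar-folder` (no trailing slash) yields only the rolled-up `bar-folder/` entry, while `bar-folder/` yields the keys underneath it, which matches the explanation of the last `aws s3 ls` call.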
There are some tricks on Linux. For example using mv into a trash dir instead of rm. I’ve also seen some successful use of rsync that does real deletion many times faster than rm -rf, not sure why but guessing some parallelism is involved.
Google for this problem. There are surprisingly many creative ideas, many of which are, also surprisingly, a lot better than the built-in rm command.
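The "mv into a trash dir" trick can be sketched in Python like this. The function names and the trash layout are made up for illustration, and it assumes the trash directory lives on the same filesystem as the thing being removed (otherwise `os.rename` raises `OSError` instead of doing a cheap rename). The slow part, actually visiting every entry, still happens, just later and out of the way.

```python
import os
import shutil
import uuid


def move_to_trash(path, trash_dir):
    """Rename `path` into `trash_dir` so it disappears from its old location quickly."""
    os.makedirs(trash_dir, exist_ok=True)
    name = os.path.basename(path.rstrip(os.sep))
    # A random suffix avoids collisions between items with the same name.
    target = os.path.join(trash_dir, f"{name}-{uuid.uuid4().hex}")
    os.rename(path, target)  # a single rename when on the same filesystem
    return target


def empty_trash(trash_dir):
    """Do the real (slow) deletion later, e.g. from a background job."""
    for entry in os.listdir(trash_dir):
        full = os.path.join(trash_dir, entry)
        if os.path.isdir(full) and not os.path.islink(full):
            shutil.rmtree(full)
        else:
            os.remove(full)
```

A caller would do something like `move_to_trash("/data/big-dir", "/data/.trash")` right away and run `empty_trash("/data/.trash")` whenever convenient; both paths here are placeholders.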
I believe it's mostly a problem of latency between your machine and S3, since each Delete call is issued separately in its own HTTP connection.
1. Try parallelization of your calls. Deleting 20 objects in parallel should take the same time as deleting 1.
2. Try to run deletion from an AWS machine in the same region as the S3 bucket (yes buckets are regional, only their names are global). Within-datacenter latency should be lower than between your machine and datacenter.
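A minimal sketch of the parallelization suggested in point 1, in Python. `delete_batch` is a hypothetical callable supplied by the caller, so nothing here talks to S3 directly. Batching is an addition that is not in the comment above: S3's multi-object delete accepts up to 1,000 keys per request, so with boto3 `delete_batch` could plausibly wrap `client.delete_objects(...)`, but that wiring is an assumption and is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(keys, size):
    """Yield consecutive slices of `keys` with at most `size` items each."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def delete_in_parallel(keys, delete_batch, batch_size=1000, workers=20):
    """Call `delete_batch(batch)` concurrently and return the total deleted.

    `delete_batch` is a hypothetical function that deletes the given keys
    and returns how many it removed.
    """
    batches = list(chunked(list(keys), batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads suit this workload: each call mostly waits on the network.
        counts = list(pool.map(delete_batch, batches))
    return sum(counts)
```

With 20 workers, 20 requests are in flight at once, so the wall-clock time is dominated by the number of round trips divided by the worker count rather than by the raw object count.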
(This is a good example of where garbage collection wins over schemes that track references explicitly, like reference counting. A garbage collector can just throw away the reference, while other schemes need to visit every leaf, resulting in hours of deletion time in some cases.)
The issue is that this isn't free. I played around and ended up with an S3 bucket holding a few hundred million objects on a personal project, and I'm trying to get rid of it without getting a bill. Seriously considering just getting suspended from AWS if that's a viable path lol.
The filesystem is hierarchical, but the delete operation still needs to visit all the leaves.