by Tehchops 1589 days ago
We’ve got data in S3 buckets not nearly at that scale and managing them, god forbid trying a mass delete, is absolute tedium.

Mass delete also takes an eternity on my Linux desktop machine.

The filesystem is hierarchical, but the delete operation still needs to visit all the leaves.

Is S3 actually hierarchical? My mental model has always been that the S3 object namespace within a bucket is flat, and that the special treatment of '/' is only a convenient fiction presented by the tooling, which is consistent with the claim in this article.
This is mostly correct, with the additional feature that S3 can efficiently list objects by "key prefix" which helps preserve the illusion.
Follow-up question: is there something special about the PRE entries in the example output below? I can list objects by any textual prefix, but I can't tell whether listing by a PRE (what we think of as a folder) is more efficient than listing by an arbitrary substring prefix.

Full bucket listing, then two text-prefix listings, then an (empty) folder listing:

  sokoloff@ Downloads % aws s3 ls s3://foo-asdf            
                             PRE bar-folder/
                             PRE baz-folder/
  2022-02-17 09:25:38          0 bar-file-1.txt
  2022-02-17 09:25:42          0 bar-file-2.txt
  2022-02-17 09:25:57          0 baz-file-1.txt
  2022-02-17 09:25:49          0 baz-file-2.txt
  sokoloff@ Downloads % aws s3 ls s3://foo-asdf/ba
                             PRE bar-folder/
                             PRE baz-folder/
  2022-02-17 09:25:38          0 bar-file-1.txt
  2022-02-17 09:25:42          0 bar-file-2.txt
  2022-02-17 09:25:57          0 baz-file-1.txt
  2022-02-17 09:25:49          0 baz-file-2.txt
  sokoloff@ Downloads % aws s3 ls s3://foo-asdf/bar
                             PRE bar-folder/
  2022-02-17 09:25:38          0 bar-file-1.txt
  2022-02-17 09:25:42          0 bar-file-2.txt
  sokoloff@ Downloads % aws s3 ls s3://foo-asdf/bar-folder
                             PRE bar-folder/
I don't understand the answer to that question either. Other AWS docs say you can choose whatever you want as a delimiter; there's nothing special about `/`. So how does that apply to what they say about performance and "prefixes"?

Here is some AWS documentation on it:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi...

> For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second.

Related to your question: even if we just stick to `/` because it seems safer, does that mean that "foo/bar/baz/1/" and "foo/bar/baz/2/" are two prefixes for the purposes of these request rate limits? Or does the "prefix" stop at the first "/", so that files with these key paths are both in the same "prefix", "foo/"?

Note that there was (according to the docs) a change a couple of years ago that I think some people haven't caught on to:

> For example, previously Amazon S3 performance guidelines recommended randomizing prefix naming with hashed characters to optimize performance for frequent data retrievals. You no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes.

Umm... that output seems confusing.

The ListObjects API will omit all objects that share a prefix ending in the delimiter, and instead put that prefix into the CommonPrefixes element, which is what the PRE lines reflect. (So with a delimiter of '/', it basically hides objects in "subfolders", but lists any subfolders that match your partial text in the CommonPrefixes element.)

By default `aws s3 ls` will not show any objects within a common prefix; it simply shows a PRE line for each one. The CLI does not let you specify a delimiter; it always uses '/'. To actually list all objects you need to use `--recursive`.

The output there suggests that the bucket really did have object names beginning with `bar-folder/`, and that the last command did not list them because you did not include the trailing slash. Without the trailing slash, it was just listing the objects and common prefixes that match the string you specified after the last delimiter in your URL. Since only that one common prefix matched, only it was printed.
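To make the grouping concrete, here is a rough pure-Python sketch of the behavior described above. It is not the S3 implementation, just a simulation over an in-memory list of keys. The function name `list_objects` is made up for illustration, and the two zero-byte `bar-folder/` and `baz-folder/` marker keys are an assumption about how the "folders" in the example bucket were created.

```python
def list_objects(keys, prefix="", delimiter="/"):
    """Simulate delimiter-based listing over a flat key namespace.

    Returns (contents, common_prefixes), loosely mirroring the Contents
    and CommonPrefixes parts of a listing response. Illustrative only.
    """
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        idx = rest.find(delimiter) if delimiter else -1
        if idx == -1:
            # No delimiter after the prefix: the key is listed directly.
            contents.append(key)
        else:
            # Roll the key up into the prefix ending at the first delimiter.
            rolled_up = prefix + rest[: idx + len(delimiter)]
            if rolled_up not in common_prefixes:
                common_prefixes.append(rolled_up)
    return contents, common_prefixes


# Assumed bucket contents: four files plus two hypothetical folder markers.
KEYS = [
    "bar-file-1.txt", "bar-file-2.txt",
    "baz-file-1.txt", "baz-file-2.txt",
    "bar-folder/", "baz-folder/",
]

for p in ("", "ba", "bar", "bar-folder", "bar-folder/"):
    print(repr(p), list_objects(KEYS, prefix=p))
```

With the prefix `bar-folder` (no trailing slash) the marker key is rolled up into a single common prefix, which is why only a PRE line appears; with `bar-folder/` the key itself is listed.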

Use `delete-objects` instead; it will be much faster, since you can supply up to 1,000 keys to remove in a single API call.

https://awscli.amazonaws.com/v2/documentation/api/latest/ref...
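A minimal sketch of that batching in Python, assuming `client` is a boto3 S3 client (or anything exposing the same `delete_objects` method). The helper names `batches` and `delete_keys` are made up here, and error handling is reduced to counting the entries reported under `Errors`.

```python
from itertools import islice


def batches(keys, size=1000):
    """Yield lists of at most `size` keys from any iterable of keys."""
    it = iter(keys)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def delete_keys(client, bucket, keys):
    """Delete keys in batches of up to 1000 per call.

    `client` is assumed to behave like boto3.client("s3").
    Returns the number of keys not reported as errors.
    """
    deleted = 0
    for chunk in batches(keys):
        response = client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in chunk], "Quiet": True},
        )
        # In quiet mode the response only carries failures, if any.
        deleted += len(chunk) - len(response.get("Errors", []))
    return deleted
```

For 2,500 keys this issues three calls instead of 2,500, which is where the speedup comes from.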

Most recursive deletion routines are not optimized for speed. This could be done much faster with multiple threads or batching the calls via io_uring.

Another option is LVM volumes or btrfs subvolumes, which can be discarded without recursive traversal.
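As a rough illustration of the multi-threaded idea (not io_uring, and not a claim about how any existing tool works), the sketch below removes each top-level entry of a directory on its own worker thread using only the standard library. `parallel_rmtree` is a made-up name, and whether this actually beats `rm -rf` depends on the filesystem and storage.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def parallel_rmtree(root, workers=8):
    """Remove `root`, deleting its top-level entries concurrently."""
    with os.scandir(root) as it:
        entries = list(it)

    def remove(entry):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)  # still walks every leaf below
        else:
            os.unlink(entry.path)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(remove, entries))
    os.rmdir(root)
```

Every leaf is still visited; the threads only overlap the waiting on individual unlink calls.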

There are some tricks on Linux, for example using mv into a trash dir instead of rm. I've also seen rsync used successfully to do real deletion many times faster than rm -rf. I'm not sure why, but I'm guessing some parallelism is involved.

Google this problem. There are surprisingly many creative ideas, and many of them are, also surprisingly, a lot better than the built-in rm command.

I believe it's mostly a problem of latency between your machine and S3, since each Delete call is issued separately in its own HTTP connection.

1. Try parallelizing your calls. Deleting 20 objects in parallel should take about the same time as deleting 1.

2. Try to run the deletion from an AWS machine in the same region as the S3 bucket (yes, buckets are regional; only their names are global). Within-datacenter latency should be lower than the latency between your machine and the datacenter.
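Point 1 could be sketched with a thread pool like this. `delete_in_parallel` is a hypothetical helper, and `delete_one` stands for whatever single-object delete call you use; with boto3 that might be a small wrapper around `client.delete_object(Bucket=..., Key=...)`.

```python
from concurrent.futures import ThreadPoolExecutor


def delete_in_parallel(delete_one, keys, workers=20):
    """Call delete_one(key) for every key using `workers` threads.

    Returns the number of calls that completed; any exception raised
    by delete_one is re-raised here.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(delete_one, keys))
    return len(results)


# Example wiring (assumes `client` is a boto3 S3 client and the bucket exists):
# delete_in_parallel(
#     lambda key: client.delete_object(Bucket="foo-asdf", Key=key),
#     keys,
# )
```

Since the calls are latency-bound rather than CPU-bound, threads are enough here; the worker count of 20 just mirrors the number in the comment above.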

(This is a good example of where garbage collection wins over schemes that track references explicitly, like reference counting. A garbage collector can just throw away the reference, while other schemes need to visit every leaf, resulting in hours of deletion time in some cases.)
Set a lifecycle rule to delete your objects. Come back a day later and AWS will have taken care of this for you.
The issue is that this isn't free. I played around on a personal project and ended up with an S3 bucket holding a few hundred million objects, and I'm trying to get rid of it without getting a bill. Seriously considering just getting suspended from AWS, if that's a viable path lol.
Lifecycle rules are free. Use them to empty the bucket.
"You are not charged for expiration or the storage time associated with an object that has expired."

From: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecy...
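For reference, a lifecycle configuration that expires everything might look roughly like the sketch below. The rule ID and the helper name `expire_everything_rule` are made up, and one day is just an example value. It is written as the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts; versioned buckets may need additional cleanup beyond this.

```python
def expire_everything_rule(days=1):
    """Build a lifecycle configuration that expires all objects.

    The empty prefix filter applies the rule to the whole bucket.
    """
    return {
        "Rules": [
            {
                "ID": "empty-bucket",  # arbitrary, hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": days},
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


# Example wiring (assumes `client` is a boto3 S3 client):
# client.put_bucket_lifecycle_configuration(
#     Bucket="foo-asdf",
#     LifecycleConfiguration=expire_everything_rule(),
# )
```

Expiration runs asynchronously on AWS's side, so the bucket will not be empty the moment the rule is applied.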

Very true: it took me about a month of emptying, deleting and life cycling about a dozen buckets of about 20 TB (~20 million objects) to get to zero.