by lloesche 1589 days ago
I had a similar issue at my last job. Whenever a user created a PR on our open source project, 1GB of artifacts consisting of hundreds of small files would be created and uploaded to a bucket. There was just no process that would ever delete anything. This went on for 7 years and resulted in a multi-petabyte bucket.

I wrote some tooling to help me with the cleanup. It's available on GitHub: https://github.com/someengineering/resoto/tree/main/plugins/... and consists of two scripts, s3.py and delete.py.

It's not exactly meant for end users, but if you know your way around Python/S3 it might help. I built it for a one-off purge of old data. s3.py takes a `--aws-s3-collect` arg to create the index. It lists one or more buckets and can store the result in a sqlite file. In my case the listing of the bucket took almost a week to complete and resulted in an 80GB sqlite file.
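
To give an idea of what such an index step looks like, here is a minimal sketch. It is not the actual s3.py code: the function names, the `objects` table and its columns are assumptions made up for illustration. It pages through a bucket with boto3's `list_objects_v2` paginator and writes the rows to sqlite in batches.

```python
import sqlite3
from typing import Iterable, Iterator


def list_objects(bucket: str) -> Iterator[dict]:
    """Yield one dict per object in the bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so store_index() can be used without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is missing when a page has no objects
        for obj in page.get("Contents", []):
            yield {
                "bucket": bucket,
                "key": obj["Key"],
                "size": obj["Size"],
                "last_modified": obj["LastModified"].isoformat(),
            }


def store_index(db: sqlite3.Connection, objects: Iterable[dict],
                batch_size: int = 10_000) -> int:
    """Write object records into a (hypothetical) `objects` table, return row count."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "bucket TEXT, key TEXT, size INTEGER, last_modified TEXT, "
        "PRIMARY KEY (bucket, key))"
    )
    sql = "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)"
    total = 0
    batch = []
    for obj in objects:
        batch.append((obj["bucket"], obj["key"], obj["size"], obj["last_modified"]))
        if len(batch) >= batch_size:
            # Commit per batch so an interrupted run keeps what it has listed so far
            db.executemany(sql, batch)
            db.commit()
            total += len(batch)
            batch = []
    if batch:
        db.executemany(sql, batch)
        db.commit()
        total += len(batch)
    return total
```

Usage would be something like `store_index(sqlite3.connect("index.db"), list_objects("my-bucket"))`, where the bucket name is a placeholder.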

I also added a very simple CLI (calling it a virtual filesystem would be a stretch) that lets you load the sqlite file and browse the bucket content, summarise "directory" sizes, order by last modification date, etc. It's what starts when calling s3.py without the collect arg.
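
Summarising "directory" sizes from such an index could look roughly like the sketch below. It assumes a hypothetical `objects(key, size)` table, which is not necessarily the schema s3.py uses, and groups keys by the next path component after a prefix.

```python
import sqlite3
from collections import defaultdict
from typing import List, Tuple


def dir_sizes(db: sqlite3.Connection, prefix: str = "",
              delimiter: str = "/") -> List[Tuple[str, int]]:
    """Return (child, total_bytes) for each immediate child of `prefix`, largest first."""
    totals = defaultdict(int)
    # substr() comparison instead of LIKE, so % and _ in keys need no escaping
    rows = db.execute(
        "SELECT key, size FROM objects WHERE substr(key, 1, ?) = ?",
        (len(prefix), prefix),
    )
    for key, size in rows:
        rest = key[len(prefix):]
        # Everything up to and including the next delimiter is the "directory";
        # keys without a further delimiter are counted as plain files
        head, sep, _ = rest.partition(delimiter)
        totals[prefix + head + sep] += size
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `dir_sizes(db, "builds/")` would then list each sub-"directory" under `builds/` with its total size, similar in spirit to `du`.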

Then there is delete.py which I used to delete objects from the bucket, including all versions (our horrible bucket was versioned which made it extra painful). On a versioned bucket it has to run twice, once to delete the file and once to delete the then created version, if I remember correctly - it's been a year since I built this.
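
For reference, a different way to empty a versioned prefix in a single pass is to list every version and delete marker and delete each one by its version ID. The sketch below is an assumption-laden illustration of that approach with boto3 (`list_object_versions` and `delete_objects`), not what delete.py does; the function name is made up.

```python
from typing import Any


def delete_all_versions(client: Any, bucket: str, prefix: str = "") -> int:
    """Permanently delete every version and delete marker under `prefix`.

    `client` is expected to be a boto3 S3 client. Returns the number of
    entries that were deleted without a reported error.
    """
    paginator = client.get_paginator("list_object_versions")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Real object versions and delete markers come back in separate lists
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        objects = [{"Key": e["Key"], "VersionId": e["VersionId"]} for e in entries]
        # delete_objects accepts at most 1000 keys per request
        for i in range(0, len(objects), 1000):
            chunk = objects[i:i + 1000]
            resp = client.delete_objects(
                Bucket=bucket,
                Delete={"Objects": chunk, "Quiet": True},
            )
            # In quiet mode the response only lists the failures
            deleted += len(chunk) - len(resp.get("Errors", []))
    return deleted
```

With a real client this would be called as `delete_all_versions(boto3.client("s3"), "my-bucket", "builds/")`. It is irreversible, so try it on a throwaway prefix first.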

Maybe it's useful for someone.

2 comments

What about the lifecycle stuff?

I thought S3 can move stuff to cheaper storage automatically after some time.

Like I wrote, for us it was a one-off job to find and remove 6+ year old build artifacts that would never be needed again. I just looked for the cheapest way of getting rid of them. I couldn't do it by prefix alone (prod files mixed in the same structure as the build artifacts), which is why delete.py supports patterns (the `--aws-s3-pattern` arg takes a regex).
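
The selection logic is easy to sketch: match each key against a regex and keep only objects older than a cutoff. The function below is a hypothetical illustration, not delete.py itself; whether the real script uses `search` or `match` semantics, and how it handles age, are assumptions here.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator, Optional, Tuple


def select_for_deletion(objects: Iterable[Tuple[str, datetime]], pattern: str,
                        min_age_days: int,
                        now: Optional[datetime] = None) -> Iterator[str]:
    """Yield keys that match `pattern` and were last modified before the cutoff."""
    regex = re.compile(pattern)
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    for key, last_modified in objects:
        # re.search matches anywhere in the key; anchor with ^ to match from the start
        if regex.search(key) and last_modified < cutoff:
            yield key
```

A pattern like `^builds/pr-\d+/` (made up for this example) would pick out per-PR artifacts while leaving prod files under the same top-level structure alone.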

If AWS' own tools work for you, they're surely a better solution than my scripts. Esp. if you need something on an ongoing basis.
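
For the ongoing case, a lifecycle rule might look like the sketch below. The prefix, day counts and helper name are placeholders chosen for illustration. Note that lifecycle filters work on prefix, tags and object size, not on regexes, so this only helps when the artifacts live under their own prefix.

```python
from typing import Any, Dict


def build_lifecycle_config(prefix: str, transition_days: int = 30,
                           expire_days: int = 365) -> Dict[str, Any]:
    """Build a lifecycle configuration for boto3's put_bucket_lifecycle_configuration."""
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to cheaper storage first ...
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                # ... then expire them; on a versioned bucket this adds a delete marker
                "Expiration": {"Days": expire_days},
                # Also remove old versions on versioned buckets
                "NoncurrentVersionExpiration": {"NoncurrentDays": expire_days},
                # Clean up multipart uploads that were never completed
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


def apply_lifecycle(client: Any, bucket: str, config: Dict[str, Any]) -> None:
    """Apply the configuration; `client` is expected to be a boto3 S3 client."""
    client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

Applying it would be `apply_lifecycle(boto3.client("s3"), "my-bucket", build_lifecycle_config("builds/"))`. Be aware that this call replaces any lifecycle configuration already on the bucket.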