| I had a similar issue at my last job. Whenever a user created a PR on our open source project, about 1 GB of artifacts, consisting of hundreds of small files, was created and uploaded to a bucket. There was no process that ever deleted anything. This went on for 7 years and resulted in a multi-petabyte bucket. I wrote some tooling to help me with the cleanup. It's available on GitHub: https://github.com/someengineering/resoto/tree/main/plugins/...
consisting of two scripts, s3.py and delete.py. It's not exactly meant for end users, but if you know your way around Python and S3 it might help. I built it for a one-off purge of old data. s3.py takes an `--aws-s3-collect` arg to create the index: it lists one or more buckets and can store the result in an SQLite file.
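For readers who want the idea without digging through the scripts, here is a minimal sketch of that kind of index. It is not the code from s3.py: the function name `index_bucket`, the `objects` table layout, and the choice to pass the client in as an argument are all assumptions made for illustration. It expects a boto3 S3 client (or anything with the same `get_paginator` interface).

```python
import sqlite3


def index_bucket(s3_client, bucket, db_path):
    """List every object in `bucket` and record it in a SQLite file.

    Hypothetical helper, not part of s3.py. `s3_client` is assumed to be
    a boto3 S3 client, e.g. boto3.client("s3").
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "bucket TEXT, key TEXT, size INTEGER, last_modified TEXT, "
        "PRIMARY KEY (bucket, key))"
    )
    # list_objects_v2 returns at most 1000 keys per call; the paginator
    # follows the continuation tokens for us.
    paginator = s3_client.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=bucket):
        rows = [
            (bucket, obj["Key"], obj["Size"], obj["LastModified"].isoformat())
            for obj in page.get("Contents", [])
        ]
        conn.executemany(
            "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)", rows
        )
        # Commit per page so an interrupted listing keeps what it has so far.
        conn.commit()
        count += len(rows)
    conn.close()
    return count
```

With boto3 installed, the call would be something like `index_bucket(boto3.client("s3"), "my-bucket", "index.sqlite")`, where `my-bucket` is a placeholder. Once the listing is in SQLite, questions like "which prefix is the biggest" become plain SQL queries instead of another multi-day listing.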
In my case the listing of the bucket took almost a week to complete and resulted in an 80 GB SQLite file. I also added a very simple CLI (calling it a virtual filesystem would be a stretch) that lets you load the SQLite file and browse the bucket contents, summarise "directory" sizes, order by last modification date, etc. It's what starts when you call s3.py without the collect arg. Then there is delete.py, which I used to delete objects from the bucket, including all versions (our horrible bucket was versioned, which made it extra painful). On a versioned bucket it has to run twice, once to delete the file and once to delete the version that the first delete creates (S3 calls it a delete marker), if I remember correctly; it's been a year since I built this. Maybe it's useful for someone. |
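As a rough illustration of the versioned-bucket cleanup, the sketch below takes a different route from the two-pass run described above: it lists every version and every delete marker and removes each one by its explicit version ID, so nothing is left behind for a second pass. This is not how delete.py is confirmed to work; the function name `delete_all_versions` and its parameters are assumptions, and `s3_client` is assumed to be a boto3 S3 client.

```python
def delete_all_versions(s3_client, bucket, prefix=""):
    """Permanently delete all object versions and delete markers under `prefix`.

    Hypothetical helper, not part of delete.py. Returns the number of
    entries submitted for deletion and a list of per-key errors.
    """
    paginator = s3_client.get_paginator("list_object_versions")
    submitted = 0
    errors = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Real versions and delete markers come back in separate lists.
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        objects = [
            {"Key": e["Key"], "VersionId": e["VersionId"]} for e in entries
        ]
        # delete_objects accepts at most 1000 keys per request.
        for start in range(0, len(objects), 1000):
            batch = objects[start:start + 1000]
            response = s3_client.delete_objects(
                Bucket=bucket,
                Delete={"Objects": batch, "Quiet": True},
            )
            submitted += len(batch)
            # In quiet mode the response only lists keys that failed.
            errors.extend(response.get("Errors", []))
    return submitted, errors
```

Deleting with an explicit `VersionId` is irreversible, so trying this on a narrow `prefix` first is the cautious way to use it. For a bucket of the size described here, a lifecycle rule that expires noncurrent versions may be cheaper than issuing the delete calls yourself.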