| Its weird how S3 seems to be the unwanted stepchild of AWS. So many obvious innovations just aren't turning up. For example, strangely, AWS introduced tagging for S3 resources, but you can't search/filter by tag, nor is the tag even returned when you get a list of objects, you can only get the tag with an object request. The word "pointless" springs to mind. In fact it's strange that there is NO useful filtering at all apart from the very useful folder/hierarchy/prefix filtering. But apart from that you can't do wildcard searches or filters or date filters or tag filters. I'm building an application right now that needs to get a list of all the jpg files - the only way to do that is get every single object in the bucket and manually filter out the unwanted ones - feels like its 1988 again. It seems like it would also be valuable for there to be alternate interfaces to S3 such as the ability to send data via ftp or SMTP or sftp or whatever, but there are no such interfaces. Hopefully Google will goad AWS into action on S3 innovation by implementing such features. |
I learned this the hard way: We had an application where made the mistake of storing about a billion files in a nearly flat structure — one level of nesting, probably 100m "folders" in the root. Then one day we needed to go through it to prune stuff that was no longer in use. Unfortunately, if you don't have a "shardable" prefix, list requests are impossible to parallelize efficiently (because you can't subdivide the work), and our scripts took weeks to run to completion. Hard-earned experience: If you're storing large quantities of stuff in S3, always pick a shardable prefix. The upload date is a good choice. A random string will also do.
After this, my solution for any non-trivially-sized storage use case is to store an inventory of objects separately in a performant PostgreSQL database, and make sure all writes go through a service layer that shields the consumer from the details of S3. This has some benefits over a hypothetical centralized approach (but some downsides, like the possibility that things get out of sync if you sidestep the inventory). Overall, I wish S3 would store its metadata in something like BigQuery.
Anyone know if Google Cloud Platform's S3 equivalent, Cloud Storage, improves on these issues?