Hacker News
by wingman-jr 988 days ago
For an image classification side project, I use a simple folder system where the images and metadata are both files, with a hash of the image acting as the key/filename, e.g. 123.img and 123.metadata. This gives file independence. Then, as needed, I compile a CSV of all the image-to-metadata mappings and version that. It works because I treat the images as immutable, which is not true for some datasets. On a local SSD, it has scaled to >300K images.

Professionally, I've used something similar, but with S3 storage for the images and a Postgres database for the metadata. This of course scales better beyond a single physical machine, for team interaction.

I'd be curious how others have handled data costs as their datasets grow. The professional dataset got into the terabytes of S3 storage, and it gets a bit more frustrating when you want to move data but are looking at thousands of dollars in projected egress costs... and that's with S3, let alone a more expensive service. In many ways S3 is so much better than a hard drive, but it's hard not to compare it to the relative cost of local storage when the gap gets big enough.
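For anyone wanting to try the folder layout from the side project, here is a rough Python sketch of the idea, not my actual code. It makes several assumptions the description above doesn't pin down: SHA-256 as the hash, JSON as the metadata format, a hypothetical "label" metadata field, and the function names add_image and compile_index.

```python
import csv
import hashlib
import json
from pathlib import Path


def add_image(store: Path, image_bytes: bytes, metadata: dict) -> str:
    """Store an image and its metadata under a content-hash key.

    Assumption: SHA-256 as the hash and JSON as the metadata format.
    """
    store.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(image_bytes).hexdigest()
    # Images are treated as immutable, so the same bytes always map to the
    # same filename and re-adding an image simply rewrites identical content.
    (store / f"{key}.img").write_bytes(image_bytes)
    (store / f"{key}.metadata").write_text(json.dumps(metadata, sort_keys=True))
    return key


def compile_index(store: Path, csv_path: Path) -> int:
    """Compile a CSV with one row per image; returns the number of rows.

    Assumption: each metadata file carries a hypothetical "label" field.
    """
    rows = []
    for meta_file in sorted(store.glob("*.metadata")):
        key = meta_file.stem
        if not (store / f"{key}.img").exists():
            continue  # skip metadata whose image file is missing
        metadata = json.loads(meta_file.read_text())
        rows.append({"key": key, "label": metadata.get("label", "")})
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["key", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The compiled CSV is the only thing that needs versioning. Because filenames are derived from the image bytes, duplicates collapse to a single pair of files, and each image/metadata pair can be copied or deleted without touching the others.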