Hacker News new | ask | show | jobs
by DrJokepu 4226 days ago
That being said, if your relational database has 1GB records, you really need to reevaluate whether you are using the right tool for the job.
2 comments

While I kind of agree, what exactly is the "right tool" for storing large blobs?

You can use the filesystem but then you have to manage data migration/sync by "hand" which is kind of a bitch for distributed systems.

Honestly storing large blobs with meta-data telling you about the blobs is extremely common. I'm yet to hear of how you're meant to do it correctly.

I had to solve this problem at my last job and we had vendors let us know how companies like Apple, Sony, Disney, EA etc have solved it. Basically there are two ways to do it. (1) Store it on some "filesystem". In quotes because how you do it can vary wildly e.g. S3, GlusterFS, Standard directories using DRBD for HA. (2) Take the blob, slice it into pieces, hash it and spread it across a sharded database.

Generally it seems to be that if you have lots of unique large files then use filesystem. But if you have files which are likely to have duplicates then use a database. So a file storage locker may use (1) but a service like iTunes Match would use (2). And IIRC Apple in fact does store at least uploaded music files in Cassandra.

We found that a GlusterFS filesystem is a good way to go for our case of 100s of terabytes of unique files with some failover. Amazon S3 would simply cost far more. The file metadata is in the RDBMS, obviously. But storing the binary data in the RDBMS wouldn't be a good idea, and would choke long before it got to our scale.
Usually storing in filesystem with a reference in database seems to work best. Much easier for database management and backup, and file management as well.

The OP doesn't appear to have hear of SQL Server FileStream or FileTables, which handle this abstraction for you: http://msdn.microsoft.com/en-us/library/ff929144.aspx

On pgsql, the blob type. It stores the object out of the row store, but still in the database.
commonly, s3 or zfs nas for storage and pg for metadata
Commonly ? No not even close. The most common solution is that people just dump it as a BLOB in whatever database they are currently using. Which based on numbers is likely to be MySQL.
That's basically what a client of mine does.
+1