

by tcc619 5803 days ago

"file systems are much better at handling files"

What about batch processing a large number of small files? Say, 10 million image files of 500KB each. A typical file system will need a seek for each small file.

I wonder if GridFS stores small files in blocks to allow efficient batch retrieval for processing.


lobster_johnson 5803 days ago

GridFS is just a standard convention of how to map files to key-value stores like MongoDB -- you can implement GridFS over MongoDB in just a few lines of Ruby code. GridFS breaks files into fixed-size chunks, and uses a single MongoDB document per chunk. It's not exactly rocket science.
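To make the chunking idea concrete, here is a minimal Python sketch. It uses plain dicts in place of MongoDB documents, so nothing here talks to a real database. The field names are modeled on the GridFS convention but should be read as illustrative, and the helper names (`split_into_chunks`, `file_document`, `reassemble`) are made up for this example.

```python
import hashlib

# Assumed chunk size; 256 KB is the default mentioned elsewhere in this thread.
CHUNK_SIZE = 256 * 1024


def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Return one dict ("document") per fixed-size chunk of data."""
    return [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]


def file_document(file_id, filename, data, chunk_size=CHUNK_SIZE):
    """Return a metadata dict describing the whole file."""
    return {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
        "md5": hashlib.md5(data).hexdigest(),
    }


def reassemble(chunks):
    """Rebuild the original bytes by concatenating chunks in order of n."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))
```

Only the last chunk can be shorter than `chunk_size`, and sorting by `n` is what lets the chunks be stored or fetched in any order.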

The author of the blog post touts it as a _feature_ of MongoDB, but it's more accurate to say that it's an artifact of MongoDB's 4MB document size limit -- you simply cannot store large files in MongoDB without breaking them up. Sure, by splitting files into chunks you can parallelize loading them, but that's about the only advantage.

Among the key-value NoSQL databases, Cassandra and Riak are much better at storing large chunks of data -- neither has a specific limit on the size of objects. I have used both successfully to store assets such as JPEGs, and both are extremely fast on reads and writes.

Neither is built for that purpose, though: both load an entire object into memory instead of streaming it, so with lots of concurrent queries you will simply run out of memory at some point -- 10 clients each loading a 10MB image at the same time will have the database peak at 100MB at that moment.

Actually, Riak uses dangerously large amounts of memory when just saving a number of large files. I don't know if that's because of Erlang's garbage collector lagging behind, or what; I would be worried about swapping or running out of memory when running it in a production system.


mathias_10gen 5803 days ago

You actually list one of the advantages of GridFS right there in your post: streaming. If you are serving up a 700MB video, you don't want to have to load the whole thing into memory or push the whole thing to the app server before you can start streaming. Since we break the files into chunks, you can start sending data as soon as the first chunk (256KB by default) is loaded, and only need to have a little bit in RAM at any given moment. (Although obviously the more you have in RAM, the faster you will be able to serve files.)
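A rough Python sketch of that streaming pattern, not tied to any real driver: `fetch_chunk` is a hypothetical callback that returns the bytes of chunk number `n` (for instance by querying for that single document), and `stream_file` is a name invented for this example.

```python
def stream_file(fetch_chunk, length, chunk_size=256 * 1024):
    """Yield a file's bytes one chunk at a time.

    fetch_chunk(n) is a hypothetical callback returning the bytes of chunk n;
    length is the total file size in bytes.
    """
    n_chunks = (length + chunk_size - 1) // chunk_size  # ceiling division
    for n in range(n_chunks):
        # A chunk is fetched only when the consumer asks for the next piece.
        yield fetch_chunk(n)
```

Because this is a generator, the first chunk can go out to the client before any later chunk has been read, and the generator itself never holds more than one chunk at a time.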


kchodorow 5803 days ago

GridFS is simple (and probably could be implemented with most DBs) but it was designed to have some nice properties. Notably, you don't have to load the whole file into memory: you can stream it back to the client. You can also easily get random access to the file, not just sequential.
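Random access falls out of the fixed chunk size: integer division of the byte offset gives the chunk number. A small Python sketch, assuming a hypothetical `fetch_chunk(n)` callback that returns the bytes of chunk `n` and a caller that keeps the requested range inside the file (`read_range` is a name invented here):

```python
def read_range(fetch_chunk, offset, size, chunk_size=256 * 1024):
    """Read size bytes starting at offset, touching only the chunks needed."""
    if size <= 0:
        return b""
    first = offset // chunk_size               # chunk holding the first byte
    last = (offset + size - 1) // chunk_size   # chunk holding the last byte
    buf = b"".join(fetch_chunk(n) for n in range(first, last + 1))
    start = offset - first * chunk_size        # position of offset inside buf
    return buf[start:start + size]
```

Seeking into the middle of a large file this way costs only the chunks that overlap the requested range, not everything before it.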
