Hacker News new | ask | show | jobs
by orf 3676 days ago
I think it's rarely a good idea to store blobs of any kind in the database. I've seen systems that store pretty large files as blobs (even base64 encoded ones once), then do 'select *' on the table and wonder why their query performance is so terrible. Use a filesystem, that's what it's for.

For stuff like this then I would say it's always preferable to store a json encoded representation rather than a format like pickle (python's object serialization format). If you don't and some clever chap works out a way to write input to that field then you've got an easy RCE. Plus it's easier to debug JSON, and databases like PG have a native data type for it.

5 comments

it's always preferable to store a json encoded representation

That's just the format du jour. Ten years ago it would have been "store an XML encoded representation" and ten years before that it would have been some delimited representation. Tomorrow it may be yaml or something even more hideous.

Blobs in the DB can make sense in some situations but they should really be blobs: images, or other binary/raw data. But beyond a certain size it's almost always easier to manage things like these in a filesystem and just store a pointer (filename) in the DB.

I don't agree that it's easier to store them in filesystem. You'll have to deal with a lot of potential problems: consistency, backup, transactions, replication, corruption. Database solves those problems automatically and as long as you can store everything there, you better do that. Good databases are not that bad at storing blobs.
But there are database tools emerging to deal with JSON, like Postgres (and Mongodb.) I haven't used Mongo in production, but postgres's json support is fantastic and materially better than anything I have seen with XML. Sometimes you really do need to store nearly schemaless data and operate on it for some time.
As reference there are/were tools in something like MSSQL for dealing with XML.
mysql has had xml/xpath query functions for a while too, iirc.
> That's just the format du jour. Ten years ago it would have been "store an XML encoded representation" and ten years before that it would have been some delimited representation.

That's called progress. We can't store it in some future language, who wants to store it as XML, so let's use the best of what we have.

JSON is receiving pretty much the same hype that XML did 15 years ago. I never got really understood the hype about it then and I don't get it with JSON now. I find XML better for some things and JSON better for others. Neither is perfect.
Never use filenames in a database. There's a bottomless well of security vulnerabilities that spring forth when you decide to "just store a path to the asset".
Of course, you don't take for granted what the client says. You generate your own file name.
Select * is the problem, not the blobs. Storing blobs should be more efficient that putting them in a filesystem somewhere else (which is effectively just another database) and dealing with the overhead of a bunch of other filesystem operations and losing referential integrity, etc.
How do you figure that? In addition to making database performance less predictable and introducing all of the problems that BLOBs bring, you lose most of the benefits of the database in the process.

File systems are about storing files. Databases are about intelligently organizing data for retrieval and reliably delivering atomic transactions.

Any system that I've seen scale up well separated blob data to a traditional or object file system. In addition to scaling the database more effectively, this allowed the infrastructure teams to optimize delivery of blob data from a platform POV.

There are a few cases where it makes sense to store files in the database, but the constraints are fairly specific. The one time I did it to god effect was when all the files were fairly small (<40k), and one of the defining features of the system needed to be it's resilience. We were able to fold the file storage into the normal master/slave replication setup we were doing, which was a big reduction in complexity, compared to a separate replicating file store.
Sounds like you scoped it well. I see that as a similar use case to putting crypto keys or user pictures in LDAP.

Often folks doing this try to re-invent a content management system like FileNet in the DB.

In our case, it was for storage of electronically signed documents. Really it was an HTML template (the same displayed to them) with the inputs replaced with the values they presented, converted to PDF, and attached to the account. A few pages of PDFs like that doesn't take much room, and ensuring there isn't a mixup with files and accounts when it's for regulatory compliance makes it well worth any downsides.
It's easier to manage storage for file system objects than in-database objects. Things like performance (potentially on a per-file basis using symlinks), cost (likewise), out of band access (e.g. serving statically directly from web server and not bottlenecking on a DB connection), fragmentation, free space recovery on deletion, etc.
> and dealing with the overhead of a bunch of other filesystem operations and losing referential integrity, etc.

Yeah, better to instead deal with the overhead of the database combined with the overhead of the filesystem! Referential integrity is also very easy to deal with.

This! Please!

I work at a place that has been trying to undo the damage of having serialized BLOBs for about 3 years. Granted, these are especially nasty things with a custom serialization layer written in Java.

Software companies consist of two assets produced by it's employees: data and algorithms to make the data useful.

Why would you ever lock your data down and make it harder to make cool algorithms to make the data useful?

Plus using something language agnostic allows you to employ polyglot programming. If your data is all pickled (or Java serialized, or really anything else that's "native"), good luck using anything besides the original language.
> I think it's rarely a good idea to store blobs of any kind in the database.

Stop right there. There are plenty of examples for storing blobs in a datastore - images and videos are 2 prime examples.

I think what you should say is in a SQL database. Google's AppEngine with their datastore makes it dead easy and with high performance to store&retrieve things like images and videos.