Hacker News new | ask | show | jobs
When should you store serialized objects in the database? (2010) (percona.com)
54 points by harshasrinivas 3673 days ago
19 comments

Bad ideas from 5, 10, er 20, er 30 years ago are stil bad ideas.

I know the HN police will cite me for no citation, so I'd say it comes with experience. The law changed at one point, and we were legally bound to be able to locate a customers record by a piece of data in a blob. The only way to fix the problem was to dump the massive (1tb+) table and reinsert them into a real schema. The engineering effort to do this took 9 months to get it right, because other people changed the way blobs were written out over the course of years.

Being clever doesnt pay, again.

> Bad ideas from 5, 10, er 20, er 30 years ago are stil bad ideas.

I agree 100%.

I read that title and expected the post to begin with "Never. You should never serialize objects into a single field." I was disappointed.

If you need schemaless storage, use a schemaless DB. I don't understand what's so difficult about that. I wouldn't try to shove unstructured data into PostgreSQL any more than I'd try to shove relational data into MongoDB.

HAHA, I tend to agree that 'never' is the only good answer.
It seems the article was focused purely on the performance implications of the decision, and nothing about the maintenance and architectural impact.
When my previous employer forced my team to implement storage of structured data as a serialized BLOB (on top of a system which used to store the data the "right" way), I turned in my resignation.

Background: we had been storing other, similar data in a structured way for years, so we had a system set up to do it right. I'm not sure what the rationale was for switching, but it was declared by fiat over the protests of a team of five experienced .NET developers and an experienced lead DBA. We began experiencing problems from it before we were even a month into the project, such as serialization output not agreeing between client apps (.NET serialization is NOT designed to be a shared archive format!!), implementation requiring breaking the separation of concerns between layers of our application, etc. And for what? When asked what we would do when the format changes, management cheerfully replied "oh we'll just write + run a conversion EXE to update the data in bulk. Why, we do that all the time in [other engineering team who cowboy-codes everything and operates with a level of technical debt that makes it suck to work on that codebase]."

Of course this wasn't the only reason I was resigning, but it made the decision easier!

There are so many ways to accomplish this other than what you did. I assume there was a reason for your choice, but it would have been so much easier if you could just create a separate table that matched the same primary key as the customer record, and contained a single other field, the data required to be searchable. Easy to join and search, easy to insert and update.

> reinsert them into a real schema

That sounds sort of like you decided to fix a bunch of problems at the same time...

There are so many ways to accomplish this other than what you did. [...] That sounds sort of like you decided to fix a bunch of problems at the same time

A real database with real schema involves a series of guaranteed logical relations. Each violation of these logical relations tends to result in a different kind of problem (if you have a relation requiring pairing cars and drivers, you could have the problem of cars without drivers and drivers without cars. The potential logical problems multiply as the effective schema grows, without you have explicit schema or not).

So basically the move of using a real schema fixes a wide variety of real and potential problems compared to ad-hoc solutions. There are many ad-hoc solutions but since these aren't guaranteeing logical relations, such solutions tend to have holes the appear later.

So the gp may have been forced to use a real schema based on the multiplication of problems or they may have just done it because it was the right thing.

Every decision is a trade-off between up-front work and later eventual work. Given "The law changed at one point, and we were legally bound to be able to locate a customers record by a piece of data in a blob.", just a few of the possible solutions might include:

- Move to a fully relational schema:

- - Pros: You can leverage the power of the database to efficiently index, search, and aggregate any specific column without too much trouble.

- - Cons: Schema change may be expensive. Requires processing all data into a new structure. Requires massive changes to large swaths of your data handling routines. Must be done all at once, or you need to run both systems in parallel for a while during conversion. In this case, took 9+ months.

- Add a field on that table:

- - Pros: Simple. Requires very little change in processing routines. Efficient. Can rely on DB typing to require data be present for a record.

- - Cons: Depending on database may require excessive downtime while schema is updated.

- Add a table that tracks this field and links to the main record.

- - Pros: Simple. Requires relatively little change in processing routines. Efficient. Zero downtime. Can fill old records with background process.

- - Cons: Hard to enforce that the data exists for every record. Can be mitigated with report generated for records without this link.

Now, given those choices, I would say the correct choice in any particular situation depends quite a bit on external constraints. Do you have 9+ months and the free developer time to essentially redesign what may be large swaths of your back-end? Do you have assurance (tests, language features, etc) that you won't increase enough bugs to negate the benefits of moving to a fully relational schema? Is there a looming deadline on when the works needs to be completed by? Is downtime not really an option, and/or is building a parallel data store for the migration not feasible? How useful is normalizing the data expected to be in the short, medium and long term?

Like I said, I assume they had a reason for the full relational migration. I was just pointing out that there are easier solutions that in some cases fit business needs better than that. If your company goes under because your large migration project stole focus and manpower from other needed projects when there were simpler solutions available, then you made the wrong decision, period.

Broadly, software is about tradeoffs.

But the relational model is essentially so good an approach that if you are building something that's close to a relational model but not fully relational, you're almost certainly creating more present and future potential problems than if you moved to fully relational model (edit: and that's not saying other approaches are bad if you have a problem a ways away from the relational domain).

Neither of us know the particulars of gp's problem and I so couldn't say if you're alternatives are better tradeoffs. However, I would wager if someone wound-up adding several fields and tables to an existing blob-filled database, the result would haunt it's creator on many lonely nights at the office.

actually we save history data as json inside postgresql and article data as json and we have a price table that adds a history as a postgresql trigger. that's actually not blob data but its a kind of serialization. However we access the data regulary. The Price History is exposed to the user so it needs to work. Our system needs to work even against older versions of the table.
If you use an hammer to kill a mosquito obviously it's a bad idea. The hammer is useful to put a nail in the wall. In my current job I introduced a configuration store based on serialization. Obviously, instead of storing everything in a blob, I created a table with some generic string id columns and a generic key that is an object serialized in JSON and a value that is the configuration object serialized in JSON. In this way I can have the best of both worlds, a generic configuration store that is also indexed on the string ids and searchable on the generic key.
The article is not talking about config files, which is one of the few valid reasons to do this, as with a config file you're almost always going to just want the whole thing once at initialization.

And even then, only if there's loads of config values. If you've only got 5 or 10, that solution is bad.

The article is implicitly talking about business objects.

I never mentioned config files. The article explicitly states: "A good example of this optional nature of data is user preferences – you only really need to store the settings that differ from the default values." User settings is a good candidate for a configuration object in the configuration store that I built. The primary id will be the user id, the secondary id the machine name (if the system needs to support different configurations on different machines) and the rest will be a CLOB containing the JSON serialized user configuration.
> and a value that is the configuration object serialized in JSON

It all depends on the data, but if it's not simple, then serialized JSON values would generally incur performance hit for search operations. Breaking out the data into separate columns could be better indexed.

No, there is no whatsoever performance hit because the identifiers are in separate columns in the same table, and they can be indexed normally. There will be a performance hit if for some future requirement all the ids column are exhausted, they start using the generic key with complex serialized objects AND they want the generic key to be searchable. At that point they can simply add another id column if it is really necessary.
I think it's rarely a good idea to store blobs of any kind in the database. I've seen systems that store pretty large files as blobs (even base64 encoded ones once), then do 'select *' on the table and wonder why their query performance is so terrible. Use a filesystem, that's what it's for.

For stuff like this then I would say it's always preferable to store a json encoded representation rather than a format like pickle (python's object serialization format). If you don't and some clever chap works out a way to write input to that field then you've got an easy RCE. Plus it's easier to debug JSON, and databases like PG have a native data type for it.

it's always preferable to store a json encoded representation

That's just the format du jour. Ten years ago it would have been "store an XML encoded representation" and ten years before that it would have been some delimited representation. Tomorrow it may be yaml or something even more hideous.

Blobs in the DB can make sense in some situations but they should really be blobs: images, or other binary/raw data. But beyond a certain size it's almost always easier to manage things like these in a filesystem and just store a pointer (filename) in the DB.

I don't agree that it's easier to store them in filesystem. You'll have to deal with a lot of potential problems: consistency, backup, transactions, replication, corruption. Database solves those problems automatically and as long as you can store everything there, you better do that. Good databases are not that bad at storing blobs.
But there are database tools emerging to deal with JSON, like Postgres (and Mongodb.) I haven't used Mongo in production, but postgres's json support is fantastic and materially better than anything I have seen with XML. Sometimes you really do need to store nearly schemaless data and operate on it for some time.
As reference there are/were tools in something like MSSQL for dealing with XML.
mysql has had xml/xpath query functions for a while too, iirc.
> That's just the format du jour. Ten years ago it would have been "store an XML encoded representation" and ten years before that it would have been some delimited representation.

That's called progress. We can't store it in some future language, who wants to store it as XML, so let's use the best of what we have.

JSON is receiving pretty much the same hype that XML did 15 years ago. I never got really understood the hype about it then and I don't get it with JSON now. I find XML better for some things and JSON better for others. Neither is perfect.
Never use filenames in a database. There's a bottomless well of security vulnerabilities that spring forth when you decide to "just store a path to the asset".
Of course, you don't take for granted what the client says. You generate your own file name.
Select * is the problem, not the blobs. Storing blobs should be more efficient that putting them in a filesystem somewhere else (which is effectively just another database) and dealing with the overhead of a bunch of other filesystem operations and losing referential integrity, etc.
How do you figure that? In addition to making database performance less predictable and introducing all of the problems that BLOBs bring, you lose most of the benefits of the database in the process.

File systems are about storing files. Databases are about intelligently organizing data for retrieval and reliably delivering atomic transactions.

Any system that I've seen scale up well separated blob data to a traditional or object file system. In addition to scaling the database more effectively, this allowed the infrastructure teams to optimize delivery of blob data from a platform POV.

There are a few cases where it makes sense to store files in the database, but the constraints are fairly specific. The one time I did it to god effect was when all the files were fairly small (<40k), and one of the defining features of the system needed to be it's resilience. We were able to fold the file storage into the normal master/slave replication setup we were doing, which was a big reduction in complexity, compared to a separate replicating file store.
Sounds like you scoped it well. I see that as a similar use case to putting crypto keys or user pictures in LDAP.

Often folks doing this try to re-invent a content management system like FileNet in the DB.

In our case, it was for storage of electronically signed documents. Really it was an HTML template (the same displayed to them) with the inputs replaced with the values they presented, converted to PDF, and attached to the account. A few pages of PDFs like that doesn't take much room, and ensuring there isn't a mixup with files and accounts when it's for regulatory compliance makes it well worth any downsides.
It's easier to manage storage for file system objects than in-database objects. Things like performance (potentially on a per-file basis using symlinks), cost (likewise), out of band access (e.g. serving statically directly from web server and not bottlenecking on a DB connection), fragmentation, free space recovery on deletion, etc.
> and dealing with the overhead of a bunch of other filesystem operations and losing referential integrity, etc.

Yeah, better to instead deal with the overhead of the database combined with the overhead of the filesystem! Referential integrity is also very easy to deal with.

This! Please!

I work at a place that has been trying to undo the damage of having serialized BLOBs for about 3 years. Granted, these are especially nasty things with a custom serialization layer written in Java.

Software companies consist of two assets produced by it's employees: data and algorithms to make the data useful.

Why would you ever lock your data down and make it harder to make cool algorithms to make the data useful?

Plus using something language agnostic allows you to employ polyglot programming. If your data is all pickled (or Java serialized, or really anything else that's "native"), good luck using anything besides the original language.
> I think it's rarely a good idea to store blobs of any kind in the database.

Stop right there. There are plenty of examples for storing blobs in a datastore - images and videos are 2 prime examples.

I think what you should say is in a SQL database. Google's AppEngine with their datastore makes it dead easy and with high performance to store&retrieve things like images and videos.

I was on the DBA team at FB and I spent the better part of a year working on the deployment system for online schema change. It was a pain. Other companies have done quite a bit of work on this as well (Shift from Square, etc...).

Later on I joined Pinterest as their first MySQL DBA. They had copied the sharding system from FB, but instead of having a bunch of columns, they just stored a JSON blob. This saved them from learning how to perform schema change until I joined the company. This is a pretty incredible feature.

We have a new feature under development (which will be open sourced as part of Percona MySQL) which will allow column level compression with an optional predefined dictionary. During testing, this resulted in a 30% additional reduction in spaced consumed versus InnoDB page compression AND doubles our peak QPS at lower latency. This would not work well with many individuals columns, but kicks ass for JSON blobs.

http://www.slideshare.net/denshikarasu/less-is-more-novel-ap... (slides 37, 40, 41, 42)

For anybody using sqlite, they have good documentation about this very question: https://www.sqlite.org/intern-v-extern-blob.html
Is this really the same thing? This page compares storing blobs in sqlite vs in a separate file.

I think a better sqlite page about the concept of serializing things in your database is the fact that sqlite has json support.

https://www.sqlite.org/json1.html

That's with a loadable JSON1 extension, which one won't find e.g. in Android.

Though checking now for this, someone has packaged a later version of SQLite with this extension [1]. I wonder if there is any possible performance advantage when using a system provided SQLite vs. one installed with the application?

[1] https://github.com/requery/sqlite-android

Note to self: should have read the whole project description, there's a performance chapter answering this exact question.
It makes sense to store serialized data structures in the database when these conditions apply:

1. There are no use cases that would require you to SELECT on the fields in the serialized data structures.

2. You anticipate that the data structures are going to change frequently during development, so that turning them into relations is going to involve a lot of schema migrations.

Basically you give up the possibility of being able to SELECT on some of the data in return for being able to change its format rapidly and cheaply.

I worked on a project recently where this was helpful -- when I designed the database schema I didn't know the details of many of the data structures that were going to have to be stored there. From the use cases I could deduce the set of fields that would need to be SELECTed on, but the other fields were ill-defined. By storing them as blobs (actually as JSONB fields, since this was PostgreSQL) I could safely defer the decision about how to design these parts of the database, without incurring lots of schema migrations along the way.

Be very careful with this logic. I'm in the middle of cleaning up a project where the original developer did this calculus and stored a lot of metadata as JSON strings in text fields. Now, three years out, we have new reporting requirements that need to be able to filter on some of that metadata.

Experience has taught me that it's next to impossible to know what types of queries will be needed for the entire lifetime of a project. If you absolutely must store data like this, please leave a clear migration path for moving some or all of it into well structured tables when (not if) it becomes necessary.

I understand your concern. Perhaps if I give an example, it will be clear why I followed the approach I did.

The project is an industrial control system. Different kinds of item are processed by the system. Each kind of item needs image acquisition parameters, for example field of view. There's no use case that needs to select kinds of item whose image capture parameters specify a particular field of view, and it's clear that such a use case is very unlikely to arise — the image acquisition parameters are part of the internal specification of the system, not something that any end user is interested in.

At database design time it wasn't settled which model of camera was going to be used, and so it was far from settled which parameters were going to be needed. Using a JSONB field for the image acquisition parameters meant that I could leave the specification of these parameters to the camera programmer, without having to incur a series of database migrations as the correct set of parameters were worked out.

Even in your project, where new reporting requirements eventually arose, it doesn't mean the original decision was necessarily a bad one. Sometimes it makes sense to incur technical debt in order to bring a product to market in a timely fashion.

You can get away with it in Postgres. The app I work on stores phone numbers in a JSONB array in the following format:

    [{
      "tags": ["cell"],
      "number": "1231231234"
    }]
Here's a snippet demonstrating how you can do a lateral left join on the column to find the number tagged 'cell' in tags array:

    select * from mytable t
    left join lateral (
      select phone->'number' as cell_phone from
      jsonb_array_elements(t.phone_numbers) phone
      where phone->'tags' @> '["cell"]'
    ) p on true;
I am working on a system where the data is serialized using Python's pickle and stored in the database. Absolute nightmare for debugging as its basically unreadable.
Can't you open a Python shell and unpickle it?
Gave me an error the last time I tried to that.

Plus even if it did work, it involves logging into the server, activating my python virtualenv, pulling the data out via the python / Django shell an unpickling and printing it. As opposed to running a query on my local machine connecting to the database. When you are debugging a problem and just want to get an overview of what is happening, that is a hell of a lot of hassle.

Yes, maybe use a database that allows you the best of both worlds, a serialised blob that happens to be queryable and generally really high performance.
Queryable and even indexable! Hardly a trade-off at all, really.
I was working on a massive CRUD project - hundreds of end-user customisable textarea fields. Despite this sounding like a project perfect for a NoSQL database, the in-house team that would be maintaining it were MySQL experts, and didn't want to support MongoDB or anything like that.

So, yep, we stored everything as serialized objects in the database. We had a separate table for 'change events', and whenever someone changed the contents of one of the textareas we stored it in that table. A worker would eventually update the the serialized object, but in the meantime, we would load the serialized object and apply all of the changes that had happened since it was last updated. Basically, an Event Sourcing pattern.

So it was either that, or the EAV route, or the 'end user altering the database' route. The latter two options sounded even worse. Our solution worked out pretty well. Admittedly, it only had a few hundred concurrent users, and even the biggest document was never more than 100K.

So, it can work, but YMMV.

We serialized some XML as a backup to the data we were extracting (and properly modeling), given that the vendor was prone to changing the schema without proper notification, and that there were some data elements we weren't using at that time.

We also built the necessary tools to extract/re-process records easily, and the architecture worked well for us. As our needs or the schema changed we could easily accommodate those changes without undue effort.

It doesn't directly address the question, and I'm not sure that I'd use the same solution if the volume were predicted to be significantly higher, but in our case it worked beautifully. (Happily our volume was predictably within a known range, for reasons I won't go into.)

(2010)

Uber moved to a similar architecture in 2014-2015, ~5 years after article and the original Friendfeed post. Being able to operate MySQL predictably at scale is extremely valuable to high-growth companies, enough to tilt in the favor of unconventional schema choices versus less proven NoSQL alternatives.

https://eng.uber.com/schemaless-part-one/

https://eng.uber.com/schemaless-part-two/

https://eng.uber.com/schemaless-part-three/

In my experience, databases outlast the applications built on top of them, so it makes no sense to cut corners on the data modelling.

Except, of course, if the data only exists to support the application (some sort of buffer, cache, or session storage).

> databases outlast the applications built on top of them

sure but this needs to be balanced with performance UNTIL then

This is the consequence of the chosen programming language not being adapted to work with a relational database. The language is made to work with objects, and the access to DB is clumsy, trough library functions, which are deeply encapsulated, and the language and the DB are two different worlds. In SAP's ABAP language relational database access is integrated into the language. In SAP, when you create a database table, the structure of that table will be automatically available to any program as a structure datatype. So if you change a table definition, you will also change the data type used by the programs. Doing table changes is supported by a database tool that will automatically copy records from the old table to the new one if necessary. Its easy to find all references to a DB table, and recompile the sources. Its actually done automatically when both program and DB structure changes are deployed. Whatever change the developer does in the development system, that change will be automatically adjusted in the productive system on deployment. This makes any table structure change pretty easy, the development environment takes care of that. Wherever SAP applications use blobs to store data (for example HR payroll), those are the worst to develop with. Doing a non-simple change on a TB big table would surely cause disruption in a SAP system too, but other techniques are available for those cases.
"If the application really is schema-less and has a lot of optional parameters that do not appear in every record, serializing the data in one column can be a better idea than having many extra columns that are NULL."

Why not just use a document oriented database instead? Seems like a good use case for MongoDB for example: https://www.mongodb.com/compare/mongodb-mysql

Also the assumption is that you don't need to report on the data. If you need to report on the data then you might need to create index on those columns for performance which you can't do on a blob.
Agree, storing binaries in database is generally a bad idea. It would be a really miserable idea if it were being done without a sane persistence API. In the Python world, ZODB - http://www.zodb.org/en/latest/ is tightly coupled with the language but works reasonably well in practice. The storage layer is pluggable and https://pypi.python.org/pypi/RelStorage provides storing pickles in RDBMS.

ZODB is arguable a novel approach to persistence using Python. And certainly worth taking some time just to play with it -- the barrier of entry low, e.g. `pip install`. But for each positive there are negatives..

"You got it buddy: the large print giveth, and the small print taketh away"

From my personal experience, I completely agree.

For an academic project, I was calculating ~2mln RNA-RNA interactions from their sequences. Since this calculation stays a requirement for all further calculations, being the naïve kid I was, I started pickle'ing the results.

To feel like the cool kid, I wanted to involve a database somehow -- so after trying out a bunch of options, I finally settled for ZODB. As the project scaled up, soon the ZODB started being a big pain, because as I recall, it only allows a limited number of connections even in the read operations.

Lesson learned, though, it now resides as a lookup table in a PgSQL instance.

Link about FriendFeed (mentioned in the blog): http://web.archive.org/web/20100314211658/http://bret.appspo...
What about this reason? What if your program is pretty much entirely used from a JSON REST service? What if these JSON objects need to also be sent between machines? What if they need to be exportable to files sometimes?

Now imagine the same program also has an internal database where these JSON objects can be imported and used. Does it make sense that, when actually in use, these objects are relational and split between 10 complicated tables? Why should someone bother writing complex import/export conversion functions, maintaining them in the future, and having worse performance. Wouldn't it be much simpler, maintainable, and faster to just plop the JSON in the database?

Certainly, so long as your database speaks JSON fluently you can even have your cake and eat (some) of it too. PostgreSQL has good JSON support. Microsoft's SQL Server is "working on it". Then there are loads of JSON friendly document databases out there such as Couch{DB, base}, Cloudant, Mongo, Redis, etc and so forth.
So, my question after storing xml in a database and using their xml features to build indexes is this. Are you looking for a database or a search engine? what will you gain from a database if you are not using it's features, over saving to disk and building a query index?
What you would gain is the ease, simplicity, and stability of using your favorite sql engine. You get ACID transactions and the ability to add and remove from large lists using low memory, for free.
Replication is a big advantage that's hard to replicate to the same degree with ZFS or rsync.
Proper databases have JSON or other serialized field types. Even mysql 5.7 has some support for this. There is no reason you should hit a limit on one of your tables at a few hundred thousand records because you're an idiot and stored a ton of serialized data in mysql. It happens all the time though.
The answer is, of course, almost never.