Hacker News new | ask | show | jobs
by exabrial 3673 days ago
Bad ideas from 5, 10, er 20, er 30 years ago are stil bad ideas.

I know the HN police will cite me for no citation, so I'd say it comes with experience. The law changed at one point, and we were legally bound to be able to locate a customers record by a piece of data in a blob. The only way to fix the problem was to dump the massive (1tb+) table and reinsert them into a real schema. The engineering effort to do this took 9 months to get it right, because other people changed the way blobs were written out over the course of years.

Being clever doesnt pay, again.

5 comments

> Bad ideas from 5, 10, er 20, er 30 years ago are stil bad ideas.

I agree 100%.

I read that title and expected the post to begin with "Never. You should never serialize objects into a single field." I was disappointed.

If you need schemaless storage, use a schemaless DB. I don't understand what's so difficult about that. I wouldn't try to shove unstructured data into PostgreSQL any more than I'd try to shove relational data into MongoDB.

HAHA, I tend to agree that 'never' is the only good answer.
It seems the article was focused purely on the performance implications of the decision, and nothing about the maintenance and architectural impact.
When my previous employer forced my team to implement storage of structured data as a serialized BLOB (on top of a system which used to store the data the "right" way), I turned in my resignation.

Background: we had been storing other, similar data in a structured way for years, so we had a system set up to do it right. I'm not sure what the rationale was for switching, but it was declared by fiat over the protests of a team of five experienced .NET developers and an experienced lead DBA. We began experiencing problems from it before we were even a month into the project, such as serialization output not agreeing between client apps (.NET serialization is NOT designed to be a shared archive format!!), implementation requiring breaking the separation of concerns between layers of our application, etc. And for what? When asked what we would do when the format changes, management cheerfully replied "oh we'll just write + run a conversion EXE to update the data in bulk. Why, we do that all the time in [other engineering team who cowboy-codes everything and operates with a level of technical debt that makes it suck to work on that codebase]."

Of course this wasn't the only reason I was resigning, but it made the decision easier!

There are so many ways to accomplish this other than what you did. I assume there was a reason for your choice, but it would have been so much easier if you could just create a separate table that matched the same primary key as the customer record, and contained a single other field, the data required to be searchable. Easy to join and search, easy to insert and update.

> reinsert them into a real schema

That sounds sort of like you decided to fix a bunch of problems at the same time...

There are so many ways to accomplish this other than what you did. [...] That sounds sort of like you decided to fix a bunch of problems at the same time

A real database with real schema involves a series of guaranteed logical relations. Each violation of these logical relations tends to result in a different kind of problem (if you have a relation requiring pairing cars and drivers, you could have the problem of cars without drivers and drivers without cars. The potential logical problems multiply as the effective schema grows, without you have explicit schema or not).

So basically the move of using a real schema fixes a wide variety of real and potential problems compared to ad-hoc solutions. There are many ad-hoc solutions but since these aren't guaranteeing logical relations, such solutions tend to have holes the appear later.

So the gp may have been forced to use a real schema based on the multiplication of problems or they may have just done it because it was the right thing.

Every decision is a trade-off between up-front work and later eventual work. Given "The law changed at one point, and we were legally bound to be able to locate a customers record by a piece of data in a blob.", just a few of the possible solutions might include:

- Move to a fully relational schema:

- - Pros: You can leverage the power of the database to efficiently index, search, and aggregate any specific column without too much trouble.

- - Cons: Schema change may be expensive. Requires processing all data into a new structure. Requires massive changes to large swaths of your data handling routines. Must be done all at once, or you need to run both systems in parallel for a while during conversion. In this case, took 9+ months.

- Add a field on that table:

- - Pros: Simple. Requires very little change in processing routines. Efficient. Can rely on DB typing to require data be present for a record.

- - Cons: Depending on database may require excessive downtime while schema is updated.

- Add a table that tracks this field and links to the main record.

- - Pros: Simple. Requires relatively little change in processing routines. Efficient. Zero downtime. Can fill old records with background process.

- - Cons: Hard to enforce that the data exists for every record. Can be mitigated with report generated for records without this link.

Now, given those choices, I would say the correct choice in any particular situation depends quite a bit on external constraints. Do you have 9+ months and the free developer time to essentially redesign what may be large swaths of your back-end? Do you have assurance (tests, language features, etc) that you won't increase enough bugs to negate the benefits of moving to a fully relational schema? Is there a looming deadline on when the works needs to be completed by? Is downtime not really an option, and/or is building a parallel data store for the migration not feasible? How useful is normalizing the data expected to be in the short, medium and long term?

Like I said, I assume they had a reason for the full relational migration. I was just pointing out that there are easier solutions that in some cases fit business needs better than that. If your company goes under because your large migration project stole focus and manpower from other needed projects when there were simpler solutions available, then you made the wrong decision, period.

Broadly, software is about tradeoffs.

But the relational model is essentially so good an approach that if you are building something that's close to a relational model but not fully relational, you're almost certainly creating more present and future potential problems than if you moved to fully relational model (edit: and that's not saying other approaches are bad if you have a problem a ways away from the relational domain).

Neither of us know the particulars of gp's problem and I so couldn't say if you're alternatives are better tradeoffs. However, I would wager if someone wound-up adding several fields and tables to an existing blob-filled database, the result would haunt it's creator on many lonely nights at the office.

actually we save history data as json inside postgresql and article data as json and we have a price table that adds a history as a postgresql trigger. that's actually not blob data but its a kind of serialization. However we access the data regulary. The Price History is exposed to the user so it needs to work. Our system needs to work even against older versions of the table.
If you use an hammer to kill a mosquito obviously it's a bad idea. The hammer is useful to put a nail in the wall. In my current job I introduced a configuration store based on serialization. Obviously, instead of storing everything in a blob, I created a table with some generic string id columns and a generic key that is an object serialized in JSON and a value that is the configuration object serialized in JSON. In this way I can have the best of both worlds, a generic configuration store that is also indexed on the string ids and searchable on the generic key.
The article is not talking about config files, which is one of the few valid reasons to do this, as with a config file you're almost always going to just want the whole thing once at initialization.

And even then, only if there's loads of config values. If you've only got 5 or 10, that solution is bad.

The article is implicitly talking about business objects.

I never mentioned config files. The article explicitly states: "A good example of this optional nature of data is user preferences – you only really need to store the settings that differ from the default values." User settings is a good candidate for a configuration object in the configuration store that I built. The primary id will be the user id, the secondary id the machine name (if the system needs to support different configurations on different machines) and the rest will be a CLOB containing the JSON serialized user configuration.
> and a value that is the configuration object serialized in JSON

It all depends on the data, but if it's not simple, then serialized JSON values would generally incur performance hit for search operations. Breaking out the data into separate columns could be better indexed.

No, there is no whatsoever performance hit because the identifiers are in separate columns in the same table, and they can be indexed normally. There will be a performance hit if for some future requirement all the ids column are exhausted, they start using the generic key with complex serialized objects AND they want the generic key to be searchable. At that point they can simply add another id column if it is really necessary.