Hacker News new | ask | show | jobs
by kbenson 3677 days ago
There are so many ways to accomplish this other than what you did. I assume there was a reason for your choice, but it would have been so much easier if you could just create a separate table that matched the same primary key as the customer record, and contained a single other field, the data required to be searchable. Easy to join and search, easy to insert and update.

> reinsert them into a real schema

That sounds sort of like you decided to fix a bunch of problems at the same time...

1 comments

There are so many ways to accomplish this other than what you did. [...] That sounds sort of like you decided to fix a bunch of problems at the same time

A real database with real schema involves a series of guaranteed logical relations. Each violation of these logical relations tends to result in a different kind of problem (if you have a relation requiring pairing cars and drivers, you could have the problem of cars without drivers and drivers without cars. The potential logical problems multiply as the effective schema grows, without you have explicit schema or not).

So basically the move of using a real schema fixes a wide variety of real and potential problems compared to ad-hoc solutions. There are many ad-hoc solutions but since these aren't guaranteeing logical relations, such solutions tend to have holes the appear later.

So the gp may have been forced to use a real schema based on the multiplication of problems or they may have just done it because it was the right thing.

Every decision is a trade-off between up-front work and later eventual work. Given "The law changed at one point, and we were legally bound to be able to locate a customers record by a piece of data in a blob.", just a few of the possible solutions might include:

- Move to a fully relational schema:

- - Pros: You can leverage the power of the database to efficiently index, search, and aggregate any specific column without too much trouble.

- - Cons: Schema change may be expensive. Requires processing all data into a new structure. Requires massive changes to large swaths of your data handling routines. Must be done all at once, or you need to run both systems in parallel for a while during conversion. In this case, took 9+ months.

- Add a field on that table:

- - Pros: Simple. Requires very little change in processing routines. Efficient. Can rely on DB typing to require data be present for a record.

- - Cons: Depending on database may require excessive downtime while schema is updated.

- Add a table that tracks this field and links to the main record.

- - Pros: Simple. Requires relatively little change in processing routines. Efficient. Zero downtime. Can fill old records with background process.

- - Cons: Hard to enforce that the data exists for every record. Can be mitigated with report generated for records without this link.

Now, given those choices, I would say the correct choice in any particular situation depends quite a bit on external constraints. Do you have 9+ months and the free developer time to essentially redesign what may be large swaths of your back-end? Do you have assurance (tests, language features, etc) that you won't increase enough bugs to negate the benefits of moving to a fully relational schema? Is there a looming deadline on when the works needs to be completed by? Is downtime not really an option, and/or is building a parallel data store for the migration not feasible? How useful is normalizing the data expected to be in the short, medium and long term?

Like I said, I assume they had a reason for the full relational migration. I was just pointing out that there are easier solutions that in some cases fit business needs better than that. If your company goes under because your large migration project stole focus and manpower from other needed projects when there were simpler solutions available, then you made the wrong decision, period.

Broadly, software is about tradeoffs.

But the relational model is essentially so good an approach that if you are building something that's close to a relational model but not fully relational, you're almost certainly creating more present and future potential problems than if you moved to fully relational model (edit: and that's not saying other approaches are bad if you have a problem a ways away from the relational domain).

Neither of us know the particulars of gp's problem and I so couldn't say if you're alternatives are better tradeoffs. However, I would wager if someone wound-up adding several fields and tables to an existing blob-filled database, the result would haunt it's creator on many lonely nights at the office.