Hacker News new | ask | show | jobs
by pyrophane 1860 days ago
I think the biggest mistake some startups make wrt their data model is not really thinking about it at all. The data model winds up being the byproduct of all the features they've implemented and the framework and the libraries they've used, rather than something that was deliberately designed.
6 comments

At the other end of the scale is a data model designed for extreme extensibility.

If you ever hear anyone bragging that their data model is entirely metadata driven, and can be used to model anything - without changing the database - that's a huge red flag, as is looking in and seeing tables called "element", "business object" and the like.

Unfortunately, for most serious Enterprise systems, a degree of flexibility is essential. It's being able to pick the right balance between hard coding first class domain objects into the database and allowing for extensibility that IMO marks the truly expert system designer.

One underlying reason for this is that DBMS systems have an unnecessary source of complexity: They have a separate Data Manipulation Language (DML) and a Data Description Language (DDL). They really ought to be unified, but few (any?) mainstream SQL databases are homoiconic in this way.

E.g.: It should be possible to take a query definition, request its columns ("schema only" execution), and then insert or merge the columns into a table definition somewhere. Something like:

    SELECT SCHEMA( SELECT * FROM "blah" ) 
    INTO "tablename"
When a black market is formed, it's a sign that there is an unmet demand. When you see the exact same "wrong" design pattern turn up over and over, it's a sign that the underlying system isn't meeting the needs of the developers.
it sounds like you're making an interesting point but I'm afraid I don't follow, could you elaborate?
The "bad" design that keeps cropping up over and over is the second system effect: Tables, relationships, and columns defined as data in a few simple tables, instead of being defined explicitly in the SQL schema as expected. This is less than optimal for lots of reasons: duplication of metadata, inefficient query plans, no foreign keys, inability to use most kinds of indexes effectively, etc...

However, the need is real: the ability to easily extend or generate table schemas without having to switch languages. You can argue that this is "not that hard", but you'd be absolutely wrong. It's obscenely difficult. I've tried, failed, and have given up. One use-case I had was automatically generating tables for importing data from PowerShell. My goal was to be able to write something like this:

    Get-Process | Export-Sql -ConnectionString '...' -TableName 'Processes'
And have the "Export-Sql" command automatically generate the table schema on the fly based on the input columns it sees. I even wanted to be able to represent object hierarchies as sets of related tables with parent-child foreign key relationships automatically put in. I got a proof-of-concept working, but there were just too many edge-cases. Things like maximum lengths for key or index columns, maximum row size, inability to switch column types on the fly, etc...

So what to do other people do? They also give up and resort to defining a single table that has the columns: "TableName, RowId, ColumnName, ColumnValue" and call it a day. I mean... what other options are there? Bang your head against the wall for months trying to deal with idiotic things like identifier length limits? Correctly escaping arbitrary input strings? Generating hundreds of distinct commands that cannot be parametrised and hope you don't have a SQL injection vulnerability lurking in there somewhere? It's nuts!

Have you seen just how much code it takes to take an arbitrary table, one with dozens of foreign key references, several filtered multi-column indexes, and views that depend on it, and then insert a column in a specific position? You have to drop everything, copy the table, rename, and then recreate everything. Doing this in code would be... I dunno... a few hundred thousand lines? That's absurd. You're not doing it. I'm not doing it. Microsoft did it once for SQL Server Management Studio, and I bet they're not rewriting that code in a hurry!

But go back to the "bad" schema example before: Is it really that bad? What if the database engine had the ability to store common prefixes just once, instead of repeating them for each row? What if the schema looked like this:

    Database, Owner, Table, Column, Value
Starting to look familiar? A bit like [server].[database].[dbo].[Table] perhaps, familiar to every user of Microsoft SQL Server?

There's a natural hierarchy for describing the data, that lends itself well to being represented as a B-Tree, along with the data itself! That's what the "bad" schema is doing: it's representing the data with the most natural representation! It's not wrong at all!

The problem is that database engines have all sorts of features and optimisations that we want that isn't directly compatible with the naive implementation of the bad schema. Constraints, foreign keys, efficient storage, indexing, etc...

However, none of these features are fundamentally incompatible with an API that merely "pretends" that the data is stored in a single flat list that can have its schema updated with a simple insert. The database engine could simply factor out the schema part and the data part into different physical storage layouts, with all the usual efficiencies such as not having to repeat column names for every row.

To summarise: we could have our cake and eat it too. Database engines could be developed that use ordinary select/insert/delete/update statements to modify the schema, and have it perform just as efficiently as a database that uses clumsy statements like "alter table add column". In this world, a database schema upgrade could be as simple as a single literal "MERGE" statement!

While schema creation SQL can be a be a bit unwieldy, I'm not sure I appreciate which part is the problem or what you're trying to achieve.

Obviously you can SELECT * INTO FROM .. if you're just temporarily inserting data.

I'm not sure I see the value in automatically importing arbitrary data into a schemad database object. I think it's too complicated to be carried out by the database and should probably be done by some other piece of software and under human supervision/guidance.

Are you suggesting that INSERT/UPDATE statements also have the ability to modify objects? That sounds like it would add complexity without much gain as opposed to just running and alter table query.

I'm not saying that the process of writing schema modifying queries is painless but I'm not sure I'm convinced that we can have the cake and eat it

For your example I personally would've AWKED into a predefined schema or inserted the data as JSON

> Obviously you can SELECT * INTO FROM .. if you're just temporarily inserting data

Okay, bad example. I should have specified: Insert into an existing table, add the missing columns automatically. There are some special-cases where most database engines can automatically generate schemas, and "SELECT... INTO" is one of those few. Usually only allowed with an empty destination table.

> I'm not sure I see the value in automatically importing arbitrary data into a schemad database object.

There are lots of use-cases for this. You might not have them, but other people do all the time. Just about any large-scale enterprise software needs to be extensible in the field, for example. Think the likes of SAP or Siebel. I've seen similar problems crop up in CMDBs, job and ticket management software, etc... Famously, Jira is slow precisely because it is so bad at handling this kind of extensibility efficiently.

> Are you suggesting that INSERT/UPDATE statements also have the ability to modify objects?

Not necessarily, although that could be an option. What I mean is that changes to the table schema should be made using insert/update/delete statements, where the only "data" is things such as the column names, types, constraints, etc...

> That sounds like it would add complexity without much gain as opposed to just running and alter table query.

It would dramatically reduce complexity, removing an entire language from database engines, along with all the associated vendor-specific syntax, quirks, and limitations.

In fact, the same "schema" that is used in the wire protocol could be directly equal to the actual schema, and its update language. So if you get back a query result (with data), and want to create a copy of that schema elsewhere (e.g.: a local cache database), then you just take the "header" from the result set and "insert" it into the destination database schema. No conversion, no escaping, nothing.

Seriously: Try this as an exercise. Use Java or C#, write a select statement from a query (that has join, views, etc...), and then write the code that generates a table to cache this data in a separate, local database instance.

Do it. Sit down and give it a good go. You won't appreciate how hard this is until you do!

Now write the code to update the cache table dynamically if the source table changes. Assume that table has a petabyte of data. (That's why it's a table, it's a cache for something that's far too big to fit in memory!)

Now write the code to do all of the above with foreign constraints.

Good luck!

> I'm not sure I'm convinced that we can have the cake and eat it

There is a reason NoSQL databases exist and are wildly popular. It's not that "schemaless" is truly better so often, it's more that modifying the schema in most DBMS offerings is so fiddly that it's essentially impossible to do programmatically.

> or inserted the data as JSON

NoSQL in a nutshell! You haven't solved the problem, you've given up and resorted to schemaless tables instead, exactly the "bad" example above with all of its limitations and issues.

I've seen some good attempts at solving this problem, but they barely scratch the surface.

Secret hacker pro-tip: inner platforms and key/value pair representations do not actually improve extensibility. You always have a schema -- you get to decide whether it's explicit or implicit. Their problem isn't that their data model is too extensible: it's that it's plain old bad.
I think it’s a mistake that they don’t revisit it occasionally, and if necessary pull the trigger on a new schema + migration scripts.

Some early mistakes just can’t be solved without a do-over, and from a recent experience, it ends up being less work than maintaining a flawed schema.

This is the place I work at. The data model was designed with a narrow focus. When that turned out to not be viable, the company moved into an adjacent and much larger market. But the names never changed, and the subtle differences between the two worlds was never addressed. So now our application is full of terminology and restrictions that confuse our customers, and our database doesn’t match anyone’s mental model of what the application does. It’s all workable, but IMO we’ve paid (and pay) a not-insignificant price in productivity and complexity because we never took the time to fix these things.

At this point a ground-up rebuild is probably going to be no slower than trying to update the existing app. Neither will be cheap.

Whatever you will do, make sure to have a plan to deliver it peace-meal in successive releases. I am currently in the process of reworking the data model of a moderately complex application, and it has proven to be very risky and to take an insanely long time to do it all at once.
I hear ya. Both would probably cost the same, rebuild probably more, today, but it’s still cheaper in the long run. Unless the business goes back to what it was, it will keep diverging from the current terminology. No manager wants to hear it, but taking a 3-6 month breather to address tech debt like this is worth its weight in gold.
Practically speaking the data model creates very little value. If your startup is trying to make money, features are more important than design for a good stretch.

There comes a time to refactor and fix your architecture but it's usually not at the beginning.

You can design a data model if you don't know what you're building. And no startup really knows what they're building.

> Practically speaking the data model creates very little value.

That can be said about any cost centre, but you don’t have to drag managers kicking and screaming to get them to buy fire insurance.

Practically what it does is allow the company to keep up velocity and not be distracted putting out fires everywhere.

Of course building features is the team’s entire reason for existing. But there is no advantage to defer refactoring to some later date. The longer you wait the more painful it gets.

Chances are the time never comes, once progress stalls and the company isn’t out of business yet someone will have the brilliant idea to rewrite everything from scratch, which is just lighting money on fire with extra steps.

I worked at a startup that rewrote major parts of their product three times before the first Series A check cleared (while I worked there at least, I think they had another rewrite before my time). The company is alive and well because they didn't waste time and money on pondering architecture to solve a very difficult class of problems when they didn't know which problems were worth money yet.

They did do some smart things working around such known-unknowns, like operate as a consultancy for several years while building out the tech stack that would ultimately become the product catalog. That way they didn't have enormous risk associated with rewrites since all users were internal and zero projects actually needed feature or ABI with the stack.

The problems they had when I left were obvious, but the data model wasn't one of them. I'll stand by what I said above (ignoring the typo) - you can't specify a data model for a problem you don't know. And no startup really knows what problems they are going to solve when they start.

"The next series is the time to fix your mistakes from this one".

No matter what, startups break as they grow. You will need to fix things. Just make sure they're not sooo bad that you can't do it in a timely/affordable way.

Oddly, I feel the opposite is also a trap. A carefully crafted data model often stalls out compared to a grown one.
I'm not aware of a single project, ever, that has gotten their data model right up front and not had to iterate on it countless times as it grew/evolved. Except maybe NASA. Even the best early data models fail after years of updates and evolution.
This is the rationalisation I get every time when I tell companies that their data model is a mess. Never mind that neither I nor the parent said anything about doing it up front.

Of course they have to iterate, the problem is that there is no deliberate effort anywhere, it’s just piling more crap on top of old crap and deluding themselves that they are some kind of lean, agile visionaries because of it.

No one gets it right, and it's just grandstanding to pretend that "deliberate effort" is the distinguishing difference between good or bad data models. Unless you're dealing with highly specialized technical scenarios, most software is written to solve ambiguous and nebulous business problems that even the business doesn't necessarily understand.
> most software is written to solve ambiguous and nebulous business problems that even the business doesn't necessarily understand

Obviously if you're not deliberate about what problem you're trying to solve in the first place, no amount of deliberate effort will produce a good data model. Designing business processes must be done with the same deliberate effort, and also need to be constantly refactored.

Have you worked at a start-up? Or in a field new to you, or new altogether? No amount of well-intended deliberate design will make up for domain expertise. Ever. The best you can do is mitigate as much future refactoring as possible, but eve than that's not always the best use of time.

> Designing business processes must be done with the same deliberate effort

That's not always realistic. Businesses processes solve existing problems, sometimes tackle new problems, but just like data models hindsight is 20/20. And just like data models, business processes are constantly evolving and never done.

> Have you worked at a start-up? Or in a field new to you, or new altogether?

Yes, to both questions. And I still stand by everything I said.

In fact both cases made me form this exact opinion. Being all hand-wavy and haphazard about our business processes is precisely and without a doubt what killed it. The second example was from a big regulatory compliance thing that had taken the particular industry by surprise, again because nobody in that industry can be bothered to understand their own business processes, but cargo cult everything. Managers and domain experts would throw PowerPoint mockups over the fence and tell the devs to "just make it clickable and put it on the web site". I had to constantly fight them to get them to take their own jobs seriously. I nearly burnt out and had to quit shortly after, but we actually released on time and though not perfect, came out miles ahead of the competition.

> And just like data models, business processes are constantly evolving and never done.

That makes deliberate design of business process more important, not less. A culture of hand-waviness is precisely why businesses are still caught by surprise by things like regulations that they are given years in advance to implement. If you don't know where you are, you don't know what you're supposed to pivot to.

Yes absolutely. From day 0 everyone should already know how to do data modeling and perform migrations. If you don't and call yourself and engineer, you have no business starting a company.
I think this is largely a consequence of microservices. What is the "data model" here? You're thinking database, to a microservice that's a repository implementation detail.
I don't think it's a consequence of microservices, I think the problem is as old as the programmable computer. I'm the old paper and even clay tablet processes had to be constantly refactored too.

And a common argument I hear from microservice fanboys is that their old monoliths were suffering from high coupling and low cohesion and that microservices help with that. But I don't see why they couldn't have just refactored their monolith to address the problems they were having.

I hope this is sarcasm poking fun at why microservices are terrible and small companies should stop LARPing FANG and avoid them?
I’m not saying micro services are terrible at all. But I think the low coupling leads to a lot of “problems” when you try and re-aggregate data for analytics purposes when fundamentally micro services are meant to enable teams to work on them without worrying about how some other micro service is storing its data. It’s a trade off you make.
> fundamentally micro services are meant to enable teams to work on them without worrying about how some other micro service is storing its data

What exactly is it that microservices bring to the table to achieve that, that doesn't already exist out of the box in every single language in common business use?

My rule is "make it easy for us to fix our mistakes".

Even when we've spent a bunch of time planning out data, but we still got a lot of things wrong in hindsight. The reality is we didn't know enough about our product direction to make any truly informed decisions.

In general, poor decisions seem to stem from working in ambiguity about product, rather than poor technical decisions.

> we didn't know enough about our product direction to make any truly informed decisions.

Bingo. Very few programmers are working on projects that they have many years of domain expertise in, so the data models we come up with are always going to be limited to our experience in the here and now. It's one thing to organize highly technical code around things like graph algorithms are combining multiple b-trees in a single operation to lookup data, it's a whole different thing to tackle line-of-business problems where things are not so well defined (even to the in-house domain experts).

Yes, do migrations well to make it easy to fix mistakes. Don't make vague stupid data models to try to avoid making mistakes by being noncommittal.
This is a direct result of "move fast and break things", your database schema is the first thing broken and it never recovers.

It's baffling to me that for many companies I've worked for, their data model is basically 100% tech debt that can never be fixed because the cost is too high.