Hacker News new | ask | show | jobs
by SulphurSmell 1458 days ago
This article is informative. I have found that databases in general tend to be less sexy than the front-end apps...especially with the recent cohort of devs. As an old bastard, I would pass on one thing: Realize that any reasonably used database will likely outlast the applications leveraging it. This is especially true the bigger it gets, and the longer it stays in production. That said, if you are influencing the design of a database, imagine years later what someone looking at it might want to know if having to rip all the data out into some other store. Having migrated many legacy systems, I tend to sleep better when I know the data is well-structured and easy to normalize. In those cases, I really don't care so much about the apps. If I can sort out (haha) the data, I worry less about the new apps I need to design. I have been known to bury documentation into for-purpose tables...that way I know that info won't be lost. Export the schema regularly, version it, check it in somewhere. And, if you can, please, limit the use of anything that can hold a NULL. Not every RDBMS handles NULL the same way. Big old databases live a looooong time.
11 comments

"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)
This man has clearly never seen our database schema.

Show me either flowcharts and/or tables, it doesn’t matter, I’ll continue to be mystified.

our devs created smash hits such as

  NounFunction
  NounFunctionTemporal
  NounCollectionFunction

  FunctionNoun_Function
  FunctionNoun_FunctionItem

  Noun_Collection
  Noun_Collection_Function
  Noun_Collection_FunctionItem
i've taken out duplicating combinations where different teams wanted same things over time, but never looked up what others had done before and decided to spin their own...
Don’t forget

Noun_Collection_Function_20200406 Noun_Collection_Function_20200320_Backup

And

Noun_Collection_Function_ForJim

I would be having words with the DBA about this. If there is a DBA. No DBA I would hire would allow this silliness. At least not in the actual application schema.
> At least not in the actual application schema.

And that's how the scratchpad schema became mission-critical ;)

Hey, at least you don't store gender in the address table, where one person can have multiple addresses.

https://news.ycombinator.com/item?id=27842820

That will explain a lot probably about many things!
He said usually.
This is going on my wall. Thanks so much.
I got from a comment on HN a few months ago, so also thank the commenter that I don't remember.
> Realize that any reasonably used database will likely outlast the applications leveraging it.

I love this statement. It's true too, having seen a decades-old database that needed to be converted to Postgres. The old application was going to be thrown away, but the data was still relevant :).

About a decade ago I worked for an insurance company. It was an offshoot that was spun out of of another insurance company from another state, which itself was decades old. As best as I could infer from my vantage point, my expertise at the time, and the spare time I was willing to investigate the matter, the database schema and a good chunk of the core data tables were first created in the late-80s on a mainframe and had outlived 4 or 5 application rewrites and (at least) two SQL variant migrations. I'm hand-waving exact details because nobody from the original company or that time period was still around even prior to the corporate split and so there was nobody who could answer history questions in detail, but that's also a testament to how persistent data can be. There was one developer from the parent company they slapped with golden handcuffs who knew where most of the bodies were hid in that software stack that enabled decent productivity but even she was lacking a solid 15 years of first-hand experience of its inception. To the best of my knowledge that database is still in use today.

Databases in heavy use will not just outlast your application, they have a strong chance of outlasting your career and they very well may outlast you as a person.

Are you me? LOL
I think this is and will continue to be a common use case. I'm very thankful for these applications that the data was still stuck in a crusty old relational database for me to work on top of as I built a new application.

It's going to be interesting when this same problem occurs years from now when people are trying to reverse schemas from NoSQL databases or if they become difficult to extract.

The only sticking point is when business logic is put into stored procedures. On one hand if you're building an app on top of it, there's a temptation to extract and optimize that logic in your new back-end. On the other hand, it is kind of nice to even have it at all should the legacy app go poof.

>business logic is put into stored procedures

This is a double-edged sword. I have seen massive business logic baked into stored procedures...so much so, that the applications themselves are rather slim. If this stored procedure code is properly versioned and otherwise managed, this is not entirely bad. If the data model is sound, I don't worry that much...stored procs vs 100KLOC of Java code? I can tell you what is easier to migrate. The other side of it is that stored procedures can also serve as an API into the actual database. I built a system once (90's) where the developers never accessed the actual database tables directly. They always called stored procedures. The advantage here was that we could tune the snot out of the database without a code change/build/deploy. It also allowed the DBA some control over poorly written, runaway queries. YMMV. I think today I probably would try to keep the database as vanilla as possible.

> YMMV. I think today I probably would try to keep the database as vanilla as possible.

Sure, YMMV.

In any non trivial dataset a lot of fields are effectively computed. For example merged entries: in order to get correct and whole data, one needs to consult some merge mapping, which can be easy to forget and tricky to get right - you have more than one relational identifier. This is not strictly business logic, but rather data assembly logic.

Similarly, a value can easily be spread over multiple fields (and tables) and it is crucial from data integrity standpoint to always update them in tandem if applicable. Again, this is very easy to screw up in client code, because the hidden relationship can be non-obvious. On one hand this is business logic, on the other hand violating this implicit relationship will result in non-agreeing data. This occurs for example when data represents parallel states.

Effectively, stored code acts as some kind of gateway API stored concurrently with the data. Sure, some peculiarities can be implemented with functions and triggers, but IMO those are just different sides of the same coin. In the end really depends on the dataset and what it represents.

Database stored code is a tool. Used appropriately it solves problems, used inappropriately it causes problems. If I were to design a database today, I would too try and make do without stored code, but would not try to twist data model so that it fits the relational model of RDBMSes.

Also an old timer. I’ve gone from mostly complex app and no DB logic to mostly heavily optimised DB and lots of procs. They protect the DB, make carefully crafted interfaces available and allow changes to the implementation.

Except for eg infinitely scalable cloud data stores like Google’s, or for ML where it’s just massive data and you need a dumb-ish store of many GB of parquet.

>protect the DB

I share this sentiment. The apps will come and go, the real value is in the data. If the database can cover its ass, I am less concerned about ham-fisted developers randomly messing it up with ill-conceived DML. It's not that they are malicious...it just happens. I have seen devs that code in Erlang, Haskell and even Assembly...run in terror at doing some SQL. It's weird. Trust but verify. And hire persnickity, passionate DBAs.

> and allow changes to the implementation

From my perspective this is a negative. You don’t want your queries to change after deployment.

Procedures are also quite hard to type.

If I publish a stored proc as an API, it leaves me free (within reason) to alter the actual SQL supporting it. Like modern API design, as long as you don't remove functionality, then change it as required.
A colleague of mine has a very strong opinion that any candidate who mentions using stored procedure in an interview without immediately disparaging it is an immediate no-hire. I sometimes wonder what kind of experience formed that opinion. (I personally never worked at a place with significant use of stored procedures.)
Some asshole DBA probably hurt his feelings way back.

If you have DBAs with organizational power and their shit together, they tend to ask awkward questions and tell you the baby is ugly. It’s easy for developers to do stupid shit and blame solar flares or whatever, but stored procedures are the DBAs realm, and the DBA knows who gets blamed when it blows up.

Like anything, there’s ups and downs. But if you can commit to a DBMS platform for the life of the app, there are compelling reasons to use it, even if it pisses off the devs.

Your colleague is dumb. Stored procedures have advantages and disadvantages. For certain data processing operations it is much better to perform everything inside the database than transfer large amount of data back and forth to some external application.

But I have also seen organizations with the policy that any operation touching base tables should be encapsulated in a stored procedure. This makes development extremely cumbersome, especially if some DBA is gatekeeper for the stored procedures. Something like this might have burned your colleague.

The basic idea is the same as in the article: Data is more important that apps and stored procedures really are just app-code.

Stored procedures have the downside that they often work only in one vendor's database.

> Stored procedures have the downside that they often work only in one vendor's database.

Doesn't really matter, because no one is writing database-agnostic SQL (unless that's part of your product). Any non-trivial implementation is going to require the use of proprietary SQL.

The NULL issue is so true. We migrated a large database from Oracle to Postgres. It took 2 years. By far and away the biggest issue was rewriting queries to account for the (correct) way Postgres handles NULLs versus how Oracle does it.

Also, in my experience, the database is almost always the main cause of any performance issues. I would much rather hire someone who is very good at making the database perform well than making the front end perform well. If you are seeking to be a full stack developer, devote much more time to the database layer than anything else.

>the database is almost always the main cause of any performance issues

I would be careful with the term "cause". There is a symbiotic relationship between the application and the database. Or, if talking to a DBA...a database and its applications. Most databases can store any sets of arbitrary information...but how they are stored (read: structure) must take into account how the data is to be used. When the database designer can be told up-front (by the app dev team) considerations can be made to optimize performance along whatever vector is most desired (e.g. read speed, write speed, consistency, concurrency, etc). Most database performance issues result when these considerations are left out. Related: Just because a query works (ie. returns the right data) does not mean it's the best query.

As a corollary to this, there is the infamous:

Client/front end guys: "you need to fix this query, it takes 10 minutes to run and the front end is timing out"

DBA: "err, this query wants to return 50,000 rows"

C/FEG: "yes, and?"

DBA: "what in blue blazes is your UI, or a user, going to do with 50,000 rows?"

C/FEG: "oh - we hadn't thought about that ..."

Ok, story time. Worked on a near-real-time system that processed millions of rows an hour. Devs routinely did a "select " and pulled 100K rows into an array on a local client. And they sorted it there*. The DBAs never looked into these because a)no resource issue, b)no perf issues and (most relevant) c) no one complained. One day this silliness was identified, and a quick meeting with the devs and DBA resulted. SQL 101. As if by magic, the client app was suddenly 3X more performant. The devs were applauded and there was much happiness in user-land. Only the SVP knew the truth...we left the sun shining on the developers. It was a rough week.
It’s like. If the database doesn’t perform well, nothing else performs well either.

If your database is great, at least you have the option of a fast backend.

>Also, in my experience, the database is almost always the main cause of any performance issues

More generically, state stores are almost always bottlenecks (they tend to be harder to scale without some tradeoff)

> As an old bastard, I would pass on one thing: Realize that any reasonably used database will likely outlast the applications leveraging it.

I’ve been working with and on databases for a long, long time, and I’ve even written about things I think people should know about if they want to do this, yet I never came up with such great insight. This is so true it should be engraved somewhere. Hats off!

Thanks. Scar tissue sometimes breeds insight. In further conversation on this phenomenon, I would argue that "long lived databases" are not so as result of brilliant design. Rather, it happens because the database itself is neglected and largely misunderstood, and gets less investment. And they live on and on...managers come and go...no investment. And then, years later, some poor bastard is stuck with a hideous mess that can't go anywhere. Don't let this happen to you.
The article left out one of the most fundamantal topics of databases--clustering of data in storage is everything. Examples:

1. If you store data in rows it's quite fast to insert/update/delete individual rows. Moreover, it's easy to do it concurrently. However reads can be very slow because you read the entire table if you scan a single column. That's why OLAP databases use column storage.

2. If you sort insert data in the table, reading ranges based on the sort key(s) is very fast. On the other hand inserts may spray data over over the entire table, (eventually) forcing writes to all blocks, which is very slow. That's why many OLTP databases use heap (unsorted) row organization.

In small databases you don't notice the differences, but they become dominant as volume increases. I believe this fact alone explains a lot of the proliferation of DBMS types as enterprise datasest have grown larger.

Edit: minor clarification

I heard about Flywaydb today, which appears to be an open source database versioning tool. Pretty interesting! https://flywaydb.org/
Pretty open-source, until you need "premium" features like "rollback" :/ (headwall)
At multiple companies I worked for, they concluded the backwards schema changes are not worth the risks and testing overhead. It's usually quick enough to issue a hotfix.

With rollback, you always risk losing data - remember, you're doing it when something didn't go as you expected. What are the odds, the rollback will break something further?

These were all Postgres shops, so schema changes within a transaction - that can be rolled back safely if it fails in the middle.

Oh that's unfortunate. No thank you.
This is one reason that ORMs which wish to own the database schema make me uncomfortable. How much fun is that schema going to be years down the road when that ORM is out of fashion, but you still need an app working with that data? Some are better than others at doing things in a sane way, of course.
Doesn’t really matter though? Even if the ORM is changed, the actual schema is still in the database.

I’ve migrated ORM several times, and the only thing that changes is the entity definition. The database remains the same.

Or even worse, when the ORM is written in a programming language your organisation no longer uses and is part of codebase that is no longer under development.
> I have found that databases in general tend to be less sexy than the front-end apps

I don't know if there is a single soul who believes this. If you are designing a database, it is much more cooler than front end apps.

I think they are wonderful (from the Codd and Date days...) but mostly everyone else disagrees.
I agree. I have some kind of design hierarchy. Database -> Architecture -> Services for outside consumers -> Backend -> Frontend. Things coming first must be designed more thoroughly as they're likely to live longer. Proper database design is paramount. Spend as much time as necessary. Iterate before going live as long as necessary to ensure that design is sound. Because it's so much harder to change database later. Trivial changes often require huge efforts.
Rob Pike’s 5 rules:

https://users.ece.utexas.edu/~adnan/pike.html

Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

> And, if you can, please, limit the use of anything that can hold a NULL.

I'm curious: what's the alternative to NULL? I'm struggling to think of a database where NULL wouldn't be super useful. It feels like NULL as a concept is almost required, but I think you're suggesting that's a faulty assumption.

Would love to hear more about this.

The article probably means: Define anything as non-nullable which can be non-nullable. Unfortunately SQL defaults to nullable, so there is a tendency to define too many columns as nullable. Normalization can also reduce the need for nullable columns in base tables (but you will get them back if you perform an outer join, so it is not a panacea).

But if a columns truly has unknown values, NULL's are the best ways to represent it. It is sometimes suggested to use "sentinel values" like empty string or -1 to represent missing values, but IMHO this is much worse than NULL's, since these will be treated as regular values by operators. When you have missing values, you want three-valued logic.

Gotcha. That makes sense. Thank you!
not op, but ty.
Noticed I said "limit", and not "eliminate". The concept of NULLS in an RDBMS has been discussed and argued for decades. Three valued logic is generally not well understood, and as such, it's usually skipped over. Binary logic is easy...0 or 1. It's there, or not. OFF/ON. 3 valued logic introduces a third state: "unknown". It really means that there is no meaningful answer...not yet, anyway. In simple terms, NULL in RDBMS is dangerously often equated to 0 (zero) in numeric fields. Or "space" in character fields. Neither are true. On occasion, these might behave as such...but you are playing with fire here. What's worse, you can't compare a NULL to a NULL. NULL != NULL. Which is why you often see the "IS NULL" operator used in DML for such things. What it boils down to is that your applications need to pay careful attention when digging around (read: joining) tables with NULLS. Additional code logic is often required to ensure that things work the way you expect them to when NULLS are involved. Formal primary keys cannot NULL (this is enforced by the RDBMS) but it does not stop ad-hoc clever queries from including NULL columns as part of the "where..." clause. So what do? You can tell your DBA to ensure that all columns are NOT NULL. This really tightens things down, and makes some operations a bit more sane. However, if a column value is actually not known (yet!) then one is forced to populate it with data that may not be correct/relevant. These are often called "sentinal" values and can cause a mess of their own. There are use cases where a RDBMS schema with everything as NOT NULL can make sense. In my experience, databases whose data is never (directly) seen/input by actual people can work. When a human sees a field with "placeholder value" instead of just blank space..it is uncomfortable. My advice is to really understand why something might be NULL, and don't blindly add a mess of columns to a table as NULL because it's easy. Remember, that shit will live forever. Google around for "three valued logic" and start down the rabbit hole. Long-term (think: migrating from one RDBMS impl to another) you will absolutely find that NULLs don't behave the same. Various operations may or not be consistent from one to another...and this will break your apps. The key (haha) relationships modeled in your schema...if you strip all the unimportant stuff away...should avoid NULL. The flip side of this is to do a code scan (app side) and search for "is NULL" , "is NOT NULL" in the embedded SQL. Especially when there are a lot of "and ____ IS NOT NULL and ___IS NOT NULL" and so forth. This will indicate those parts of the database that are "hot spots" for NULL issues. I have seen SQL where 80% of the DML is taken up with NULL handling of some kind.
Ah, that makes a lot of sense. Thank you for the explanation!
I personally try to strive towards a database design as if the next person were to know my address and having anger and control issues.

Fixing up a database step by step is a painful process.