Hacker News new | ask | show | jobs
by commandlinefan 2102 days ago
Some people choose nosql alternatives because they've spent time analyzing the performance of a proper relational model and have determined that an RDBMS will generate too much overhead for their data load and consciously accept the tradeoffs involved in giving up automated referential integrity.

Most people, though, choose nosql alternatives because they're too lazy to learn how to model data.

8 comments

I am forever grateful that I took a full semester of database design in my undergrad. This single skill has stood with me for my entire career so far and has enabled me to figure out the root cause of many production issues. Plus people really like it when you can answer ad-hoc questions like "what inspections are still open and when were they first opened".

If y'all can understand Angular / React / Vue there's no reason to not learn databases.

Could you suggest resources (books, articles, videos, moocs or others) to learn good database design. I am picking up skills about sql but want to better understand and learn about databases. As someone who doesnt have that background, a lot of the times I am just googling for stuff and just trying out bits and pieces.
Apparently there's a 4th edition coming, but if you like the form-factor of a book-length text I'd throw a recommendation to "Database Design for Mere Mortals"[1]. I read the 1st edition from the late 90's, but I'd imagine it's still just as good. It approaches database modeling from a practical non-technical perspective, and I found it helped me learn data modeling in a software-agnostic manner, and later to influence how I talked to non-technical audiences about data modeling. I'm really glad to have found it early in my career.

I will echo what others have said. Data modeling is a force-multiplier type of skill. Combine it with a reasonable understanding of SQL and you can return a lot of value very quickly.

[1] https://www.amazon.com/Database-Design-Mere-Mortals-Hands/dp...

Not required, but knowing a little relational algebra helps queries make a lot more sense. A lot of the features like constraints and indices are very thin veneer over the underlying concepts.

At the other end of the spectrum, once you can write a few basic queries check out something like SQL Murder Mystery https://mystery.knightlab.com/

This one is super-fun and lets you practice some basic to low-intermediate skills.

Fun fact: You can do relational algebra on plain text data (e.g., csv, as well as output of commands like ls, etc.) using only shell + piping + well-known shell tools.
The Art of Postgresql. As the title suggests, it's targeted for Postgresql. However it has a full section on data modeling, another section on data types. It's packed with lots of examples and reasoning. I feel it gives you a good mix of theoretical background and hands on immediate experience. You'll find your way around what is (in my worthless opinion) the best database engine, and the pgsql specific examples will mostly translate to other relational databases.

https://theartofpostgresql.com/

I'd like to know this as well. I think you'll just have to build things (potentially horribly) and fail.

I took three semesters of database (granted, baby database classes) and I still have no idea how you can do something pretty straightforward like creating a room reservation system.

If there is a reservation beginning at 10:15 AM and ending at 12:30 PM and someone tries to book a reservation from 10:00 AM to 10:30 AM, the transaction should fail.

and before someone screams db2! yes, db2 can. but then you'd have to use db2 https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/co...

Why is this so difficult...

The documentation for Postgresql range types describes how to do exactly this.

https://www.postgresql.org/docs/11/rangetypes.html#RANGETYPE...

Edit: and if you didn't want to use postgres, you could have "starttime" and "endtime" columns and reject any bad bookings with a before insert / before update trigger.

This approach is the best and works really well if you don't need to do a join on a related table to look up information. If you need to use data outside of the current table for exclusions/check constraints, you have to write a trigger function (as far as I know).

I had to solve this recently, where the actual start/end times were stored on a related table. I'm no SQL wizard, but I'd love to share my solution in case it helps others (it might be terrible).

Note: I changed the actual tables/domain to be generic, this is a poor example and it made more sense for my usecase, but this shows general approach.

  -- Let's pretend we have these tables (awful design, but for sake of example):
  -- room <-> reservation <-> reservation_info
  -- Where "reservation_info" has "start_time" and "end_time"
  CREATE FUNCTION check_for_overlapping_reservations()
    RETURNS trigger
    LANGUAGE plpgsql AS
  $$ BEGIN
    IF (
        -- Find the newly created reservation and join it with the info record to grab "start_time" and "end_time" for check below
        with this_reservation as (
              select * from reservation
              inner join reservation_info on reservation_info.id = reservation.reservation_info_id
              where reservation.id = NEW.id
          ), bookings_for_timerange as (
              -- Select every other reservation, where the reservation is happening in the same room
              select * from reservation as other_reservation, this_reservation
              inner join reservation_info as other_reservation_info
                on other_reservation_info.id = other_reservation.reservation_info_id
              where other_reservation.room_id = this_reservation.room_id AND
              -- And the timerange from start to end overlaps the newly created record
                  tstzrange(this_reservation.start_time, this_reservation.end_time) &&
                  tstzrange(other_reservation_info.start_time, other_reservation_info.end_time)
          -- Get a count of all the records, it should only be 1. If it's greater than one, there's overlap.
          select count(*) from bookings_for_timerange
        ) > 1
    THEN
        RAISE EXCEPTION 'Room is already reserved during this time period';
    END IF;
    RETURN NEW;
  END;$$;
TBH it's a bit more verbose and less performant than need be, but hey, if it works, rock on! But don't use 'select *' in production code.

Consider next time something like

    if exists (select from sometable t1
      join sometable t2 on 
        t1.resource_id = t2.resource_id 
        and t1.res_id <> t2.res_id
        and tstzrange(t1.start, t1.end) && tstzrange(t2.start, t2.end)
      where t1.res_id = new.res_id )
then ... raise exception
hold start & end time in normal DB columns, use business logic for your rule enforcement, have a query to identify rule violation. Maybe add constraints on your warehouse data but I probably wouldn't on the operational DB. I could be convinced to do so for the PoC because it might make sense, but what if you're managing thousands or millions of scheduled "resources" like a chain of libraries, or hotel rooms? What if you need to coordinate "leases"? DB-level is an OK place to start, but also one of the hardest to tease apart when you need to scale.

This is kind of the point of data modeling; you start with an idea, make it all pretty 3rd-normal form, define your projected indices and checks and constraints, then you mess it up a bit where it makes sense or you've had experience in the past.

YMMV

Exclude constraint on a GiST index?

https://stackoverflow.com/a/51247705

Just about anything written by C. J. Date would be worth your while in this regard. It will give you a solid grounding in the relational model, which you can then apply to any RDBMS. Here's a good list to start with:

- Database in Depth, O'Reily (2005)

- Relational Theory for Computer Professionals, O'Reilly (2013)

- SQL and Relational Theory, 3rd Ed, O'Reily (2015)

- Database Design and Relational Theory, 2nd Ed, Apress (2019)

If you can find a copy of this, I'd recommend it "Conceptual Schema and Relational Database Design: A Fact Oriented Approach" by Nijssen and Halpin. They presented a richer model (NIAM) for designing databases than the ER model by Codd/Date, though CJ Dates book is useful to. Both of these are out of print now, but anything by Halpin or Date would be a useful addition. To design databases relational theory and normal forms are the language you'd need to understand - they're not very hard though seems they aren't as widely taught any more.
Pardon my ignorance--is inspections some concept that relates to database management, or are you referring to a query like "select * from inspections where status = 'open';". Honestly asking.
I believe OP is just referring to the ability to run adhoc queries. Typically nosql solutions are built to be performt for common access patterns, at the expense of being difficult or impossible to query in unplanned ways. SQL DBs are very good in that regard.
Thanks! And yes, I agree. The ability to write ad hoc SQL is a great skill.
Totally. I’ve always admired SQL wizards and their ability to (quickly) produce the most optimal queries given the relatively limited syntax (compared to what you can do with frameworks like, for example, LINQ today).
Ah yes, inspections in this context means Safety Inspections, aka a business rule or thing. Industry is Retail / Grocery. Inspectors inspected food at stores on day one and Stores had 5 business days to “complete” the inspection. Internal audit wanted a list of “defaulted” inspections so they could go harass the store people to get their things done.
Absolutely agree. Learning data normalisation was up there with Typing 9 (touch-typing class in grade 9, not Hindley-Milner whatever) in terms of long-term usefulness.
I feel the same way. It’s been nearly 30 years since my undergrad database class and I used what I learned in that class yesterday mocking out a schema for a personal project. We didn’t touch any real dbms in the class but rather did most of the class work using relational algebra.
Is "inspections" a typo for "transactions" here?
I speak as someone who's worked for 30+ years on data modelling. Every time I encounter some mongo or other non-relational DB where the company jewels (the data) are stored with no documentation, no data model etc and stuff is just shoved into these stores willy nilly it makes me weep.

Start off with relational, if perf is a problem then look at denormalising, after that then consider other alternatives for special cases. But to see run-of-the-mill apps with no near future scalability issues jumping right into mongo et al from day one makes me want to run away.

> have determined that an RDBMS will generate too much overhead

I think the read/write overhead is mostly a function of schema design, rather than an intrinsic property of an RDBMS. Denormalized schemas have similar performance profiles to document-oriented storage.

Moreover, mainstream SQL databases like Postgres are getting better and better at indexing jsonb fields, indexing time series data with BRIN, rudimentary full-text search, offering a one-size-fits-all storage system that may not be the best at everything, but will be good enough to support a growing business to its next funding round.

for text search the rum index is quite robust. my guess is that rum (or something like it) will be introduced into the pg core soon. we index many terabytes of pdf files excellent performance.

   https://github.com/postgrespro/rum
also, postgrespro are behind the json/b indexing.
Those two aren't mutually exclusive.

If you don't know how to put in the hard work to get something like you want 'the right way' and someone keeps advertising that they have 'another right way' that can do all the same things, you can take the low road and by the time you see the trap you've walked into it's difficult to go back, or for some people even admit that mistakes were made.

There are all of these time horizons in Software Engineering where you externalize the costs of things to your successors. Several classes of bad decisions start to fall apart at 18 months, and unless those started at the same time you did, you might be able to grin and bear it until it's time for a new job anyway. I think there are classes of process or architecture missteps that might go 3 years (less confident of that time interval) before it comes off the rails, and for a lot of startups that could be the proximal reason you flame out, even though it's the symptoms that are more visible.

Or, like myself, they chose NoSQL because it was much better integrated into their development and production cloud infrastructure than any alternatives.
Example or it didn't happen
AWS appsync, firebase, etc. Both nudge you to use their proprietary nosql database through integrations.
I think relational databases have largely failed developers because they don't provide the features they actually need.

A common question that comes up is how to do zero-downtime schema changes. The answer is that there isn't one. A correct implementation would store each schema version in the database and when an application connects, it would specify which version it's speaking. The developer would supply a mapping on how to make vX data available to a vY program. But no relational database supports such a feature, so people are forced to tread carefully -- look at all the deployment software that exists to attempt to find changes with database migrations and treat them differently. Look at all the software people have written to even apply those migrations. It's staggering, all because in the 70s when these systems were designed, the thought of deploying your code multiple times a day was unheard of.

Another problem that comes up is transactional isolation. Most engineers, and even casual practitioners, "know" that transactions exist for cases where you want to perform multiple operations atomically. But very few of these people are running the transaction with an isolation level that provides those guarantees. They will write their program assuming that transactions are strictly serializable, but in fact they are using "read committed" or some other weak form of isolation that totally breaks their assumptions. Then the database gets into a weird state, and people are baffled as to how that could happen. The actual implementation is so different from the CS assumptions that it's not even something that crosses people's minds, and "read committed" behaving as "read committed" looks like a heisenbug in the rare case they actually notice what's going on. That's super bad.

These underlying problems have nothing to do with SQL or NoSQL, though. NoSQL can smooth over schema incompatibilities a bit (perhaps the schema specifies everything as a "field tag" instead of a name, so you can safely rename columns, or perhaps everything is "optional", so the application can detect that it's reading an old record when it's missing; you can also easily build your own versioning system on top because the data doesn't mean anything to the database engine itself), but you can still get yourself into a lot of trouble. The NoSQL databases also have a horrifying transactional cleanliness record. Postgres may be "read committed" by default but at least you can get real transactions if you ask for them; good luck ever getting them with some NoSQL databases.

I guess where I'm going with this is that database engines are focused on the wrong problems. The relational model is very good. But it's something you can bring yourself once you have a way to transactionally read and write keys with opaque values. You can also add indexing at the application layer, or triggers, or encryption, or auditing, or RBAC... whatever, it's just code. At the end of the day, picking a relational database just gets you a VERY opinionated set of defaults that is unlikely to be what your application needs and nearly impossible to change later... but the defaults are juuuuust good enough that nobody makes a real effort to change them. Meanwhile, we ignore the problems that actually plague developers; schema versioning, unusual data types (time series, large blobs), availability, replication, etc. The operational concerns have been ignored for decades, and it's slowing everyone down.

People are right to be looking for alternatives, even though we know that most of the alternatives have even worse problems. Someday, somewhere, someone will get it right.

Good post. I think it is also worth mentioning that the relational model has a few intrinsic performance issues and gotcha's that require some uncomfortable and creative wrangling.

Specifically, safely handling concurrent writes to different rows that are related to each other requires careful consideration, and when you have a join with clauses on both tables it is not possible to have perfect efficiency without multi-table indexes.

I think that these are not intuitive, but solvable. The troubles arise when an application doesn't consider these issues until after it hits scale (which there is sufficient concurrency and enough millions of rows for these problems to rear their heads).

All depends what you're doing, right? Is it a small stupid app with a short lifespan? Is it "just" a proof-of-concept? Is the company likely to survive long enough that technical debt matters? (oblig. HN/SV/startup comment) Is there a ridiculous imbalance between data ingestion volume vs. reads/retrieval? All good arguments for alternative/noSQL approaches.
Yes. And then you‘re app grows and now you have an undocumented NoSQL database that would better have been a RDMS...

There‘s nothing wrong with starting the proper way.

I am imagining an industry where half of developers don't know how to design a relational data model. Scary if true.
I'd be surprised if it were only half.