Hacker News new | ask | show | jobs
by tiffanyh 1456 days ago
#1 thing you should know, RDBMS can solve pretty much every data storage/retrieval problem you have.

If you're choosing something other than an RDBMS - you should rethink why.

Because unless you're at massive scale (which still doesn't justify it), choosing something else is rarely the right decision.

16 comments

> RDBMS can solve pretty much every data storage/retrieval problem you have.

Except the most important problem: A pleasant API. Which is, no doubt, why 95% of those considering something other than an RDBMS are making such considerations.

RDBMS can have pleasant APIs. It is not a fundamental limitation. We have built layers upon layers upon layers of abstraction over popular RDBMSes to provide nice APIs and they work well enough. But those additional layers come with a lot of added complexity and undesirable dependencies that most would prefer to see live in the DBMS itself instead.

At least among the RDBMSes we've heard of, there does not seem to be much interest in improving the APIs at the service level to make them more compelling to use natively like alternative offerings outside of the relational space have done.

I've honestly never understood why people have such a distaste for SQL. SQL and Linux/Unix have been the biggest constants of my entire programming career to this point (20ish years). I always know I can count on them.
I love the data model of RDBMS / SQL. I hate SQL _the language_. Verbose, irregular, attempting to mimic English and thus making it unintuitive and next to impossible to remember. 80% of my lookups into reference documentation is about syntax for things I don't use that often.

Examples: GRANT TO, REVOKE FROM, DENY TO (oh yes, what's the difference between REVOKE and DENY? and did you know that you can REVOKE a DENY?), arbitrary requirements for punctuation (e.g., = in BACKUP DATABASE AdventureWorks2012 TO DISK = 'X:\SQLServerBackups\AdventureWorks1.bak'), dubious context-sensitive "magical identifiers" (e.g., SELECT DATEPART(year, @t)), non-composability (how do you dynamically give a part specification to DATEPART?), etc, etc, etc.

It's possible to write a novel about just how unstructured and irregular "structured" query language is.

All of your examples are vendor specific extensions to SQL. Nothing of that is part of the standard. So you are effectively hating a specific implementation which is not that logical.
Imagine hating the things that you use rather than a pdf with text, fascinating.
Yes, ANSI syntax is surely composable or dynamic: EXTRACT(YEAR FROM DATE '1998-03-07')
Honestly, I think it's because a lot of folks go their entire careers avoiding set-based thinking. Many of these folks are talented programmers so I would assume they'd become quite decent at SQL if they devoted time to it. I'm speaking more on the DML side than the DDL side here.

In my experience <20% of developers are good enough to be dangerous with SQL, and maybe 5% what I'd consider adept.

The rest range from "SELECT * is as far as I'll go; where's the ORM" to "why the hell are you using cursors to aggregate"

SQL is powerful, elegant, and reliable. With modern DB support of JSON, ARRAY, statistical functions, and much more, SQL is probably the #1 most underutilized/improperly leveraged tool today. SQL-killers have been coming out for 40 years now and (for its domain!) SQL's lead is pulling farther away if anything.

*yes there are some questionable implementations, so please replace "SQL" with "PostgreSQL" if nonstandard SQL implementations have caused nightmares for you in the past.

The reason isn’t set based thinking avoidance. I think it is because fundamentally we want to work with smart objects and not rows of data in most cases.
More often than not that's a good example of set-based thinking avoidance. The application graveyard is full of projects that felt their transactional data was too sophisticated for "dumb rows" and reinvented an inefficient relational model. I'd love to hear more about these "smart objects" - they're usually not that smart and typically can be represented relationally.

Even for use-cases like graph based models you still find Twitter and Facebook using MySQL to build the graph on top of. It’s simply heavily abstracted for the majority of engineers at those companies (where I'd wager the <20% proficiency in set-based thinking holds true) but it still fundamentally relies on SQL.

Where can I learn about the relationship between set-theory and SQL?
I'm not too well-versed in the academic/mathematical background of SQL, other than than to say it's closer to relational algebra than Set theory, and the resource to read is probably the 1970 Codd paper: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

SQL does implement the set operations of UNION, INTERSECT, and EXCEPT, but I meant "thinking in sets" more colloquially.

edit: this is more practical than theoretical but the author here actually does a nice job of discussing "set-based" thinking vs. procedural https://www.itprotoday.com/open-source-sql/fangraphs-tags-cl...

Discrete Mathematics with Applications by Epp. Chapters 5, 7, 10.

Fundamentals of Database Systems by Elmasri & Navathe. Part 2.

I wholly agree with you, but I'll say this:

When it comes to prototyping, I'm not going to fuck with something like Java-- I'm going to reach for Python. If I don't know what I'm doing to begin with, I don't need something telling me all the ways in which I'm doing it wrong.

Same goes for SQL/NoSQL. I loosely know what I need to model and may revise it arbitrarily. SQL does not lend itself to this. NoSQL was designed for it.

NoSQL is the "fuck you dad you can't tell me what to do" of the database world. SQL is your annoying dad that tries to institute curfews and won't let you go to underage drinking parties. In the end, it's the latter you're going to be calling from a holding cell, but there's a lot more fun you can have with the former.

In the case of SQL/NoSQL I think it is the other way around. In order to use a NoSQL database to its strength, you have to know all access path up front [1]. Because you are designing your table structure with all the queries in mind. In SQL on the other hand, you can always go for a normalized structure and you are pretty safe to be prepared for most requirements that are coming in the future.

Additionally, I think drafting a DDL schema is a great way to learn about a new application domain. It forces me to ask hard questions. And that improves my understanding of the domain. I guess that is similar to some people who like to prototype using Haskell type signatures.

[1] Unless you mean specifically a document store without schemas.

Hmm, it’s the other way around for me. The crazy friend may be fun to go drinking with, but I’d never rely on him for anything.
> If I don't know what I'm doing to begin with, I don't need something telling me all the ways in which I'm doing it wrong.

This is unrelated to this conversation, but this is my main beef with Rust. I love Rust (like a lot), but it's just not good for prototyping. End of non-pertinent rant.

SQL is great at what it is designed to do, and absolutely horrible for anything else. Sometimes I have had to use tools that only allow (a subset of) SQL for querying data. (Especially BI) Doing regex date validation in a json derived from a string in an sql dialect without function-definitions is horrendous. These kinds of "why the f do i have to use sql for this"-moments happened surprisingly often to me working as a data analyst.

But then I'm trying to do relatively simple things in Pandas or R, like changing column values based on multiple field value conditions, and it is a struggle requiring at least 17 brackets, while in SQL it would be a simple statement that anyone can understand after reading it once.

>Doing regex date validation in a json derived from a string in an sql dialect without function-definitions is horrendous.

Doing regex is horrendous, but doing it on SQL in a modern database is no more difficult than in a full-fledged programming language. Most modern DBs have strong JSON support and built-in regex functions

1. JSON_CAST/JSON_PARSE your data

2. REGEXP_EXTRACT() on the result, here's several date validator regex from a SO post (https://stackoverflow.com/questions/15491894/regex-to-valida...)

And that's it. In fact in many cases it's probably faster to do it natively in SQL than to export it to python or R and parse there.

The problem you are describing is probably rooted in the specific SQL dialect you had to use. Selecting from inside JSON strings and matching regular expressions should be a no brainer. And it is straight forward in PostgreSQL, e.g.
> simple things in Pandas or R, like changing column values based on multiple field value conditions, and it is a struggle requiring at least 17 brackets

With dplyr, wouldn’t this be mutate(x = case_when(… ~ y, TRUE ~ x)) or the same but with ifelse?

I haven't used much R, but how would you do something like:

  CASE WHEN
  SUM(daily_revenue) OVER (PARTITION BY department, TRIM(SUBSTR(region, 5)) IN ('North','West','Misc')) > 
  AVG(revenue) OVER (ORDER BY sale_time ASC rows BETWEEN 28 PRECEDING AND CURRENT ROW)
  AND NOT COALSECE(had_prev_month_party, FALSE)

  THEN pizza_party_points + 5

  WHEN <above> AND had_prev_month_party THEN pizza_party_points + 3

  WHEN MIN(sale_time) over (PARTITION BY department) = DATE_TRUNC('month', current_date) then 5

  ELSE GREATEST(pizza_party_points - 1, 0)

  END as pizza_party_performance_points_current

this example may be a bit esoteric but it actually draws from lots of real-world cases (comparing a partitioned sum with an overall average, checking conditions on other columns, messy data parsing and manipulation, implicit waterfall nature of CASE WHEN vs. explicit elseif, etc)
SQL is also a big constant of my programming career, I know it can do everything I need, but sometimes it is frustrating how its limitations make some simple things complicated.

For example, one of the biggest gripes I have with SQL is the fact that the table is the only output format, it doesn't allow outputting data with different cardinalities in a single query. The internal relational model of a RDBMS is extremely rich and powerful, but when extracting that data using SQL you have to leave all the richness behind and flatten the data into a simple dumb table. So as result we have to do much back and forth with separate queries based on results of previous ones (or even ugly hacks like concatenate higher cardinality data into a single field to then re-separate it in the application). A suitable successor for SQL should have a way to output a subset of the relational model with the relational part intact.

I'm not sure a successor to SQL is even necessary, rather there would be benefit to some higher level interfaces provided by the RDMBS that compliment SQL to cut down on the egregious application boilerplate that becomes necessary to achieve common tasks using SQL directly.

SQL is essentially the assembly language of the database. That makes it very powerful, but sometimes you just want a language that gives you a "garbage collector" for free, while still being able to drop down to assembly when you need additional power.

There is no technical reason why an RDBMS can't support different modes for accessing data. We are just so used to putting that work into the application that it has become a hard sell to want to move it to the right place.

> For example, one of the biggest gripes I have with SQL is the fact that the table is the only output format, it doesn't allow outputting data with different cardinalities in a single query.

With MSSQL at least you can return multiple result sets. Not sure about other database vendors.

SQL has some annoying quirks but on balance I am not sure there is the distaste for SQL that you claim. It's fine, even good in many circumstances, but also too low level for some tasks which pushes common problems onto application developers to solve over and over again or build up layers of dependencies to utilize someone else's solution. That's not the fault of SQL but the fault of the RDBMS.

Our layers of abstractions atop SQL along with approaches taken by DBMSes outside of the common relational names have shown that there is room for improvement within those tasks, able to be done natively by the DBMS, and it does not have to come at the expense of losing SQL when you need to be able to describe lower level questions. Different tools for different jobs, but little will within the RDMBS space to explore those other tools.

I don’t think that people have a distaste for SQL, I think that large companies have a vested interest in trying to sell no SQL solutions because they more tightly integrate into a “cloud” ecosystem. A database solves that problem domain too well, it’s not marketable.
Can you write an SQL query to return arbitrary JSON? Returning queried data in a nested form is a must nowadays.
> Can you write an SQL query to return arbitrary JSON? Returning queried data in a nested form is a must nowadays.

Yes. https://www.sqlite.org/json1.html

With just a splash of row_to_json and json_agg, you can JSON encode your entire query in PG.

  SELECT json_agg(row_to_json(t))
  FROM information_schema.tables as t;
Make more than a single query and nest on the application layer.
I blame ORMs. ORMs are promoted by scare mongering novice developers away from learning SQL in the first place. I'm ashamed to say I fell for it for a few years. When I eventually learned SQL it was like a fog being lifted from my mind that I hadn't even noticed before.
I tried not to use an ORM for my last project, but I you ended up rolling my own ORM for the application anyways because it was easier to keep the structures in the web application tied to the database. How else do you keep the two consistent?
The irony is, to effectively use an ORM, you need to be able to debug the SQL it generates when it’s not performing correctly (eg operating on a collection, row by row). I like to use ORM for row based, transactional operations in an OLTP. But usually look for the way to write straight SQL when doing OLAP style reporting stuff. Both of which happen in almost every business.
With SQL you kind of have two options/extremes that are unpleasant in their own way.

You either model things in a very domain specific and classic fashion. Here you get the benefit of being quite declarative and ad-hoc queries are natural. Also your schema is stronger, as in it can catch more misuse by default. But this kind of schema tends to have _logical_ repetition and is so specific that change/evolution is quite painful, because every new addition or use-case needs a migration.

Or you model things very generically, more data driven than schema driven. You lose schema strength and you definitely lose sensible ad-hoc queries. But you gain flexibility and generality and can cover much more ground.

You can kind of get around this dichotomy with views, perhaps triggers and such. In an ideal world you'd want the former to be your views and the latter to be your foundational schema.

But now you get into another problem, which is that homogeneous tables are just _super_ rigid as result sets. There are plenty of very common types you cannot cover. For example tagged unions, or any kind of even shallowly nested result (extremely common use case), or multiple result groups in one query. All of these things usually mean you want multiple queries (read transaction) or you use non-SQL stuff like building JSONs (super awkward).

If you can afford to use something like SQLite, then some of the concerns go away. The DB is right there so it's fine to query it repeatedly in small chunks.

I wonder if we're generally doing it wrong though, especially in web development. Shouldn't the backend code quite literally live on the database? I wish my backend language would be a data base query language first and a general purpose language second, so to speak. Clojure and its datalog flavors come close. But I'm thinking of something even more integrated and purpose built.

By moving access to your data-driven normalized tables into stored procedures representing a schema-driven API, you can have them both.

And you can lock down your critical production databases from arbitrary sql

Isn't that what the ActiveRecord pattern is supposed to be? (Something something n+1's and over fetching data.)
If only we had a generic GraphQL resolver for entity-based SQL schemas.

Oh wait, we do have Prisma. And it suffers from those same issues.

Have you looked at PostGraphile? It’s doesn’t have n + 1 or over-fetching issues.
I have, and I would recommend it, if it generated a Postgre schema from a GraphQL schema and not the other way around.

The advantages of GraphQL are its integration of backend and frontend workflows, and that won’t happen with PostGraphile: a frontend needing a schema change needs to go through the backend instead of the backend extending to meet the frontend in the middle.

> Except the most important problem: A pleasant API

A pleasant API is clearly not the most important business problem a database is there to solve.

The data in it is presumably the life and blood of the business, whereas the API is something only developers need to deal with.

But that aside, the interface will be SQL which is quite powerful, long-lived (most important) and, fortunately, very pleasant.

EdgeDB looks promising. Postgres under the hood so you know it's stable.
What is a pleasant API? For what kind of data?
I wonder if parent post meant something other than SQL?
That's also my reading. And I agree that SQL is a nice DSL yet way behind general programming languages we enjoy today.

I'm dealing with some pretty involved ETL at work these days, and it's really hard and ugly to do it in SQL to "stay in the DB" and keep things fast. It's tempting to read, transform in a high-level language (think pandas) then write back to the DB, but then you give up so much perf.

I would love another API than SQL like what Sparq is doing. You can keep a query optimizer, an engine to execute queries efficiently physically, etc, but it could be controlled with a nice general language API.

Another way to put it is that SQL became the standard and alternative haven't been developed much in comparison. Imagine like the ML world being stuck with Prolog or some DSL. It would be annoying to say the least. Instead people have a variety of APIs and the popular and most developed ones are actually using high-level languages to drive optimized ML engines. These convert function calls and chaining into efficient transformation that use memory, computing units (CPU, GPU, etc) efficiently. The same for DB would be wonderful.

Yeah, a lot of RDBMS are adding JSON support, but the support is often a bit clunky to use and may be missing important features. If you're dealing with a bunch of semistructured APIs that return JSON natively, Mongo makes it really easy to just dump all that into a collection and then just add indices as needed.
>Except the most important problem: A pleasant API.

For that, there are stored procedures.

I've found it's not just scale, but also down to query patterns across the data being stored.

I'm with you on using an RDBMS for almost everything, but worked on quite a few projects where alternatives were needed.

One involved a lot of analytics queries (aggregations, filters, grouping etc.) on ~100-200GB of data. No matter what we tried, we couldn't get enough performance from Postgres (column-based DBs / Parquet alternatives gave us 100x speedups for many queries).

Another was for storing ~100M rows of data in a table with ~70 columns or so of largely text based data. Workload was predominantly random reads of subsets of 1M rows and ~20 columns at a time. Performance was also very poor in Postgres/MySQL. We ended up using a key/value store, heavily compressing everything before storing, and got a 30x speedup compared to using an RDBMS using a far smaller instance host size.

I wouldn't call either of them massive scale, more just data with very specific query needs.

> Another was for storing ~100M rows of data in a table with ~70 columns or so of largely text based data. Workload was predominantly random reads of subsets of 1M rows and ~20 columns at a time.

Kimball's dimensional modelling helps a lot in cases like this, since probably there is a lot of repeated data in these columns.

It's pretty old problem as they are competing ideas. It's OLTP vs OLAP. Postgres is designed for OLTP.
Yeah, I think some of the problems are when both or needed on the same data, or when the use case changes over time.

E.g. our customers are stored in Postgres, so let's also log their actions there linked to the user table.

5 years on someone decides we need to run analytical queries across years of 200M logged actions, joined with other data in the DB.

So now we either have to live with horrible performance, migrate the logs to something suitable for OLAP (and lose all the benefits of a solid RDBMS), or have some syncing/export process to duplicate somewhere suitable for querying.

The other day I said to a junior dev, when you started planning a locking scheme to handle concurrency in you file based system it is time to swap to a db
Similarly. People don't use Object Modeling/Entity relation-ship diagrams anymore.

Every day, I see people struggling with problems that would be easy to understand if you had one. You don't even need to have an RDBMs. They are good just to model how things are related to each other.

Is 'Not performance bound, and dot knowing the future shape of your data' a valid reason? Less overhead on initial rollout to just Toss it up there.

> choosing something else is rarely the right decision

I think this is a little bit of a 'We always did it this way' statement.

There are circumstances where you really don't know the shape of the data, especially when prototyping for proof of concept purposes, but usually not understanding the shape of your data is something that you should fix up-front as it indicates you don't actually understand the problem you are trying to solve.

More often than not it is worth sometime thinking and planning to work out at least the core requirements in that area, to save yourself a lot of refactoring (or throwing away and restarting) later, and potentially hitting bugs in production that a relational DB with well-defined constraints could have saved you from while still in dev.

Programming is brilliant. Many weeks of it sometimes save you whole hours of up-front design work.

> usually not understanding the shape of your data is something that you should fix up-front as it indicates you don't actually understand the problem you are trying to solve.

This is a good point and probably correct often enough, but I also think not understanding the entire problem you are solving is not only common, but in fact necessary to most early-stage velocity. There is need to iterate and adapt frequently, sometimes as part of your go-to-market strategy, in order to fully understand the problem space.

> a relational DB with well-defined constraints could have saved you from while still in dev

This presumes that systems built on top of non-RDBMS are incapable of enforcing similar constraints. This has not been my experience personally. But its possible I don't understand your meaning of constraints in this context. I assumed it to mean, for instance, something like schemas which are fairly common now in the nosql world. Curious what other constraints were you referencing?

> There is need to iterate and adapt frequently, sometimes as part of your go-to-market strategy, in order to fully understand the problem space.

If you're pivoting so hard that your SQL schema breaks, how is a schemaless system going to help you? You'll still have to either throw out your old data (easy in both cases) or figure out a way to map old records onto new semantics (hard in both cases).

I agree with GP that this is a central problem to solve, not something to figure out _after_ you write software. Build your house on stone.

>If you're pivoting so hard that your SQL schema breaks, how is a schemaless system going to help you? You'll still have to either throw out your old data (easy in both cases) or figure out a way to map old records onto new semantics (hard in both cases).

I agree with your comment that it's a central problem to solve and that both options, throwing out data or map old records onto new semantics, is an endemic choice both stacks need to make. I don't agree that it's always possible to solve entirely up front though.

In my experience, it has been less so about whether the storage engine is schemaless or not, even many modern nosql stacks now ship with schemas (e.g. MongoDB). I think the differentiation I make between these platforms is mostly around APIs. Expressive, flexible semantics that (in theory) let you move quickly.

As an aside, I also think the differentiation between all these systems is largely unimpactful for most software engineers. And the choice often made is one of qualitative/subjective analysis of dev ergonomics etc. At scale there are certainly implementation details that begin to disproportionately impact the way you write software, sometimes prohibitively so, but most folks aren't in that territory.

Admittedly, my experience with MongoDB and Cassandra has gained some rust over the last decade, but what makes you say such nosql databases have expressive APIs? Compared to PostgreSQL they have miniscule query languages and it is very hard, if at all possible, to express constraints. And constraints, sometimes self-imposed sometimes not, are what makes projects successful, even startups. Many startups try to find this one little niche they can dominate. That is a self-imposed constraint. People tend to think freedom makes them creative, productive, and inventive, while in fact the opposite is the truth. With opposite of freedom I mean carefully selected constraints not oppression.
> schemas

The confusion there is due to the fact that non-R-DBMS (particular when referred to as noSQL) can mean several different things.

In this context I was replying to a comment about not knowing the shape of your data which implies that person was thinking about solutions that are specifically described as schemaless, which is what a lot of people assume (in my experience) if you say non-relational or noSQL.

That is the sort of constraints I was meaning: primary & unique keys and foreign keys for enforcing referential integrity and other validity rules enforced at the storage level. There are times when you can't enforce these things immediately with good performance (significantly distributed data stores that need concurrent distributed writes for instance - but the need for those is less common for most developers than the big data hype salespeople might have you believe) so then you have to consider letting go of them (I would advise considering it very carefully).

> This presumes that systems built on top of non-RDBMS are incapable of enforcing similar constraints.

Are you kidding? They never can.

The entire point of ditching the relational model is discarding data constraints and normalization.

Never? Many NoSQL stores are offering parity in many of the feature verticals that were historically the sole domain of RDBMS.

Mongo has always had semantics to support normalized data modeling, has schema support, and has had distributed multi-document ACID transactions since 2019 [1]. You don't have to use those features as they're opt-in, but they're there.

I know that full parity between the two isn't feasible, but to say they never can is a mischaracterization.

[1] Small addendum on this: Jepsen highlighted some issues with their implementation of snapshot isolation and some rightful gripes about poor default config settings and wonky API (you need to specify snapshot read concern on all queries in conjunction with majority write concern, which isnt highlighted in some docs). But with the right config, their only throat clearing was whether snapshot isolation was "full ACID", which would apply to postgres as well given they use the same model.

What is the point of using MongoDB with multi-document ACID transactions? Enabling durability in MongoDB is usually costly enough that you can't find a performance benefit compared to Postgres. With JSONB support in PostgreSQL, I dare say, it can express anything that MongoDB can express with its data model and query language. That leaves scalability as the only possible advantage of MongoDB compared to PostgreSQL. And the scalability of MongoDB is rather restrictive, compared to e.g. Cassandra.

And I would never trust a database that has such a bad track record, regarding durability as MongoDB, although I admit that PostgreSQL had theoretical problems there as well in the past.

> future shape of your data

Contrary to what people seem to assume, you actually can change the schema of a database and migrate the existing data to the new schema. There's a learning curve, but it's doable.

If you go schema-less, you run into another problem: not knowing the past shape of your data. When you try to load old records (from previous years), you may find that they don't look like the ones you wrote recently. And, if your code was changed, it may fail to handle them.

This makes it hard to safely change code that handles stored data. You can avoid changing that code, you can accept breakage, or you can do a deep-dive research project before making a change.

If you have a schema, you have a contract about what the data looks like, and you can have guarantees that it follows that contract.

Maintaining backwards compatibility for reading old records in code is not hard. You can always rewrite all rows to the newer format if you want to remove the conpat code, or if the structure changes in an incompatible way. It's pretty comparable to what you have to do to evolve the code/schema safely together.

Having schema is much better for ad-hoc queries though, doubly so if your schemaless types aren't JSON (e.g. protobufs).

With Postgres, you can always just have a JSONB column for data whose shape you're unsure of. Personally, I'd rather start with Postgres and dump data into there and retain the powers of RDBMS for the future, rather than the other way around and end up finding out that I really would like to have features that come out of the box with relational databases.

I think a valid reason for not choosing a relational database is if your business plan requires that you grow to be a $100B+ company with hundreds of millions of users. Otherwise, you will probably be fine with RDBMS, even if it will require some optimizing in the future.

Most $100B+ companies (e.g. Google, Meta, Amazon) were built primarily using relational databases for the first 10-15 years.
Postgres' JSON implementation perfectly adheres to the JSON spec, which actually sucks if you need to support things like NaNs, Inf, etc. It's a good option, but it doesn't work for all datasets.
Wow, TIL that NaN isn't valid JSON.
No longer an issue with things like Spanner, CockroachDB etc
as long as you're willing to write a whole lots of data validation, constraint checking scripts by hand in the future, ETL scripts for non-trivial analytical queries (depending on what NoSQL you chose, but if you chose it for perf this one is usual a price you have to pay). and keep a very rigorous track of the conceptual model of your data somewhere else, or simply don't care about its consistency when different parts of your not-schema have contradicting data (at that point why are you even storing it?)

and that you ruled out using a JSON string column(s) as a dump for the uncertain parts, de-normalization and indexing, and the EAV schema as potential solutions to your problems.

the point is noting is free, and you have to be sure it's a price your willing to pay.

are you ready to give up joins ?, have your data be modeled after the exact queries your going to make ?, for you data to be duplicated across many places ? etc ...

I think the tradeoff is similar to using a weakly typed vs strongly typed language. Strong typing is more up front effort but it will save you down the line because it's more predictable. Similarly, an RDBMS will require more up front planning and design and regular maintenance but that extra planning will save you more time down the line.
> Is 'Not performance bound, and dot knowing the future shape of your data' a valid reason?

That's a very good reason for going with a RDBMS even if looks like it's not the clearest winner for your use case.

If you invert any of those conditions, it may become interesting to study alternatives.

I find myself forced to model access pattern when choosing non relational dbs. This often results in a much less flexible model if you didn’t put a lot of thought into it. Ymmv
No absolutely not. 1 hr spent making a schema and a few hours of migrations is way less than the headaches you'll have by going nosql first.
It is very frustrating to work with engineers who don't understand the nuances of RDBMS and assume they can solve all the things. The small company I work for has 3B rows. We have a high write volume. Can you use an RDBMs database to solve this? Sure, but it would be a terrible waste of engineering effort.
Is that hard? Sounds like it might be a little expensive in terms of server resources but I don’t see why the engineering would be hard.

I’m curious how you handle this with less engineering effort without using an RDBMS.

A "high write volume" requires fast disks, and billions of rows require large disks. Two simple requirements that are the same for any relational or less relational database.

What's interesting is query performance, and a RDBMS supports explicit control over indexing (usually including analyzing execution plans to find out which queries are going to work well). Where do you see "a terrible waste of engineering effort"?

? as opposed to buggy reimplementation of subset of RDBMS functionality on the Application side?
Isn't this like saying you can solve every programming problem you have with <insert your favorite Turing-complete language here>? Of course you can, but aren't there any cases where the tradeoffs outweigh the benefits, even if it's about something selfish like ergonomics or, dare I say, fun?
I agree with you. However, conversely I don't see anything that proves him wrong. Databases are not like programming languages. There is a reason why we don't use punch cards anymore.
The premise here (I think, correct me if I'm mistaken) is that there are net-negative tradeoffs to using nosql/non-rdbms.

If that assumption is true, then it follows that the same argument used in the last statement also applies— that if you're not at massive scale, then its likely the aforementioned tradeoff of not using RDBMS is likely de minimis.

(This assumes that the tradeoffs are of the magnitude that they only manifest impact at scale, hard to address that without concrete examples though)

> (This assumes that the tradeoffs are of the magnitude that they only manifest impact at scale, hard to address that without concrete examples though)

The tradeoff is usually flexibility. You run into flexibility problems anytime requirements change. Scale doesn't factor in.

You're speaking my language. After more than 20 years of custom software dev, this statement has so much merit.
This rings true in my experience. SQL knowledge has consistently helped me over my career. A simple exercise in designing the relational data model vastly improves the system architecture.
Good point. Its often the problem space and other constraints that usually drive these decisions. Its important that you deal with problems when you have them.
Who wants to shard MySQL once a week at Amazon levels of scale? I prefer a managed service with consistent hashing.
Very frequently polled queues come to mind, but usually I'll use a db first anyway as there are benefits to it.
I just use files
Pretty easy. RDBMs have a shit API (SQL is terrible) and the largest (PostgreSQL) have a shit HA story. IMO you should think why you are using an RDBMs.

I have used both and have never regretted NOT using an RDBMs. Maybe its a taste thing but I'd rather use a simple K/V database than a relational database any day.

Hard disagree. The operational overhead of RDBMS and specifically their inherent reliance on a single primary node makes them, in my opinion, a bad place to start your architecture.

I want to be able to treat the servers in my database tier as cattle instead of pets and RDBMSs don't fit this paradigm well. Either NoSQL or NewSQL databases are, in my opinion, a much better place to start.

I feel like RDBMSs being the "default" option is because most people have worked with them in the past and already understand them. It doesn't mean they are the best tool for the job or even the tool most likely to solve the unknown problems you'll encounter in the future.

Only once have I worked on a project where a document database did not completely gimp our ability to deliver the data that was required of us, and that was only because that data was regularly cloned to a relational database we could use for asynchronous tasks. As a project grows, I have, without fail, come to find that you need relations to efficiently deliver the data that new requirements demand.
You can have multi tb postgree database, that are fast and usable whit limited number of cache layers for speed, but you probably don need it. mediums migrate from single postress in 2020.
I don't understand why this comment is down voted. I've been part of a project that uses a dozen different PostgreSQL databases from different services, one of those databases is multi TB with an OLAP usage patterns. And it beats performance-wise all the previous attempts to solve that specific problem.
Postgres is the Post-Ingres database. Originally it used QUEL (as in PostQUEL as in libpq) as a query language. When SQL support was added the product name was changed to PostgreSQL

Postgree isn’t a thing.

Postgres or PostgreSQL are acceptable. Postgre or postgree are not.

https://www.postgresql.org/docs/current/history.html

What is your go to NewSQL database these days (and why) out of interest?
What a joke.