by jokoon 2923 days ago
I don't like using SQL engines because I don't understand how they work. I never really know whether my query will be O(1), O(log(n)), O(n), etc., or what kind of algorithm will optimize my query.

Who really understands how a SQL engine works? Don't you usually need to understand how something works before you start using it? Which SQL analyst or DB architect really knows the internals of a SQL engine? Do they know about basic data structures? Advanced data structures? Backtracking?

That's why I tend to avoid systematically using a SQL engine unless the data schema is very very simple, and manage and filter the data case by case in code. SQL is good for archiving and storing data, and works as an intermediary, but I don't think it should drive how software works. Databases can be very complex, and unfortunately, since developers like to make things complicated, it becomes hairy.

I think SQL was designed when RAM was scarce and expensive, so to speed up data access, it has to be properly indexed with a database engine. I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

I tend to advocate for simple designs and avoid complexity as much as I can, so I might be biased, but many languages already offer things like sets, maps, multimaps, etc. Tailoring data structures might yield good results too.

Databases still scare me.

12 comments

You're not scared, you're just too lazy to learn the tools of your trade.

Databases are not very complex and use pretty much only textbook data structures and algorithms. Understanding how they process a given query and how a query will probably perform/scale (even without EXPLAIN ANALYZE) is not hard to learn. You do need to learn it (at some point; you don't for small data, which is most). But it's far from difficult.

> That's why I tend to avoid systematically using a SQL engine unless the data schema is very very simple, and manage and filter the data case by case in code.

And that's the mentality that gives us webshops where applying a simple filter results in a couple of seconds of load time and uses hundreds of MB of RAM per request, server side.

Databases are amongst the most complex systems you will ever use as a developer. At the limit, they rival operating systems for complexity - distributed concurrent systems with lots of low-level memory and filesystem action, along with a parser, optimizing compiler and often a code generator too.
The performance characteristics of them, though, are easy to learn, and easier to get a grasp on via EXPLAIN and such.
Given a database and a query, yes, you can understand the execution profile.

What you can't do is start out with a schema and a query and understand the execution profile. Query planners take statistics into account when determining which indexes to use or to fall back to scanning, and can normally only examine a fraction of possible join orders. So you can usually only fully predict the performance profile of simple queries. Databases can run perfectly well one day and fall off a performance cliff the next, when statistics change; doesn't happen often, but it can happen.

More likely, something works perfectly well in test and in the early days, but rapidly becomes untenable as the data grows and superlinear slowdowns emerge.

Thus you need to put run-time monitoring into place, and have a routine process of examining slow-running queries, and fixing things that are slow: rewriting queries, adding indexes, possibly denormalizing data, materializing views, etc. It's an ongoing process in any application that works with lots of data.

I might add, those processes have little to do with SQL. When you have a lot of data, and a lot of queries, then you're going to have to monitor and optimize your databases.

I might also add that a basic understanding of data modelling and what an index can and can't do is sufficient to avoid many, many performance pitfalls (this again has mostly nothing to do with SQL per se). Any undergrad course on databases teaches these basics.

The variance provided by an enormous pile of state and an algorithm which uses that state means it's a bit more unpredictable than most other systems, where you're used to seeing lines, curves and almost always monotonic series.
While that may be true for some of them, a lot of databases are far simpler.

SQLite, for example, is conceptually very simple, and it's fairly cleanly layered, so you can aim to understand the parser, the compiler and the bytecode execution engine separately.

It's also exceptionally well documented. Look at the "Technical and Design Documentation" section here [1].

It won't tell you everything about how a more complex database like e.g. Postgres works, but it will give a very good overview of most of what is relevant for a database user that wants a better understanding of why a database does what it does with a given query.

[1] https://sqlite.org/docs.html

>Databases are not very complex and use pretty much only textbook data structures and algorithms

skeptical expression

Whether they use highly tuned or proprietary or obscure algorithms (they do) is less important than the fact that understanding b-trees and basic normal forms will get you 90% of the way to understanding how to use one. It's just not that hard as a database user.
If you are used to imperative programming it does take a bit of time before you get to thinking in sets. Looking at the code I have inherited, even some reasonably experienced developers don't get to that stage.

Do you really think that replacing an "ORDER BY" statement with your own sort is going to be simpler?

"________ is not very complex and uses pretty much only textbook data structures and algorithms" is pretty much a false statement for any engineered product.
And the difference is usually close to nil for understanding them. (Also see squirrelicus' comment). Whether my database uses a B-tree or a proprietary, patented, insanely complex and finely tuned data structure that does essentially the same thing as a B-tree just a little bit faster is, for all intents and purposes, not a relevant distinction.

Now it's true that most databases have a ton of features and even more optional features that can (and will) interact in interesting ways, but I thought it was fairly obvious that most applications use very few or none of these. They are the 90/10-sort of features; 10 % of a database vendor's customers need 90 % of the specialized features in a database, and every single one of them uses a different handful.

Obviously you don't need to understand all these specialized features to use a database; you only need to grasp the handful if any at all you actually need at a time. Applications striving for wide database compatibility tend to rarely use any of these, simply because they don't exist in all databases, or work differently, or have divergent interfaces.

So any time you have an application that runs on MySQL or postgres in production but is developed and tested on SQLite (an antipattern itself, but I digress), you can be assured that you'll only see fairly basic DDL and SQL.

(You also seem to be intermingling understanding and building. I can use and understand how a typewriter works without having a clue how to build one. Yes, there are lots of hard problems solved by databases, but how they do it is mostly a don't care. I don't have to care how SQLite does power-fail-safe transactions; that it does, and what that means for me, is all I have to know as a user.)

>I don't like using SQL engines because I don't understand how they work. I never really know whether my query will be O(1), O(log(n)), O(n), etc., or what kind of algorithm will optimize my query.

Unless you're generating totally dynamic queries that's a moot point.

You can always try it and measure it, just like, you know, you would profile a program in any programming language. And you can trivially have the database show you the query plan as well.
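
For instance, here's a minimal sketch of asking the database for its plan, using Python's built-in sqlite3 module. The `orders` table, the `idx_orders_customer` index and the `query_plan` helper are made up for illustration, and the exact wording of the plan text varies between SQLite versions:

```python
import sqlite3

# Hypothetical schema, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i) for i in range(10_000)],
)


def query_plan(sql, params=()):
    """Return the human-readable last column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT total FROM orders WHERE customer_id = ?"

# Same query, asked before and after an index exists on the filter column.
plan_without_index = query_plan(query, (42,))
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = query_plan(query, (42,))

print(plan_without_index)  # a full scan of the table
print(plan_with_index)     # a search that names idx_orders_customer
```

A scan is the O(n) case and an index search is the O(log(n)) case, so the plan answers the complexity question directly.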

Do you also not use APIs because you don't know a priori if a call is O(1) or O(N) or O(Nlog(N)) etc?

>I think SQL was designed when RAM was scarce and expensive, so to speed up data access, it has to be properly indexed with a database engine. I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

That's really orthogonal.

Speed and indexes still matter today with big data (or plain "tens of thousands of web users" loads), where we often have to denormalize or use indexed non-sql stores just to get more speed for the huge data we still need to be able to query fast.

Besides, indexed data will be faster to query than non-indexed data in the same storage, whether that storage is disk or RAM.

So unless we're coding something trivial, server side we still want all the speed we can get from our data, rather than just keeping it in whatever simple structures RAM provides.

You wouldn't use a linked list as opposed to a hash table just because your data "fit in RAM". Even in RAM ~O(1) vs ~O(N) matters [1].
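
A rough sketch of the difference, counting visited entries instead of timing anything (a Python list of pairs stands in for the linked list here, and the function name is made up):

```python
def linked_list_style_lookup(pairs, key):
    """Walk the sequence front to back, counting how many entries we visit."""
    visited = 0
    for k, v in pairs:
        visited += 1
        if k == key:
            return v, visited
    return None, visited


n = 100_000
pairs = [(i, f"row-{i}") for i in range(n)]  # unindexed: must be walked
table = dict(pairs)                          # hash table: direct lookup

# Worst case for the walk: the key we want is the last entry.
value, visited = linked_list_style_lookup(pairs, n - 1)

print(visited)                # 100000
print(table[n - 1] == value)  # True
```

Both structures fit in RAM and both return the same value; one visits every entry to get there and the other hashes the key once.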

SQL was invented and caught on because companies had tried and were burned by no-sql stores with incompatible storage standards, lax reliability guarantees, no interoperability between client programs, no all-encompassing logical abstraction (compared to relational algebra) but ad-hoc reinventions of the square wheel, and so on. Ancient issues the wave of Mongo fanboys brought back all over again.

[1] unless the linked list is so tiny as to fit in cache and avoid the hash table indirection, but I digress

Your concern about the opaque and abstract layers below you applies to language compilers as well (which I think is the "better" alternative you seem to prefer).

There is literature legion on the implementation of the database that would assuage you, should you concern yourself with reading it. I don't think you need to. Trust that many many smart people have engineered many many decades of excellent software.

That is not to say that you won't need to peek below or concern yourself with certain choices (indices, commits, columnar-vs-row, etc.) as your performance or access patterns dictate.

More importantly, the relational model is still the gem that shines as a beacon to model logic and data and is too often undervalued due to its association with 'enterprise software' and the implementation language (SQL is a bit warty).

100% agreed - it's exactly the same pattern as other tools. Write it in the most simple/obvious/maintainable way; then, if you have a performance issue (quite rare IME when building something that doesn't obviously need a database FTE from the outset), spend a few minutes semi-educatedly poking at it to see if you can stumble onto a drastic improvement; then, if not, dive deeper.
> There is literature legion on the implementation of the database that would assuage you, should you concern yourself with reading it. I don't think you need to. Trust that many many smart people have engineered many many decades of excellent software.

can you recommend some material?

"Architecture of a Database System" by Hellerstein, Stonebraker and Hamilton [1] gives a good overview. The source code and documentation of PostgreSQL are excellent if you want to dive deeper.

[1] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf

> I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

Keeping all your data in RAM has significant problems, even if it all fits. For example, would you want to lose all your customers' orders and billing information if your code crashed?

In addition to the relational database model, SQL databases offer ACID transactions, which are useful if you want to have consistent and reliable data:

https://en.wikipedia.org/wiki/ACID
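
As a small illustration of the atomicity part, here's a sketch using Python's built-in sqlite3 module. The `accounts` table and the `transfer` helper are hypothetical; the point is only that a transfer which fails halfway leaves no half-applied change behind:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Alice only has 100, so the second UPDATE violates the CHECK constraint
# and the credit already given to Bob is rolled back with it.
ok = transfer("alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))

print(ok)        # False
print(balances)  # {'alice': 100, 'bob': 0}
```

Getting the same guarantee from hand-rolled in-memory structures means writing the undo logic yourself.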

To be fair, using redis or elasticsearch as a main datastore is doable. Although I'm not sure they're much better choices in terms of understanding how they work.

You could summon Antirez I guess

Doesn't redis by default have recovery via the filesystem enabled?
It does - BUT depending on how often you have it syncing changes to disk you can lose data.
> For example, would you want to lose all your customers' orders and billing information if your code crashed?

There are things like WAL and snapshots. Having your dataset in RAM and querying directly doesn't exclude persisting it to disk. Read Stonebraker's "The End of an Architectural Era"[0]. Basically the OP is right in that SQL DBs were designed assuming that RAM was scarce and that assumption is no longer valid. They are inefficient for every common use case. By at least an order of magnitude.

[0]: http://cs-www.cs.yale.edu/homes/dna/papers/vldb07hstore.pdf
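
To show the general idea of a write-ahead log in front of an in-memory dataset, here's a toy sketch. The `LoggedStore` class and its file format are invented for this example and have nothing to do with how H-Store or any real engine does it; there is no snapshotting, no log compaction and no handling of a torn final line:

```python
import json
import os
import tempfile


class LoggedStore:
    """A dict held in RAM; every write is appended to a log file first."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        # Recovery: replay the log from the start to rebuild the dict.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as log:
                for line in log:
                    key, value = json.loads(line)
                    self.data[key] = value
        self._log = open(path, "a", encoding="utf-8")

    def put(self, key, value):
        # Write and fsync the log record before touching the in-memory state.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)  # reads are served from RAM only

    def close(self):
        self._log.close()


log_path = os.path.join(tempfile.mkdtemp(), "store.log")

store = LoggedStore(log_path)
store.put("order:1", {"total": 30})
store.put("order:1", {"total": 45})
store.close()  # stand-in for a crash: the in-memory dict is thrown away

recovered = LoggedStore(log_path)
print(recovered.get("order:1"))  # {'total': 45}
recovered.close()
```

Note that the fsync on every write is exactly the durability-versus-speed knob discussed above for redis.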

> Read Stonebraker's "The End of an Architectural Era"

I tried, but it lost me in section 2.3:

>It seems plausible that the next decade will bring domination by shared-nothing computer systems, often called grid computing or blade computing.

No, it doesn't seem plausible at all. This has been, by some accounts, the future of computing, since at least the 80s.

https://en.wikipedia.org/wiki/Transputer

But shared-nothing is just too darned hard to program for.

Also, main memory is still scarce. We're just barely up to the 1TB of just some of the (small, by today's standards) databases the paper mentions. Ironically, it seemed to emphasize traditional business database needs over what might happen with the tech industry itself, which has turned out to be the main driving force behind database usage (and data creation).

> exclude

As a non-native speaker, I think that preclude is the word you're looking for. Not disagreeing with what you say.

For Postgres at least, you can literally ask it how a query works, via EXPLAIN. Now, there’s a skill to understanding the output of that, but at least it isn’t a black box.
All RDBMS engines have some kind of EXPLAIN, otherwise it's impossible to troubleshoot performance issues. The differences are in the amount of detail you get from the optimizer/query planner, whether you get a profile of the actual run side-by-side with the plan, etc.
There's something similar for SQL Server too.
And the query plan can literally change out from under you at any time. SQL sucks. You should be able to dictate the query plan to the engine directly. If SQL exists as a tool to create and serialize such plans via exploration and experimentation, that’s fine. As a runtime query system it is completely unsuitable.
SQL is a declarative language: You describe what you want, not how to get it.

If that's not what you need, there are plenty of procedural languages with which you describe how to get things.

A query plan will change based on statistics. It's the engine's job to decide if your query is best served by parallelizing it to multiple cores, or deciding if it's worth JITing it before execution.

The actual execution of the query is an implementation detail of the engine, and should be. That's the entire point of a SQL standard: To provide users with an interface to talk to engines, which then retrieve the data you asked for.

It is its core strength and the reason why it's so successful.

"You describe what you want, not how to get it."

The entire archive of the pgsql users' mailing list disagrees. Everyone wants to know why their plan is suboptimal, or why it changed. People also want to know why the planner takes 100ms to generate the plan and only 1ms to execute the query, and so forth.

The idea that you just say what you want and you get the optimal result from your database is just ridiculous to anyone who has had to use them under any significant load.

> Everyone wants

This seems like mere selection bias.

As the sibling comment points out, users who have no problem aren't likely to post "Everything is fine!" to the mailing list. In fact, it would likely be rude to do so.

The thing is, 99.999999% of the time (and I'm probably missing a few 9's) the engine does exactly the most optimal thing.

However, database engines aren't perfect -- I know I've encountered bugs in older SQL server versions where the query never finishes but making some trivial adjustments fixes it. This is a bug. And most mailing lists are filled with people encountering bugs. Saying what you want and getting the best result is exactly what you should expect.

> The entire archive of the pgsql users' mailing list disagrees. Everyone wants to know why their plan is suboptimal, or why it changed.

And rather a lot of the archives of $scripting_language_of_your_choice are people confused about duck-typing/type system failures. That doesn't mean scripting languages should be replaced with statically typed ones; just that there are pain points in every system, and right (or wrong) tools for every job.

Don't believe me? Check how much of the FAQ traffic from first-time Rustaceans (or Swift/Java/etc. newcomers) has to do with how to satisfy their language's type system.

You pick your poison. SQL gives you a clearly defined set of tradeoffs up front. If that's not for you, no worries, move along.

The biggest tradeoff SQL gives you is exactly the one GP points out: it's a leaky abstraction where understanding performance is tied to implementation so much, that you pretty much lose all the benefits of SQL, except familiarity with the technology.
> You should be able to dictate the query plan to the engine directly.

That's like arguing you should be able to dictate the assembly that your compiler produces. The entire point of SQL (and most compilers) is that they can use their knowledge to optimize the result in ways that are too difficult or too involved for humans to do. Most people cannot out-optimize a compiler in the general case. And most people cannot out-optimize a SQL DBMS.

Also, with SQL, the result might be highly dependent on the data itself. A table with 1,000 rows yesterday might be queried entirely differently from the same table now with 100,000 rows. Are you going to constantly go back and dictate the query plan to the engine every few months as the data changes? Probably not. Use the tool as intended and you'll be fine. Anything else is premature optimization at best.

You're both right, there is a values mismatch here. The reason we need optimizing compilers to target modern CPUs is because the hardware architecture is so complicated. The root cause of the complexity is less essential concerns and more an artifact of the history of how things unfolded, compounded by the difficulty of disrupting the current local maxima that we're stuck in.
Funny enough, in the more recent lecture I posted for SQLite below, it actually doesn’t work that way. The query planner will generate the same plan every time provided you don’t run the analyze command over your data to generate new statistics.
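
Here's a quick way to check that behaviour for yourself, as a sketch with Python's built-in sqlite3 module. The `events` table, its indexes and the `query_plan` helper are made up; what the plan looks like after ANALYZE depends on your SQLite version and data, so the code only prints it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, user_id INTEGER)")
conn.execute("CREATE INDEX idx_events_kind ON events (kind)")
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")


def query_plan(sql):
    """Return the plan as a list of the human-readable detail strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]


query = "SELECT id FROM events WHERE kind = 'click' AND user_id = 7"

plan_empty = query_plan(query)  # planned against an empty table

conn.executemany(
    "INSERT INTO events (kind, user_id) VALUES (?, ?)",
    [("click" if i % 2 else "view", i) for i in range(50_000)],
)
plan_grown = query_plan(query)  # 50,000 rows later, still no ANALYZE

# ANALYZE gathers statistics into sqlite_stat1; only now can the data
# itself influence which index the planner prefers.
conn.execute("ANALYZE")
stats = conn.execute("SELECT idx, stat FROM sqlite_stat1 ORDER BY idx").fetchall()
plan_analyzed = query_plan(query)

print(plan_empty == plan_grown)  # True
print(stats)
print(plan_analyzed)
```
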
You are almost certainly going to be worse at deciding how your query should run than the query planner will be.

A straightforward Postgres database will almost certainly fulfill your performance requirements unless you have a pretty edge-case scenario. Under those circumstances, it would be foolish to start by assuming that you are doing something that the database cannot easily accomplish. After all, it’s much easier to migrate the parts of your stack that require custom code to a new infrastructure when you reach scale, rather than trying to constantly patch bugs and performance issues in your shitty wannabe SQL engine!

IME query planners do a rather good job at this and scale nicely from small (few pages) to very large data sets. Missteps are rare, but will always happen. It's a heuristic, not magic.
You can force a certain query plan in newer versions of SQL Server (https://www.red-gate.com/simple-talk/sql/database-administra...), but most of the time query plans won’t change unless the statistics or cardinality of the data changes. You control both of those things.
> As a runtime query system it is completely unsuitable.

Millions of users beg to differ.

There are some tools to force specific hints/plans on the engine depending on the database e.g. SQL Plan Management in Oracle
> Who really understands how a SQL engine works?

Presumably, at a minimum, all the people who work on such engines, including committers to the various open-source ones.

But also lots of other people.

> Don't you usually need to understand how something works before you start using it?

No. Very few programmers understand how compilers work before they start using them. I'd say it's more common to require working on something to really understand how it works than the reverse.

> I think SQL was designed when RAM was scarce and expensive, so to speed up data access, it has to be properly indexed with a database engine. I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

Indexing is no less important for in-memory data access.
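
As a sketch of what an in-memory index looks like when you build it by hand (the record layout and function names here are invented for the example), a sorted list plus binary search answers a range query without touching every record:

```python
import bisect

# Hypothetical in-memory "table": a plain list of records.
records = [{"id": i, "price": (i * 37) % 1000} for i in range(10_000)]

# The "index": (price, position) pairs kept sorted by price.
price_index = sorted((r["price"], pos) for pos, r in enumerate(records))


def range_by_scan(lo, hi):
    """O(n): look at every record."""
    return [r for r in records if lo <= r["price"] <= hi]


def range_by_index(lo, hi):
    """O(log n + k): binary search for both ends, then read the k matches."""
    start = bisect.bisect_left(price_index, (lo, -1))
    end = bisect.bisect_right(price_index, (hi, len(records)))
    return [records[pos] for _, pos in price_index[start:end]]


scan_ids = sorted(r["id"] for r in range_by_scan(100, 104))
index_ids = sorted(r["id"] for r in range_by_index(100, 104))

print(scan_ids == index_ids)  # True
print(len(index_ids))         # 50
```

This is roughly what a database index on `price` gives you for free, minus the work of keeping it correct on every insert, update and delete.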

This is what you get when developers are afraid to touch anything which doesn't look like Javascript or JSON.

The incompetence and ignorance shown in this post is simply astounding.

> I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

It's also a question of price. Once you get above about 256 GB of RAM server prices start to go up really really fast. And while there are systems with dozens of TB of RAM they are stupidly expensive.

So even if, in theory, most databases could fit in RAM, most people cannot afford that. And at the end of the day, 100+TB isn't that large a database in the grand scheme of things and you're not easily fitting that into RAM.

> Once you get above about 256 GB of RAM

I think it might be as high as 1TB these days, though with what's going on with DDR4 prices, the situation is strange at the moment.

Of course, I don't dispute your point that a 100+TB database isn't all that large, especially with indexes.

I suspect that this false dichotomy of "fit in RAM" and "big data" has resulted in many needless forays into distributed computing.

I like to break up data problems like this:

* Trivially Small

* Fits In RAM

* Fits on one server (CPU / Storage)

* Fits on one Big Hardware (Mainframe or other Specialized equipment)

* Requires Distributed Storage And/Or Processing

I think it's important to know that other boundaries exist, at least conceptually, but, ultimately, it's the practical considerations that are important. For example, for what you listed, what would you say are the storage cutoffs for each tier today?

> data problems

I find there is also, occasionally, a lack of awareness of when a problem is data-heavy, encouraged by abstraction layers like ORMs.

This can lead to casual or naive (neither meant derogatorily) distributed computing, where the app hoovers up the data from database to be processed. This can be great for anything CPU-intensive but terrible for the I/O-intensive.
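
A small sketch of the two styles side by side, using Python's built-in sqlite3 module and a made-up `orders` table. The first version pulls every row into the application, the second lets the database do the aggregation and ships back only the result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i) for i in range(10_000)],
)

# Application-side: hoover up all 10,000 rows, then sum them in a loop.
rows = conn.execute("SELECT customer_id, total FROM orders").fetchall()
totals_app = {}
for customer_id, total in rows:
    totals_app[customer_id] = totals_app.get(customer_id, 0) + total

# Database-side: one set-based statement, 100 result rows cross the boundary.
totals_db = dict(
    conn.execute("SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id")
)

print(len(rows), len(totals_db))  # 10000 100
print(totals_app == totals_db)    # True
```

With an in-process SQLite database the difference is small; over a network connection to a real server, the row count crossing the wire is what hurts.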

> Fits on one Big Hardware (Mainframe or other Specialized equipment)

I don't think this really exists today, unless you're including an otherwise commodity high-end (e.g. 8-socket) server that carries up to a 4x price premium in "specialized equipment".

I'm aware that mainframes still exist, but, for a variety of reasons [1], I'd consider them as being in a world of their own, rather than a step on this continuum.

[1] e.g. inherently distributed architecture, not obvious if storage scale actually greater than high-end commodity, interop issues

Understanding the relational model properly will allow you to write simple, performant code. Getting rid of a relational database means replacing all of its well-tested code with code you have to write yourself.
I completely disagree.

Unless your data requirements are very specific, mem only, or not adapted to the relational paradigm, any SQL engine will provide you with the best and most efficient algorithms to manipulate your data in the most common situations.

I think that, if anything, databases should be used more.

> Unless your data requirements are very specific, mem only, or not adapted to the relational paradigm, any SQL engine will provide you with the best and most efficient algorithms to manipulate your data in the most common situations.

Too bad that Michael Stonebraker, Turing Award winner, disagrees with you. SQL databases are not the best solution for any common use case from the performance perspective.

Never mind what they do to the design of an application. IMHO fewer people should default to using a database upfront. At least while prototyping the idea.

https://cs.brown.edu/~ugur/fits_all.pdf

> IMHO fewer people should default to using a database upfront. At least while prototyping the idea.

Surely it should be the opposite? While you're prototyping you should use a DB by default, then switch to your own implementation if you find out that it will speed things up (and you need that speed). It's not like the code that you would replace a DB with is going to be trivial, using a DB is going to keep the code simple until you need it to be complex.

You missed the "in the most common situations" part.

Modern RDBMS can handle millions of ops/sec and terabytes of data. They do the job fine 99% of the time and are constantly adding new features.

If you have the 1% need for another data store, there are hundreds of options, and interestingly many of them are also starting to implement SQL as an interface now.

How do you prefer to persist your data?
In-memory database, dumped to file? Not saying how I would do it, just thinking of alternatives.

I’m using postgresql usually

Ideally, if your DB fits into RAM, it should be served from RAM entirely.

If less frequently used parts of it don't fit, offload to disk.

Ideally, the DB should guarantee data integrity after a crash even if the DB is served from RAM.

That's the ideal scenario: You have the best of all worlds.

Coincidentally, that's exactly what Postgres does.

On top of that, it's the best NoSQL database currently available.

Of course, you can use it with SQL if you ever feel the need.

> On top of that, it's the best NoSQL database currently available.

Surely that depends on the load? From what I understand Postgres isn't easy to set up when the data spans more than one server, which many NoSQL databases do actually handle fairly well.

That's got some seriously limiting characteristics that make it unusable for anything but the simplest settings. Note that the subject here (SQLite) is mostly used in an embedded context and that's roughly what you are describing. Even though that is a very large number of applications it is not the one you are likely to encounter when doing stuff in services for multiple users.
> I never really know if my query will be O(1), O(log(n)), O(n), etc, or what kind of algorithm will optimize my query

Aka leaky abstractions. SQL just wasn't designed for performance. A query language that takes performance into account should definitely ignore any ideas from SQL. Maybe have declared data structures instead of tables with operations that use known algorithms, explicitly chosen.

> SQL just wasn't designed for performance.

That is strictly true in the literal sense that SQL is just a textual representation of relational algebra and calculus, and no one says a mathematical notation is "designed for performance" or otherwise.

But in a more practical, useful sense, it's the language most designed for performance, since the query planner has so much leeway to perform optimisation. It can do more dramatic transformations of the parse tree even than a C compiler.

If you cannot predict performance of a query you cannot write a query with guaranteed performance characteristics and cannot guarantee performance of your application. So, no, SQL and RDBMSs in general are neither designed for performance nor are any good at it. Which is one of the reasons we have the whole world outside of them.
> Which is one of the reasons we have the whole world outside of them.

The other reason seems to be a failure to learn (or a belief in "this time it's different") the lessons from the 70s and 80s that informed many of the fundamental design decisions of RDBMSes.

The "NoSQL" world has had a remarkable number of incidents with ACID failures. Unsurprisingly, a fix involves sacrificing performance.

This isn't to say that the trade-off is never worth it. In fact, RDBMSes can and do offer such trade-offs as options. It's just not the default.

It may be accurate that RDBMSes are designed and "shipped" with default configurations that are ACID-first [1] (to coin a term), whereas the "world outside" is performance-first [2].

However, it's nowhere near accurate, and maybe even disingenuous, to suggest that SQL or the relational model somehow prevents high performance. The reality of the actual tools contradicts your claim, as the sibling comment pointed out.

[1] with the exception of early MySQL which defaulted to MyISAM as a storage engine

[2] as the joke goes, so is writing to /dev/null and reading from /dev/zero

SQL and the relational model are designed for good performance rather than predictable performance. The engine may use statistics collected at runtime to improve performance, so this is not predictable, but it will be fast and getting faster. Hardcoding the query plan will most likely be slower unless you know the usage patterns up front, including the amount of data in the tables. This is very rarely the case.
you cannot... you cannot... cannot

You keep using that word. I don’t think it means what you think it means.

Because people have been doing it since the 1980s...