| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Nican 2464 days ago

I am really tired of articles that talk about the different types of databases. People can make a graph databases act like relational databases, and vice-versa. Computers, in the end, are just a Turing machine. Just pay attention that the query that you are executing is actually doing the optimal solution.

I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks. It is important for developers to understand the execution complexity of queries, and how data is distributed across a cluster.

For example, I am usually surprised when people talk about "web-scale", but they do not understand the difference between a "merge-join" and a "hash-join". Or when people do not realize that a sort requires the whole result set to be materialized and sorted.

10 comments

scottlamb 2464 days ago

> For example, I am usually surprised when people talk about "web-scale", but they do not understand the difference between a "merge-join" and a "hash-join".

In fairness, I think "web-scale" generally means the serving path of a website with (say) hundreds of millions of active users. In other words, a heavy OLTP workload. The total query volume is too high for a single-machine DBMS but each operation executed is probably simple. They may not be doing joins at all; many of these websites have gotten away with key/value databases. Where they are joining, most likely at least one side of the join is a small amount of data directly owned by calling end user. (In social products, the other side might be say the users table, to map userids to names, email addresses, profile photos, etc.)

Big joins are more likely to happen in offline paths but likely via something like MapReduce rather than in the database server, and that batch processing framework may use different terminology for similar needs.

In that context, I think it's relatively understandable why someone would be fuzzy on merge-join vs hash-join. There are other skills they might need that are specific to key-value or "NewSQL" databases like Bigtable or Spanner. I wouldn't expect someone who doesn't work on a "web-scale" system to know much about this. These skills aren't simply additive, and "web-scale" isn't necessarily harder, just different.

And then of course there's people who think they have a web-scale website when it's not popular enough that you need to give up on single-machine DBMSs. There's just no hard problem there: not expense of single operations, not overall expense.

jorblumesea 2464 days ago

> Computers, in the end, are just a Turing machine.

This isn't a helpful abstraction. Electricity is just electrons after all, why think about how you're going to wire your house when it's all just atoms? Read a bunch of books on quantum physics, then become an electrician /s

> Just pay attention that the query that you are executing is actually doing the optimal solution.

Isn't that the entire reason for articles like this? eg: If you have data with large amounts of relations a no-sql database probably isn't the right approach.

> I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks.

Absolutely, if just for knowledge's sake, but these are largely not needed if you understand the high level use cases for any given DB. It's cool if you understand B-Trees but not strictly needed to use sql. In fact, many would this this not helpful.

threeseed 2464 days ago

This is non-sensical.

Take a highly nested JSON document and try and implement it in a relational database. You would have an O(1) lookup in MongoDB and a O(n) lookup in MySQL. Or good luck traversing a graph structure in MongoDB when you should've used Neo4J. So in order to have a performant database you do need to "pay attention" and ensure that your access patterns suit the database type.

Also Web Scale is about any approach used at the bigger web companies. And the type of people who use the term are not the same people who would be sitting there optimising SQL queries. So I wouldn't conflate the two.

judofyr 2464 days ago

> This is non-sensical.

Your examples are kinda proving yourself wrong though.

> Take a highly nested JSON document and try and implement it in a relational database. You would have an O(1) lookup in MongoDB and a O(n) lookup in MySQL.

In PostgreSQL (which is not MySQL, but it's a relational database) you can create an index for a column which contains JSON and the lookup for a nested field becomes O(1).

> Or good luck traversing a graph structure in MongoDB when you should've used Neo4J.

Facebook implemented the social graph on top of MySQL: https://www.facebook.com/notes/facebook-engineering/tao-the-....

> So in order to have a performant database you do need to "pay attention" and ensure that your access patterns suit the database type.

I think the point of the parent poster was that you don't need to ensure that your access pattern suite the database type, but that it fits well with the features the database provides. If you're looking at the database landscape today talking about a "database type" isn't really saying much at all.

threeseed 2464 days ago

That's because you took my examples and distorted them.

I specifically mentioned MongoDB and MySQL. Having a JSON type is basically a mini document database since it uses a different/enhanced query language and is often not relational with the rest of the schema.

I also specifically mentioned MongoDB and Neo4J. Yes you could implement a graph in MySQL but as your paper lists you will need to front it with a Memcache in order to cache the graph traversal lookups. With a graph database you generally don't need such a cache.

Nican 2464 days ago

That is exactly it. Some SQL databases (CockroachDB and Postgres come to mind) can have the option of having O(1) JSON lookup. It just has to implement the correct algorithm, and index the data the correct way.

It is not about the database type. It is about the underlying algorithms and data structures.

doh 2464 days ago

Well, not really. It's about how you store the data. For instance all graph data can be stored as hexastore [0] which can be stored literally in any database and you will get similar-ish performance out of them.

Some databases are optimized for certain workflows, so for instance they will store only "pointers" to certain data rather than copying them. In addition their clients obfuscate and abstract the storage layer from user.

But you can really store any data in any database, you just might have to do some extra pre/post-processing and multiple the storage requirements.

[0] http://karras.rutgers.edu/hexastore.pdf

zawerf 2464 days ago

Technical note: big oh isn't a useful measure here. Most databases use b-trees (yes, even mongo) so lookups are at best O(log(n)). That goes for you and the people replying too.

The constant factors are way more important here. It's a 1000x factor difference depending on how durable you need your data to be (whether you need to write to disk or a quorum of network nodes in multiple regions). That is basically the only thing that mattered in the recent mongo vs postgres benchmarks.

alexchamberlain 2464 days ago

When you say O(1)/O(n), what are you measuring the scale in relation to? Number of documents? Number of keys in the document you're looking for?

roywiggins 2464 days ago

Working in a relational-ish database built on top of a poorly-abstracted B-tree object store is among the top ten worst development hells I've personally endured. It was awful.

Yes, you can sort of make any database behave like any other database. But you don't want to. You want a power tool that has query planners built in that can efficiently solve the sort of problems you're dealing with.

manigandham 2464 days ago

The type of database and data model is critical. People trying to make one act like another has led to countless problems.

Everything you said about algorithms and query complexity is part of that foundation, not a replacement for it.

bradly 2463 days ago

> I am really tired of articles that talk about the different types of databases.

I read this article and learned things I didn't know.

What would you like to have done differently? The author not write the post? Or someone not post it to HN? I don't understand your complaint. If you'd like a different article, write a different article.

paulddraper 2464 days ago

> I am really tired of articles that talk about the different types of databases. People can make a graph databases act like relational databases, and vice-versa.

> I wish more time would be spent talking about the underlying algorithms

That's precisely the difference. How they store data is fundamental to what kinds of operations are fast. A row-based and columnar database can look rather similar (PostgreSQL/Redshift). But the performance characteristics are far different.

Did the article come up short?

Nican 2464 days ago

I think part of the problem is that databases can have a mix-and-match of features, and it is hard to classify one database into a single category. I think the article does little in actually helping a developer make an informed decision about what underlying data structures actually fits best with their scenario.

To give some examples, CockroachDB both has column-families and NoSQL characteristics. A column can be specified to be a JSON column, and it can have an inverted index.

Or MemSQL has both row-based and column-based tables, and they use an unorthodox index called "skip lists".

CockroachDB and MemSQL both have different applications and characteristis, but they are just cluttered under "NewSQL", as if it was just was some kind of "SQL but better".

ska 2464 days ago

> Computers, in the end, are just a Turing machine.

Famous last words :)

Detail, in the end, really matter.

tempWinHater 2463 days ago

Especially when you consider that computers are approximations of Turing machines.

Edit: typo

okram 2464 days ago

Check out mm-ADT: http://rredux.com/mm-adt/. The goal of the project is to develop a distributed virtual machine and bytecode specification to make it easier to integrate processors, storage systems, and query languages.

I recently gave a talk on mm-ADT at ApacheCon: https://www.slideshare.net/slidarko/mmadt-a-multimodel-abstr...

shostack 2464 days ago

Do you have any articles you'd suggest that in your opinion cover the appropriate underlying guts of them?

Nican 2464 days ago

Unfortunately, this has something that has been on my backburner for a while. I do not know of any articles that fill exactly the need that I am talking about. :(