by Nican
2464 days ago
I am really tired of articles that talk about the different types of databases. People can make a graph database act like a relational database, and vice versa. Computers, in the end, are just Turing machines. What matters is whether the query you are executing is actually run with an efficient plan. I wish more time were spent talking about the underlying algorithms that the different databases use to execute queries. It is important for developers to understand the execution complexity of queries, and how data is distributed across a cluster. For example, I am usually surprised when people talk about "web-scale" but do not understand the difference between a "merge-join" and a "hash-join", or do not realize that a sort requires the whole result set to be materialized and sorted before any rows are returned.
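To make the merge-join vs hash-join distinction concrete, here is a minimal sketch of both over lists of `(key, value)` tuples. The function names and row layout are my own for illustration, not any particular database's implementation. Real engines add spilling to disk, partitioning, and cost-based choice between the two.

```python
from collections import defaultdict


def hash_join(left, right):
    """Build a hash table on `left`, then probe it once per row of `right`.

    Needs memory proportional to `left`, but neither input has to be sorted.
    """
    table = defaultdict(list)
    for key, lval in left:
        table[key].append(lval)
    return [
        (key, lval, rval)
        for key, rval in right
        for lval in table.get(key, [])
    ]


def merge_join(left, right):
    """Walk two inputs that are ALREADY sorted by key, in lockstep.

    Needs almost no extra memory, but pays for a sort if the inputs
    are not already ordered (for example by an index).
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing of the two runs.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    out.append((lk, lval, rval))
            i, j = i_end, j_end
    return out
```

Both return the same rows for sorted inputs. The trade-off is memory for the hash table versus the cost of having (or producing) sorted inputs.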
In fairness, I think "web-scale" generally means the serving path of a website with (say) hundreds of millions of active users. In other words, a heavy OLTP workload. The total query volume is too high for a single-machine DBMS, but each operation executed is probably simple. They may not be doing joins at all; many of these websites have gotten away with key/value databases. Where they are joining, most likely at least one side of the join is a small amount of data directly owned by the calling end user. (In social products, the other side might be, say, the users table, to map user IDs to names, email addresses, profile photos, etc.)
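As a rough illustration of that kind of "join", here is a sketch where plain dicts stand in for two key/value tables. `USERS`, `FRIENDS`, and `friend_names` are hypothetical names, not any real store's API.

```python
# Hypothetical in-memory stand-ins for two key/value tables.
USERS = {
    1: {"name": "Ann"},
    2: {"name": "Bob"},
    3: {"name": "Cy"},
}
FRIENDS = {1: [2, 3], 2: [1]}  # user ID -> IDs of that user's friends


def friend_names(user_id):
    """Join the caller's small friend list against the users table."""
    # Small side: data directly owned by the calling user.
    friend_ids = FRIENDS.get(user_id, [])
    # Large side: one point lookup per friend. A real store would
    # probably batch these into a single multi-get request.
    return [USERS[fid]["name"] for fid in friend_ids if fid in USERS]
```

The work per request scales with the size of the caller's own data, not with the size of the users table, which is why no join algorithm ever needs to be chosen.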
Big joins are more likely to happen in offline paths, and there probably via something like MapReduce rather than in the database server; that batch processing framework may use different terminology for similar needs.
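For comparison, a join in that batch style is often written as a "reduce-side join". The sketch below imitates the map, shuffle, and reduce phases in a single process; the function names and the `"user"`/`"order"` tags are assumptions for illustration, not a real framework's API.

```python
from collections import defaultdict
from itertools import product


def map_phase(users, orders):
    """Tag each record with its source table and emit it under the join key."""
    for uid, name in users:
        yield uid, ("user", name)
    for uid, item in orders:
        yield uid, ("order", item)


def shuffle(pairs):
    """Group tagged records by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    for key, tagged in groups.items():
        names = [value for tag, value in tagged if tag == "user"]
        items = [value for tag, value in tagged if tag == "order"]
        for name, item in product(names, items):
            yield key, name, item


def reduce_side_join(users, orders):
    return sorted(reduce_phase(shuffle(map_phase(users, orders))))
```

The grouping step does the same job that hashing or sorting does inside a database join operator, just under a different name.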
In that context, I think it's relatively understandable why someone would be fuzzy on merge-join vs hash-join. There are other skills they might need that are specific to key-value or "NewSQL" databases like Bigtable or Spanner. I wouldn't expect someone who doesn't work on a "web-scale" system to know much about this. These skills aren't simply additive, and "web-scale" isn't necessarily harder, just different.
And then of course there are people who think they have a web-scale website when it's not popular enough to need to give up on a single-machine DBMS. There's just no hard problem there: neither the cost of individual operations nor the overall load is high.