by jcagalawan 2861 days ago
Even with Hadoop, there are still algorithmic decisions you have to make, like whether to choose a stripes or pairs strategy when synchronization is needed [1], or choosing appropriate data structures for real-time queries [2]. Sure, something like Hive or Pig might work well enough for certain queries, but for the more complex queries where writing bare MapReduce or Spark is needed, these data structure and algorithm concerns pop up quite fast.

[1] https://lintool.github.io/bigdata-2018w/slides/didp-part02b....

[2] https://lintool.github.io/bigdata-2018w/slides/didp-part09b....
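
To make the pairs vs. stripes choice concrete, here is a minimal single-process sketch, assuming the usual word co-occurrence counting setting where those two terms come up (I'm assuming that is what the slides in [1] cover). The function names are made up for illustration, and a real job would split each function into a mapper and a reducer.

```python
from collections import Counter, defaultdict

def cooccur_pairs(docs):
    """Pairs: one ((w, u), 1) record per co-occurrence, summed per pair key."""
    counts = Counter()
    for doc in docs:
        for i, w in enumerate(doc):
            for j, u in enumerate(doc):
                if i != j:
                    counts[(w, u)] += 1  # many tiny records, trivial reducer
    return counts

def cooccur_stripes(docs):
    """Stripes: one (w, {u: count, ...}) record per word, merged per word key."""
    stripes = defaultdict(Counter)
    for doc in docs:
        for i, w in enumerate(doc):
            # Build the whole neighbor map for w before "emitting" it.
            stripe = Counter(u for j, u in enumerate(doc) if j != i)
            stripes[w].update(stripe)  # fewer, larger records; reducer merges maps
    return stripes

docs = [["a", "b", "c"], ["a", "b"]]
pairs = cooccur_pairs(docs)
stripes = cooccur_stripes(docs)
```

Both produce the same counts. The tradeoff is that pairs shuffles more, smaller records, while stripes shuffles fewer, larger ones and needs each word's stripe to fit in memory.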

2 comments

MapReduce is filled with algorithms. The core system is basically a merge sort engine (the shuffler). In fact, if you ever wondered about the question "sort 2M integers in 1M RAM", well, that was Jeff asking people if they could understand that shuffle was out-of-core.
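
For anyone who hasn't seen an out-of-core sort, here is a rough sketch of the idea: sort chunks that fit in memory, spill each sorted run to disk, then do a k-way merge over the runs. This is not how the actual shuffler is implemented, just the textbook external merge sort; `external_sort` and `chunk_size` are names I made up.

```python
import heapq
import tempfile

def _read_run(f):
    """Stream integers back from one sorted run file."""
    for line in f:
        yield int(line)

def external_sort(values, chunk_size):
    """Yield integers in sorted order, buffering at most chunk_size at a time."""
    runs = []
    chunk = []

    def spill(buf):
        # Sort the in-memory chunk and write it out as one sorted run.
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in sorted(buf))
        f.seek(0)
        runs.append(f)

    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    try:
        # k-way merge: only one pending value per run is held in memory.
        yield from heapq.merge(*(_read_run(f) for f in runs))
    finally:
        for f in runs:
            f.close()

result = list(external_sort([5, 3, 9, 1, 4, 8, 2], chunk_size=3))
```

With 2M integers and room for far fewer in RAM, the same structure applies: the chunk size is set by available memory, and the merge phase only needs one buffered value per run.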

Another example is lexicographic range sharding, which uses reservoir sampling to compute optimal tablet key split points by doing a constant-space-and-time heuristic sampling over the keyspace.
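
A minimal sketch of that idea, assuming plain reservoir sampling (Algorithm R) followed by picking evenly spaced quantiles from the sorted sample. The production heuristic is surely more involved; `reservoir_sample`, `split_points`, and the default `sample_size` are hypothetical names and values for illustration.

```python
import random

def reservoir_sample(keys, k, rng=None):
    """Uniformly sample up to k keys from a stream using O(k) memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, key in enumerate(keys):
        if i < k:
            reservoir.append(key)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = key  # keep the new key with probability k / (i + 1)
    return reservoir

def split_points(keys, num_shards, sample_size=1000, rng=None):
    """Estimate num_shards - 1 lexicographic split keys from a sample."""
    sample = sorted(reservoir_sample(keys, sample_size, rng))
    if not sample:
        return []
    # Evenly spaced quantiles of the sample approximate equal-sized key ranges.
    return [sample[len(sample) * i // num_shards] for i in range(1, num_shards)]

keys = (f"key{i:05d}" for i in range(10000))
splits = split_points(keys, num_shards=4, sample_size=200, rng=random.Random(0))
```

The sample size is fixed no matter how large the keyspace is, which is where the constant-space property comes from; the split quality depends on how large a sample you can afford.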

I used to think MR was just brute force, but it has many levels of algorithms. Probably too many: at some point it became hard to analyze how the system worked because of the various kinds of hedging and recovery strategies.

I noticed that talk comes from a university. I understand that's a totally different domain of work.

Secondly, you are not inventing any algorithm there. You are only using algorithms invented by others.

Thirdly, you are only deciding what solution works better.

Lastly, in an interview you have to invent this algorithm in 45 minutes.

None of this requires you to invent a new algorithm. At least not in 45 minutes. I doubt the person giving that talk did it that quickly himself.