Hacker News new | ask | show | jobs
by qeternity 2223 days ago
Not being snarky, but is that basically a sql interface for map reduce across databases?
1 comments

It's basically federated SQL. Nothing to do with map/reduce, really.
It loads everything from the data sources (presumably pushing as much down to the underlying database) and then does sql in ram. Pretty much the definition of map reduce. Federated sql wouldn’t give speed up across a single database but presto does.
Loading from data sources and doing SQL in ram is very much not the definition of Map Reduce.
I don't think it's a crazy comparison, but AFAIK Presto doesn't have a shuffle stage, which is fundamental in traditional MapReduce. They're both distributed computations, that's the comparison I would make.

Edit: thinking about it more, how could Presto accomplish joins and other SQL operations without a shuffle? Seems very similar to Spark SQL, which is just syntactic sugar for multi-stage MapReduces.

Presto can do shuffles when needed. But it has a smaller max size shuffle than Spark typically because everything has to stay in RAM
It is basically federated SQL to get all the data you need into mem.

Then it does a bunch of techniques, one of witch is map reduce, to do computation.

Its like you took a distributed SQL server and made it never have its own storage outside of memory. Instead it operates on wherever the data is living.

Is there any form of distributed query engine in use today that doesn't fit the definition of the MapReduce pattern? Is describing a distributed query engine as MapReduce still a meaningful distinction from some other, non-MapReduce approach?
> the definition of the MapReduce pattern?

Impala, Presto etc don't fit that model at all - they follow the Volcano model.

In this mode, they are not pure functional - if a task fails, there is no way to reproduce the output of that task.

The func within the map() was guaranteed to produce the same output for the same input across multiple attempts on failure or concurrently (for speculation).

Because of this, they can be faster as they do not wait for a task to be complete to run a subsequent stage & can pipeline better, but at the cost of failing all queries running on a node during a crash.

There are no retries for anything. This was deemed acceptable, if your hardware is reliable and the response to a failed query is just to "run it again", rather than per-node query recovery.

The reason for proper node failure tolerance for Spark/Tez/Flink etc are because they follow the functional model as closely as possible with exceptions for non-deterministic functions (say, UUID() in a SQL call).

The advantage of the failure tolerance is that these tools can push the whole cluster towards a single query performance when it is otherwise idle, because preemption can recover capacity out of a running system, if a higher priority query enters the system at a later point in time.