| HN Mirror


	by qeternity 2223 days ago
	Not being snarky, but is that basically a sql interface for map reduce across databases?

Boxxed 2223 days ago

It's basically federated SQL. Nothing to do with map/reduce, really.

qeternity 2223 days ago

It loads everything from the data sources (presumably pushing as much as possible down to the underlying database) and then does SQL in RAM. Pretty much the definition of map reduce. Federated SQL wouldn't give a speedup across a single database, but Presto does.

zaphar 2223 days ago

Loading from data sources and doing SQL in ram is very much not the definition of Map Reduce.

cmckn 2223 days ago

I don't think it's a crazy comparison, but AFAIK Presto doesn't have a shuffle stage, which is fundamental in traditional MapReduce. They're both distributed computations, that's the comparison I would make.

Edit: thinking about it more, how could Presto accomplish joins and other SQL operations without a shuffle? Seems very similar to Spark SQL, which is just syntactic sugar for multi-stage MapReduces.
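
To make the shuffle question concrete, here is a minimal sketch of a shuffle-based hash join: both inputs are repartitioned by a hash of the join key so that matching rows land on the same worker, and each worker then joins its own partition locally. This is an illustration only, not how Presto or Spark actually implement joins; the function names (`shuffle`, `local_hash_join`, `distributed_join`) and the `num_workers` parameter are hypothetical, and the "workers" are just lists in one process.

```python
from collections import defaultdict

def shuffle(rows, key, num_workers):
    """Route each row to a partition chosen by hashing its join key."""
    partitions = [[] for _ in range(num_workers)]
    for row in rows:
        partitions[hash(row[key]) % num_workers].append(row)
    return partitions

def local_hash_join(left, right, key):
    """Join two co-located partitions: build a hash table on left, probe with right."""
    table = defaultdict(list)
    for l in left:
        table[l[key]].append(l)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]

def distributed_join(left, right, key, num_workers=3):
    """Shuffle both sides by key, then join each pair of partitions independently."""
    left_parts = shuffle(left, key, num_workers)
    right_parts = shuffle(right, key, num_workers)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(local_hash_join(lp, rp, key))
    return out

users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "total": 10}, {"id": 1, "total": 5}, {"id": 3, "total": 7}]
joined = distributed_join(users, orders, "id")
```

Without the repartitioning step, a worker holding a `users` row would have no guarantee that the matching `orders` rows are on the same machine, which is why a join over two large distributed inputs generally needs some form of shuffle.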

WookieRushing 2223 days ago

Presto can do shuffles when needed. But its maximum shuffle size is typically smaller than Spark's, because everything has to stay in RAM.

WookieRushing 2223 days ago

It is basically federated SQL to get all the data you need into mem.

Then it uses a bunch of techniques, one of which is map reduce, to do the computation.

It's like you took a distributed SQL server and made it never have its own storage outside of memory. Instead it operates on the data wherever it lives.

kyllo 2223 days ago

Is there any form of distributed query engine in use today that doesn't fit the definition of the MapReduce pattern? Is describing a distributed query engine as MapReduce still a meaningful distinction from some other, non-MapReduce approach?

gopalv 2223 days ago

> the definition of the MapReduce pattern?

Impala, Presto etc don't fit that model at all - they follow the Volcano model.

In this model, they are not purely functional - if a task fails, there is no way to reproduce the output of that task.

The func within the map() was guaranteed to produce the same output for the same input across multiple attempts on failure or concurrently (for speculation).

Because of this, they can be faster as they do not wait for a task to be complete to run a subsequent stage & can pipeline better, but at the cost of failing all queries running on a node during a crash.

There are no retries for anything. This was deemed acceptable, if your hardware is reliable and the response to a failed query is just to "run it again", rather than per-node query recovery.

Spark/Tez/Flink etc. have proper node failure tolerance because they follow the functional model as closely as possible, with exceptions for non-deterministic functions (say, UUID() in a SQL call).

The advantage of the failure tolerance is that these tools can push the whole cluster towards a single query performance when it is otherwise idle, because preemption can recover capacity out of a running system, if a higher priority query enters the system at a later point in time.
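
The contrast above can be sketched roughly as follows. The first half chains Python generators in the Volcano (iterator) style: rows stream through the operators with no materialized intermediate result, so there is nothing to replay if something fails midway. The second half runs a map stage as independent tasks over splits, materializes each task's output, and retries a failed task, which is only safe because `map_fn` is assumed to be deterministic. All names here (`scan`, `filter_op`, `project`, `run_map_stage`, `fail_once`) are hypothetical, and the simulated crash stands in for a real node failure; this is not the code of any of the engines mentioned.

```python
def scan(rows):
    """Leaf operator: yields rows one at a time."""
    for row in rows:
        yield row

def filter_op(child, pred):
    """Pulls rows from its child and passes on those matching pred."""
    for row in child:
        if pred(row):
            yield row

def project(child, fn):
    """Pulls rows from its child and transforms each one."""
    for row in child:
        yield fn(row)

def run_map_stage(splits, map_fn, max_attempts=3, fail_once=frozenset()):
    """Run map_fn over each split as a separate task, retrying failed tasks.

    fail_once lists split indexes whose first attempt is made to fail,
    simulating a crash. Retrying is safe only if map_fn is deterministic.
    """
    pending_failures = set(fail_once)
    outputs = []
    for i, split in enumerate(splits):
        for _attempt in range(max_attempts):
            try:
                if i in pending_failures:
                    pending_failures.discard(i)
                    raise RuntimeError("simulated node crash")
                outputs.append([map_fn(r) for r in split])  # materialized output
                break
            except RuntimeError:
                continue  # re-run the same task on the same input
        else:
            raise RuntimeError(f"task {i} failed after {max_attempts} attempts")
    return outputs

# Pipelined: each row flows through all operators before the next is read.
pipelined = list(project(filter_op(scan(range(6)), lambda r: r % 2 == 0),
                         lambda r: r * 10))

# Staged: task 1 crashes once and is transparently retried.
staged = run_map_stage([[1, 2], [3, 4]], lambda r: r * 10, fail_once={1})
```

In the pipelined half, a failure partway through leaves no per-task output to recover from, so the only recourse is to rerun the whole query. In the staged half, the stage boundary costs latency but gives a unit of work that can be re-executed or preempted.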
