by hashhar 1420 days ago

I should disclose that I contribute to Trino.

I agree, but it depends a bit on what you are using them for. If you mainly use the tool to JOIN some data in bulk and then write the output somewhere else (i.e. ETL), either will serve you fine.

If you write complex queries with multiple filters and want to JOIN across multiple datasets, sure, Spark can do that as well, but it's not as efficient at pushing down computation to the source.

For example, a query like SELECT c.custkey, sum(totalprice) FROM orders o INNER JOIN customer c ON o.custkey = c.custkey WHERE o.orderstatus = 'O' GROUP BY c.custkey; when run on Spark will pull both tables into memory, then perform the join and the filter for orderstatus = 'O', and then compute the sum.

Trino, on the other hand, will push the entire query down into the remote database (in this case; for other queries it'll push down only some parts of the query), so the source database does not need to return gigabytes of data over the network every time the query runs, and the query finishes faster as a result.
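To make the difference concrete, here is a rough sketch in Python. It does not use Trino or Spark at all: an in-memory SQLite database stands in for the remote RDBMS, and the table contents plus the helper names make_source, full_pushdown and pull_then_compute are made up for illustration. It only shows how many rows have to leave the source in each approach.

```python
import sqlite3

# Assumption: in-memory SQLite plays the role of the remote database.
def make_source():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (custkey INTEGER PRIMARY KEY);
        CREATE TABLE orders (orderkey INTEGER PRIMARY KEY, custkey INTEGER,
                             orderstatus TEXT, totalprice REAL);
    """)
    conn.executemany("INSERT INTO customer VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [(1, 1, "O", 10.0), (2, 1, "F", 20.0), (3, 2, "O", 5.0),
         (4, 2, "O", 7.0), (5, 3, "F", 99.0), (6, 1, "O", 1.0)],
    )
    return conn

QUERY = """
    SELECT c.custkey, sum(totalprice)
    FROM orders o INNER JOIN customer c ON o.custkey = c.custkey
    WHERE o.orderstatus = 'O'
    GROUP BY c.custkey
    ORDER BY c.custkey
"""

def full_pushdown(conn):
    # The source runs join + filter + aggregation; only result rows come back.
    rows = conn.execute(QUERY).fetchall()
    return rows, len(rows)

def pull_then_compute(conn):
    # Both tables are fetched in full, then joined, filtered and summed locally.
    orders = conn.execute(
        "SELECT custkey, orderstatus, totalprice FROM orders").fetchall()
    customers = conn.execute("SELECT custkey FROM customer").fetchall()
    known = {row[0] for row in customers}
    totals = {}
    for custkey, status, price in orders:
        if status == "O" and custkey in known:
            totals[custkey] = totals.get(custkey, 0.0) + price
    return sorted(totals.items()), len(orders) + len(customers)

if __name__ == "__main__":
    conn = make_source()
    print(full_pushdown(conn))
    print(pull_then_compute(conn))
```

Both functions return the same totals, but the second one has to transfer every row of both tables to get there. With six orders that doesn't matter; with gigabytes it does.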

Trino tries to push down operations that can be done more efficiently in the remote system. For example, filtering on a column that has an index in the remote RDBMS will be faster than pulling all the data and then filtering in Trino. Spark doesn't have strong pushdown, so it has to pull most of the raw data and then apply processing on top of it.
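The filter case can be sketched the same way. Again this is only an illustration with SQLite standing in for the remote RDBMS; the index name idx_orders_status, the generated data and the two helper functions are assumptions, not anything from Trino or Spark.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the remote RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (orderkey INTEGER PRIMARY KEY, "
    "orderstatus TEXT, totalprice REAL)")
# Index on the filter column, so the source can look up matching rows.
conn.execute("CREATE INDEX idx_orders_status ON orders (orderstatus)")
# 1000 made-up orders; every 10th one has status 'O'.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "O" if i % 10 == 0 else "F", float(i)) for i in range(1, 1001)],
)

def filter_in_source(conn):
    # Predicate is part of the query sent to the source.
    return conn.execute(
        "SELECT orderkey, totalprice FROM orders WHERE orderstatus = ?",
        ("O",),
    ).fetchall()

def filter_in_engine(conn):
    # Every row is transferred, then filtered on the engine side.
    all_rows = conn.execute(
        "SELECT orderkey, orderstatus, totalprice FROM orders").fetchall()
    kept = [(key, price) for key, status, price in all_rows if status == "O"]
    return kept, len(all_rows)

if __name__ == "__main__":
    pushed = filter_in_source(conn)
    kept, transferred = filter_in_engine(conn)
    print(len(pushed), "rows transferred with the filter pushed down")
    print(transferred, "rows transferred when filtering afterwards")
```

Same rows in the end, but the pushed-down version moves a tenth of the data here and lets the source use its index to find them.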

That's one of the main differences. Spark is a distributed job execution framework first, while Trino is a distributed federated query engine first, and it shows in their strengths and weaknesses.

If you want to run arbitrary user-defined transformations on data, then Spark definitely has much more to offer than Trino.