Hacker News
by gwittel 1908 days ago
Lots of general reasons: inertia, etc. Often companies stick to 1-2 preferred technologies, and adding Presto isn't seen as a gain (even though it helped Facebook quite a bit). I also suspect Amazon repackaging it as Athena reduced adoption to some extent.

If looking at Presto (now Trino), the main thing to keep in mind is that you inherit the limitations of the underlying data store.

It's best when the underlying store (plus the DB adapter implementation) lets you parallelize work, keep each node busy, and avoid processing data unnecessarily. Columnar-format data in Hive/S3 works great for this (IIRC this was a major early use case). Other sources like an RDBMS will have natural limitations. Kafka has its own issues, since each query generally means re-scanning a topic, etc.

I see the data bridges as most useful as a way to bring data into the native/optimal format. Then do the heavy lifting in Presto.

@gwittel, appreciate you sharing your insights. Will you be able to elaborate on "RDBMS will have natural limitations"? Can you provide a specific example?

Presto gets most of its speed from parallelizing work and taking advantage of columnar formats when it can.

In the case of an RDBMS, can you get performance gains if you try to parallelize a query from many clients? It will depend on the DB adapter and the query. In the general case, if you slice a query into N shards it’s not necessarily going to go faster. It’s still the same DB underneath, bound by the same hardware limits.
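
To put rough numbers on that, here is a toy model in Python. It is only a sketch: the function name, the throughput figures, and the idea that the database has a single hard throughput ceiling are all assumptions made up for illustration, not measurements of any real system.

```python
def scan_seconds(total_rows, shards, per_connection_rows_per_s, db_max_rows_per_s):
    """Toy estimate of the time to read total_rows over `shards` parallel connections.

    Hypothetical model: each connection can pull at most
    per_connection_rows_per_s, but the database as a whole can never
    serve more than db_max_rows_per_s, however many clients connect.
    """
    aggregate = min(shards * per_connection_rows_per_s, db_max_rows_per_s)
    return total_rows / aggregate


# Made-up figures: 10M rows, 50k rows/s per connection, 200k rows/s DB ceiling.
for shards in (1, 2, 4, 8, 16):
    print(shards, scan_seconds(10_000_000, shards, 50_000, 200_000))
```

In this model the first few shards help, and then the time stops improving once the shards together hit the database's own ceiling. Real databases can also get slower under extra concurrent load, which the sketch does not capture.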

Yeah, this is a common misconception. Trino and Presto were designed to replace and speed up the Hive engine.

As you say, gwittel, adding Trino to an RDBMS itself won't speed things up. However, if you have operational data sitting in that RDBMS and data sitting in a data lake on something like S3, then you can quickly join those datasets together.

Trino does its best to take advantage of any existing indexes that the RDBMS has by doing a pushdown, but it won't return that data any faster than the underlying database could. It's the joining with data sets from other data sources that makes the RDBMS connector worthwhile.

If you have a 1GB customer dataset in MySQL and a 100TB dataset in S3 of all your orders, then Trino will first run a quick query against your MySQL database, get the list of customer ids that match the query, and then use that list to filter the orders.

  SELECT *
  FROM mysql.db_name.customer AS c
  JOIN s3.db_name.orders AS o
    ON c.id = o.customer_id
  WHERE c.credit_card_num = 123456789;
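
The two-step plan described above can be sketched in plain Python. This is not how Trino is implemented, just an in-memory illustration of the idea; the sample rows, the extra columns, and the function name join_with_id_filter are all made up for the example.

```python
# Hypothetical stand-ins for the small MySQL table and the large S3 table.
customers = [
    {"id": 1, "name": "Ada", "credit_card_num": 123456789},
    {"id": 2, "name": "Bob", "credit_card_num": 987654321},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 2, "total": 40.0},
    {"order_id": 12, "customer_id": 1, "total": 15.5},
]


def join_with_id_filter(customers, orders, card_num):
    """Join customers to orders, filtering the large side by ids from the small side."""
    # Step 1: run the selective predicate against the small table and
    # keep the matching customers, keyed by id.
    matching = {c["id"]: c for c in customers if c["credit_card_num"] == card_num}
    # Step 2: scan the large table, keeping only rows whose customer_id
    # is in that id set, and merge each surviving row with its customer.
    return [
        {**matching[o["customer_id"]], **o}
        for o in orders
        if o["customer_id"] in matching
    ]


rows = join_with_id_filter(customers, orders, 123456789)
print([r["order_id"] for r in rows])
```

The point is that the id set from step 1 is tiny, so most of the big table can be discarded during the scan instead of being shipped into the join.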