Hacker News new | ask | show | jobs
Why hasn't Presto become industry standard?
14 points by vizually 1907 days ago
if Presto SQL can query every database under the Sun (from RDBMS, Hadoop HDFS, Kafka and ElasticSearch) why it has not become industry standard yet? What's the catch? What are its limitation?
5 comments

I've been using it for years with clients. It tends to sit in between source data and a final destination. Most data platforms are trying to take data from 10s if not 100s of sources and unify them. The variety of formats and sources is endless and I've often had to resort to using Python-based Airflow DAGs to collect data based on dates and store a cleaned up version of what was collected as a timestamped PQ file for Presto. Presto is then great for large scale transformations across the data and ad-hoc exploring.

Presto can be pointed at a lot of data stores but few external data providers offer ODBC-like interfaces. It seems to be either APIs or static file dumps for the most part. So Presto isn't going to be able to pull from these datasets alone.

In terms of security and maintenance, products like Redshift are easier to train traditional data warehouse people up on. The service is relatively cheap and has a nice UI for scaling.

The data world is extremely fragmented. Once firms have something in place changing it is going to be a struggle. Existing staff often gate keep and defend whatever technology they've staked their careers on. Once there are a lot of reports setup with any on data source migrating it could end up becoming a prolonged project which can be hard to sell.

It was quoted Snowflake had a $1M / day budget for sales and marketing. I'm not aware of any Presto consultancy spending that sort of money. Amazon does have Athena but they have countless other offerings which muddies the water.

@marklit,

Thanks for your insights. Great points on inertia and lack of big sales budgets. Agreed also on HDFC/data lake use cases with PQ files. However, regarding querying RDBMS, are you saying that Presto requires in ODBC/JDBC connectivity? Does Presto have an ability to connect with "native DB" drivers?

I can't comment on native drivers. When I said ODBC-like interfaces I was trying to use a catch-all phrase for sources that look like a typical server-based data store (i.e. PostgreSQL, Hive, Kafka).
also, looked at Snowflake's most recent quarterly results. They spent over $1.7 million/day in sales & marketing.
Lots of general reasons, inertia, etc. Often companies stick to 1-2 preferred technologies and adding Presto isn't seen as a gain (even though it helped Facebook quite a bit). I also suspect Amazon re-packing it as Athena reduced adoption to some extent.

If looking at Presto (now Trino), the main thing to keep in mind is that you inherit the limitations of the underlying data store.

Its best when the underlying store (+ the Db adaptor implementation) lets you parallelize work and keep each node busy, and avoid processing data unnecessarily. Hive/S3 columnar format data works great for this (IIRC this was a major early use case). Other sources like RDBMS will have natural limitations. Kafka has its own issues since each query generally means re-scanning a topic, etc.

I see the data bridges as most useful as a way to bring data into the native/optimal format. Then do the heavy lift work in Presto.

@gwittel, appreciate you sharing your insights. Will you be able to elaborate on "RDBMS will have natural limitations"? Can you provide a specific example?
Presto gets most of its speed from parallelizing work and taking advantage of columnar formats when it can.

In the case of a RDBMS can you get performance gains if you try to parallelize a query from many clients? It will depend on the DB adapter and query. In a random case, if you slice a query into N shards it’s not necessarily going to go faster. It’s still the same DB underneath bound by the same HW performance boundaries.

Yeah this is a common misconception. Trino and Presto were aimed to replace and speed up the Hive engine.

As you say gwittel, adding Trino to an RDBMS itself won't speed things up. However, if you have operational data sitting in that RDBMS and data sitting in a data lake somewhere on like S3, then you can quickly join those datasets together.

Trino does its best to take advantage of any existing indexes that the RDBMS has by doing a pushdown but won't return that data any faster than the underlying database could. But it's the joining with other data sources data sets that makes the RDBMS connector worthwhile.

If you have a 1GB customer dataset in mysql and a 100TB dataset in s3 of all your orders, then Trino will first run a quick query against your mysql database, get a list of customer ids that meet the query, and then will use that list to filter the order id.

SELECT * FROM mysql.db_name.customer AS c JOIN s3.db_name.orders AS o ON c.id = o.customer_id WHERE c.credit_card_num = 123456789;

This might be outdated info:

  * Two different Prestos, prestodb and prestosql for maximum confusion. (I think one renamed)
  * Making Controller highly available by default is hard
  * Autoscaling workers is not simple
  * Code very dependent on its own webframework that tries to do everything and lacks docs.
  * Resource planner for multiple queries is lacking
  * Worker configuration takes a lot of skill
All of these could be solved, but in most cases you can find other solutions where you get a simpler set of problems.
tqh, Sounds like you hit the nail on the head with your answer. Thank you very much for your insights.
Hey I'm a contributor to prestosql (the one that renamed to Trino). I'll provide a few of my opinions into some of these from the vantage point of our project.

* It's definitely confusing but pretty common in open source projects to see the original creators split off when corporate oversight interferes with the OS governance model. (https://www.computerworld.com/article/2746627/hudson-devs-vo...). This is especially true when, as the OP mentioned, it's a pretty cool tech and a lot of interest in it. Now that the names are different, it is clearing up a bit. We're hoping in a few years there will be one project standing so that you won't have to choose. I don't have to tell you which one I think it is.

* Active-active HA is not really necessary IMO as Trino is designed for low latency interactive queries in general. It can handle longer running batch queries but it gives up fault tolerance to fail fast and you just resubmit the query vs predecessors like Hive, Spark, etc... that handle ETL and long running batch processes efficiently but this adds complexity to the query to checkpoint the work. I could see the need for an active-passive HA to have on deck during a failure. Setting up your own active-passive HA is as simple as putting two coordinators behind a proxy and pointing your workers to the proxy address. Then you basically have the proxy run health checks and flip over in the event of an outage. Here's the issue to track native HA though https://github.com/trinodb/trino/issues/391.

* I'm not sure why autoscaling is said to be difficult. I think this is why you have kubernetes and docker to manage this type of workload.

* The only reason this is a pain to me is that engineers wanting to join our community and commit have a bit of a learning curve and depends heavily on us mentoring and guiding them on how the REST API works, which we don't mind. However, I agree with this choice from a design perspective for the user. If you want to use Trino, it's better not to be exposed to this implementation detail or mess with how this works. It will likely cause you more pain.

* This has improved in the last two years since we branched from PrestoDB 2019 (https://trino.io/blog/2020/01/01/2019-summary.html) and 2020 (https://trino.io/blog/2021/01/08/2020-review.html).

* Agreed, we are working on what's the better model here: https://github.com/trinodb/trino/discussions/6573

It adds complexity in infrastructure and becomes a separate translation layer. I haven’t run a presto deployment, but have worked adjacent to it, I do think it brings advantages. But a lot of client side ‘whatevers’ connect to a lot of different database/storage technologies already, adding another complex layer to the mix might not be worth it.
Dremio also appears to take a similar approach, but with more advanced caching features / query pushdown. Plus it has Apache Arrow at its heart. I think that would be my choice of solution in this space
@legg0myegg0 thanks for bringing up Dremio. Can Dremio connect with different databases RDBMS, Kafka, DataLake PRC file formats? Is the use case limited to certain data stores? Is the use-case primarily a united query engine (so that the code remains consistent across DB engines) or is the use case query acceleration?