| HN Mirror


by marklit 1907 days ago

I've been using it for years with clients. It tends to sit between the source data and a final destination. Most data platforms are trying to take data from tens if not hundreds of sources and unify them. The variety of formats and sources is endless, and I've often had to resort to Python-based Airflow DAGs that collect data by date and store a cleaned-up version of what was collected as a timestamped Parquet (PQ) file for Presto. Presto is then great for large-scale transformations across the data and for ad-hoc exploring.
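To make the collect-by-date pattern concrete, here is a minimal sketch in plain Python. It is not my actual DAG code: the helper names (`daily_windows`, `clean`, `output_path`), the `id` field, and the `dt=YYYY-MM-DD` path layout are all assumptions for illustration, and the Parquet write itself is left out because it depends on whichever library you use.

```python
from datetime import date, timedelta


def daily_windows(start: date, end: date):
    """Yield each collection date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def clean(records):
    """Drop rows with no 'id' (hypothetical key) and strip string values."""
    cleaned = []
    for row in records:
        if row.get("id") is None:
            continue
        cleaned.append(
            {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        )
    return cleaned


def output_path(root: str, source: str, day: date) -> str:
    """Build a date-stamped Parquet path for one source and one day.

    The dt=... directory naming is an assumed convention, not a requirement.
    """
    return f"{root.rstrip('/')}/{source}/dt={day.isoformat()}/part-0.parquet"


if __name__ == "__main__":
    # Hypothetical source name and root; one path per collection date.
    for day in daily_windows(date(2020, 1, 30), date(2020, 2, 1)):
        print(output_path("/lake", "vendor_a", day))
```

In a scheduler like Airflow, each date would typically map to one task run, so a failed day can be re-run without touching the others. The cleaned rows would then be written to the returned path with a Parquet writer of your choice.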

Presto can be pointed at a lot of data stores, but few external data providers offer ODBC-like interfaces. It seems to be either APIs or static file dumps for the most part, so Presto isn't going to be able to pull from these datasets on its own.

In terms of security and maintenance, products like Redshift are easier to train traditional data warehouse people up on. The service is relatively cheap and has a nice UI for scaling.

The data world is extremely fragmented. Once firms have something in place, changing it is going to be a struggle. Existing staff often gatekeep and defend whatever technology they've staked their careers on. Once there are a lot of reports set up on any one data source, migrating it could end up becoming a prolonged project, which can be hard to sell.

It was quoted that Snowflake had a $1M/day budget for sales and marketing. I'm not aware of any Presto consultancy spending that sort of money. Amazon does have Athena, but they have countless other offerings, which muddies the water.

2 comments

vizually 1907 days ago

@marklit,

Thanks for your insights. Great points on inertia and the lack of big sales budgets. Agreed also on HDFS/data lake use cases with PQ files. However, regarding querying an RDBMS, are you saying that Presto requires ODBC/JDBC connectivity? Does Presto have the ability to connect with "native DB" drivers?


marklit 1906 days ago

I can't comment on native drivers. When I said ODBC-like interfaces, I was trying to use a catch-all phrase for sources that look like a typical server-based data store (e.g. PostgreSQL, Hive, Kafka).


vizually 1907 days ago

Also, I looked at Snowflake's most recent quarterly results. They spent over $1.7 million/day on sales & marketing.
