| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by apohn 1656 days ago

So first things first, PySpark is not a dead, dying, or a dead-end. When Databricks and Spark die, then we'll see the end of PySpark. Adoption of Spark and Databricks is growing. I actually see Arrow errors in PySpark jobs in Databricks, so Arrow/Ursa/VoltronData is already being used in the guts of Databricks/Spark.

We actually see a lot of what you are describing at my current company. All our data is stored in a database that was once very popular, but is old now and not cloud based.

We have a couple of challenges.

1) The data engineering team typically loads data (typically 100s of millions of rows) into the database before anybody really decides how it will be used. Typically we (Data Scientists) only get access to views. The views and underlying tables are not indexed or partitioned as we need, so almost any query takes forever to run since it almost always results in a full table scan of the raw source table.

2) The database team is caught in a budgeting trap. There is a long term-migration to another cloud based database, so both people and financial resources are focused on that. The end result is that no further scaling of the current database provider is possible, which means that every complex query we make on this database creates more load, which makes every other query run slower. This database has a lot of users, tables, and queries, which means that a lot of the available people spend their time just making sure the database maintains some basic standard of performance.

3) Since any complex query (e.g. even a basic date based aggregation) increases the load on the server, user queries should only fetch and filter data. Aggregations and anything complex (e.g. a string operation) should be done downstream.

Based on these constraints, we basically have adopted the following.

1) Once the users (e.g. us) have defined a use case for the data, the Data Engineers/Data Platform team will index and partition the data properly. This typically results in a huge performance increase. Recently we had one view that went from 5+ hours for a basic query to less than 5 minutes.

2) All aggregations and data transformations are done in Databricks/PySpark.

3) Typically, after 2) if we convert the Spark Dataframe to a Pandas dataframe, data dataset is small enough to run anywhere.

One of the things to keep in mind is the people supporting it after your contract is over. I think Databricks/PySpark and Dask are fairly common and well known in the data community. Snowflake can probably help speed up the SQL queries once the data is moved into that, but I don't think it can cover some of the analytical things you can do in Spark.

Arrow/VoltronData doesn't seem like a fit for the use case you are describing unless you have bunch of developers trying to develop their own data engine with Arrow behind it.