Ask HN: Is PySpark a Dead-End?

9 points by passer_byer 1657 days ago

I am contracted by a major financial services firm to refactor an analytical model used for revenue forecasting to PySpark executing on an AWS EMR cluster.

The project's current status is documented[0].

The client's team responsible for operationalization was successful in refactoring another analytical model into Python/pandas. The current model execution time for a 5-year scenario is ~17 hours. Most of that time is spent executing poorly crafted Oracle SQL queries drawing millions of rows into the analytical run-time for sorting, aggregation, discarding, merging, and splitting tasks.

In order to constrain this execution time, the final input is a sample of ~1.8M rows from a loan portfolio of ~81M records.

The client is concerned about performance and believes PySpark is the preferred target language.

I have been on this project for just one month, but I previously contracted at the same firm on a six-month project to refactor another model into Python/pandas. That project was successful, mainly due to the team leader's rigor in meeting milestones and ability to remove blockers for the team.

I recently discussed these projects with @Travis Oliphant, who had some interesting ideas on Python-based frameworks for overcoming the issues of processing out-of-core dataframes. We discussed the frameworks Dask[1], Coiled.io, commercial Dask support[2], Ray[3], Modin, commercial support for Ray[4].

Others discussed were Databricks[5], bodo.ai[6], Voltron Data[7], and AtScale[8]. On Reddit, the commentary for Snowflake was very positive[9].

Easing maintenance burdens to keep the model in production and devising new scenarios (e.g. Covid-19 effects on forbearance requests) are requirements. Its shelf-life is years, making maintainability a major consideration.

What have others experienced in scaling out for teams familiar with Python/pandas for feature engineering tasks?

Is PySpark a dead-end library in the Python ecosystem?

[0] https://www.pythonforsasusers.com/project_summary/current_project_status.html

[1] https://dask.org/

[2] https://coiled.io/

[3] https://docs.ray.io/en/ray-0.4.0/pandas_on_ray.html

[4] https://modin.readthedocs.io/en/stable/

[5] https://docs.databricks.com/languages/pandas-spark.html (which points to Apache's Pandas API on Spark)

[6] https://bodo.ai/

[7] https://wesmckinney.com/blog/from-ursa-to-voltrondata/

[8] https://www.atscale.com/autonomous-data-engineering/

[9] https://www.reddit.com/r/dataengineering/comments/r893rw/why_is_snowflake_so_popular/

7 comments

dagw 1657 days ago

> Most of that time is spent executing poorly crafted Oracle SQL queries

Start by looking here. As much as we love to bag on Oracle, it is at its core a really fast and capable database. I don't know what you are doing, but doing anything with only 81M records shouldn't take 17 hours. Profile your SQL, rewrite it, if necessary bring in an Oracle SQL expert, and I'm pretty sure you will find some easy wins just here. Perhaps even enough to solve your performance problems. If you're doing relational database type work, it's hard to beat a relational database.

shoo 1657 days ago

It's pretty hard to give helpful advice without clearly understanding the existing situation and what the actual bottlenecks are.

E.g. maybe 15 of the 17 hours of running time is because the database is doing sequential scans of some tables as some essential indices haven't been defined. Or maybe the indices are defined but the queries need to be written to take advantage of them. Or maybe the queries are blazing fast but the python scripts are taking it upon themselves to perform outer joins in very slow pure python code rather than just getting the database engine to do it. Or maybe all the queries are happening implicitly through SQLAlchemy ORM and the entire analysis is a fractal mess of lazy n+1 select antipattern OO nonsense, and most of the running time is actually network latency between the machine where the python sits and the machine where the database lives. Maybe 4 of the 17 hours of running time is due to compute heavy hot loops in pure python code that can be sped up 1000x if someone is willing to roll up their sleeves and spend a week rewriting as C / C++ / Cython code that lets the CPU loose to crunch numbers in arrays without allocating or hashing or reference counting or waiting for the GIL. Or maybe the entire thing is relatively well engineered, given the physics of the computations involved, and 17 hours is pretty reasonable!
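To make the "n+1 select" case concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for the real database, and the `loans` and `payments` tables are hypothetical, not taken from the project. Both functions return the same totals; the first issues one query per loan, the second issues a single JOIN.

```python
import sqlite3

# Hypothetical schema: sqlite3 stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loans (loan_id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE payments (loan_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO loans VALUES (?, ?)",
                 [(i, 1000.0 * i) for i in range(1, 6)])
conn.executemany("INSERT INTO payments VALUES (?, ?)",
                 [(i, 10.0 * k) for i in range(1, 6) for k in range(1, 4)])


def totals_n_plus_one(conn):
    """One query for the loans, then one more query per loan (N+1 round trips)."""
    queries = 1
    totals = {}
    for (loan_id,) in conn.execute("SELECT loan_id FROM loans").fetchall():
        row = conn.execute(
            "SELECT SUM(amount) FROM payments WHERE loan_id = ?", (loan_id,)
        ).fetchone()
        totals[loan_id] = row[0]
        queries += 1
    return totals, queries


def totals_single_query(conn):
    """The same result from one JOIN + GROUP BY (a single round trip)."""
    rows = conn.execute("""
        SELECT l.loan_id, SUM(p.amount)
        FROM loans AS l JOIN payments AS p ON p.loan_id = l.loan_id
        GROUP BY l.loan_id
    """).fetchall()
    return dict(rows), 1


slow, n_slow = totals_n_plus_one(conn)
fast, n_fast = totals_single_query(conn)
print(n_slow, n_fast, slow == fast)  # prints: 6 1 True
```

With an in-memory database the difference is invisible, but over a network each of those extra queries pays a round trip, which is the latency cost described above.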

If no one knows yet what the bottlenecks are, maybe spend a few days profiling stuff and comparing it to theoretical estimates of the throughput or processing speed that the hardware is capable of, assuming the system was making optimal use of the hardware, and try to figure it out. It'd be a bit unfortunate to not understand the bottlenecks and migrate everything to pyspark and end up with something that runs slower than the original version.
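One low-effort way to start that profiling, sketched here with only the standard library: wrap each pipeline stage in a timer and look at where the wall-clock time goes. The stage names and the toy workload are hypothetical placeholders for the real fetch and transform steps.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage.
stage_seconds = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the wall-clock time of the enclosed block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


def report(timings):
    """Return (stage, seconds, share of total), slowest stage first."""
    total = sum(timings.values()) or 1.0
    return [(name, secs, secs / total)
            for name, secs in sorted(timings.items(), key=lambda kv: -kv[1])]


# Hypothetical stand-ins for the real pipeline stages.
with timed("fetch"):
    rows = [(i, i % 7) for i in range(200_000)]   # pretend: database fetch
with timed("transform"):
    buckets = defaultdict(int)
    for value, key in rows:                        # pretend: pure-Python aggregation
        buckets[key] += value

for name, secs, share in report(stage_seconds):
    print(f"{name:<10} {secs:8.4f}s  {share:6.1%}")
```

Once the slow stage is known, `cProfile` (also in the standard library) can break that stage down by function.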

NumberCruncher 1657 days ago

> Most of that time is spent executing poorly crafted Oracle SQL queries drawing millions of rows into the analytical run-time for sorting, aggregation, discarding, merging, and splitting tasks.

I always try to follow the rule-of-thumb of "if it can be done in the analytical DB, it should be done in the analytical DB". In my experience Oracle is pretty well suited for all of the "sorting, aggregation, discarding, merging, and splitting tasks". With proper indexing/partitioning, processing 81M records shouldn't take 17 hours. Pulling all the data into python and then fighting the lack of (out-of-the-box) multi-threaded data processing capabilities seems to be more part of the problem than of the solution.
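A small sketch of that rule-of-thumb, again with sqlite3 standing in for the analytical database and a hypothetical `loans` table. Both paths compute the same per-region totals; the difference is how many rows have to leave the database.

```python
import sqlite3
from collections import defaultdict

# sqlite3 stands in for the analytical database; the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (region TEXT, status TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [("east" if i % 2 else "west", "open" if i % 3 else "closed", float(i))
     for i in range(1, 1001)],
)

# Client-side: every row crosses the wire, then Python filters and sums.
all_rows = conn.execute("SELECT region, status, balance FROM loans").fetchall()
client_totals = defaultdict(float)
for region, status, balance in all_rows:
    if status == "open":
        client_totals[region] += balance

# Database-side: filter and aggregate in SQL; one row per region comes back.
db_rows = conn.execute("""
    SELECT region, SUM(balance)
    FROM loans
    WHERE status = 'open'
    GROUP BY region
""").fetchall()
db_totals = dict(db_rows)

print(len(all_rows), "rows fetched client-side;",
      len(db_rows), "rows fetched after aggregating in the database")
```

At 1,000 rows it makes no practical difference; at tens of millions of rows the transfer and the Python loop tend to dominate.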

In my current job if I have to do some analytical heavy lifting I just write the data to AWS S3 (parquet) and read the query-results back through AWS Athena (Presto) into python.

vanusa 1657 days ago

> Most of that time is spent executing poorly crafted Oracle SQL queries drawing millions of rows into the analytical run-time for sorting, aggregation, discarding, merging, and splitting tasks.

Depending on what goes on in between the lines of all that "sorting, aggregation, discarding, merging, and splitting" -- the core guts of what you're doing might quite easily be done within Postgres.

And 81 million rows? That will fit on your laptop, easy (especially if many are discarded in the early stages of processing).

Or it perhaps might not fit so easily. But the basic point I'm trying to make here is: don't be afraid of simplicity. All other unknowns being equal, it's as good a starting point as any.
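Whether it fits is quick to estimate on the back of an envelope. The row counts below come from the original post; the column counts are pure assumptions, since the table width is not given, and the estimate covers fixed-width numeric values only (strings and pandas object columns would take more).

```python
def estimate_gib(rows, numeric_columns, bytes_per_value=8):
    """Rough in-memory size of a table of fixed-width numeric columns, in GiB."""
    return rows * numeric_columns * bytes_per_value / 2**30


ROWS_FULL = 81_000_000    # full portfolio, from the post
ROWS_SAMPLE = 1_800_000   # sampled input, from the post

# The column counts are assumptions; the post does not give the table width.
for cols in (10, 50, 200):
    print(f"{cols:>3} columns: full ~ {estimate_gib(ROWS_FULL, cols):6.1f} GiB, "
          f"sample ~ {estimate_gib(ROWS_SAMPLE, cols):5.2f} GiB")
```

A narrow table lands in laptop territory, while a very wide one does not, which is why the answer depends on how much is discarded early.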

passer_byer 1656 days ago

The observations posted here are very useful, thank you for such detailed responses.

kalu 1657 days ago

Spark is not dead. Not even close.

vanusa 1656 days ago

There's the issue of PySpark, as opposed to Spark itself.

Whose demise I haven't yet heard specific reports of, but then again -- maybe the blush has come off a bit? That's what the original poster was trying to ferret out.

apohn 1656 days ago

>There's the issue of PySpark, as opposed to Spark itself.

I'm a little confused by this comment.

PySpark is the Python interface you use for Spark. IME, PySpark is actually a really nice API. Your other options are Scala or Java. I think there's an R interface as well, but I think that lags behind PySpark.

Saying that PySpark is dying but Spark is not is a very contradictory thing to say. If you look over the universe of Spark users, I'll bet there are more Python users than Java or Scala. There would have to be a very big shift in the Spark userbase for people to decide that PySpark is going to be deprecated or will start lagging behind other interfaces.

Despite the fact that Spark is based on Scala, I could see somebody slowing the development of the public Scala API before PySpark if some twist of fate required somebody to make that decision.

apohn 1656 days ago

So first things first, PySpark is not dead, dying, or a dead-end. When Databricks and Spark die, then we'll see the end of PySpark. Adoption of Spark and Databricks is growing. I actually see Arrow errors in PySpark jobs in Databricks, so Arrow/Ursa/VoltronData is already being used in the guts of Databricks/Spark.

We actually see a lot of what you are describing at my current company. All our data is stored in a database that was once very popular, but is old now and not cloud based.

We have a couple of challenges.

1) The data engineering team typically loads data (typically 100s of millions of rows) into the database before anybody really decides how it will be used. Typically we (Data Scientists) only get access to views. The views and underlying tables are not indexed or partitioned as we need, so almost any query takes forever to run since it almost always results in a full table scan of the raw source table.

2) The database team is caught in a budgeting trap. There is a long-term migration to another cloud-based database, so both people and financial resources are focused on that. The end result is that no further scaling of the current database provider is possible, which means that every complex query we make on this database creates more load, which makes every other query run slower. This database has a lot of users, tables, and queries, which means that a lot of the available people spend their time just making sure the database maintains some basic standard of performance.

3) Since any complex query (e.g. even a basic date based aggregation) increases the load on the server, user queries should only fetch and filter data. Aggregations and anything complex (e.g. a string operation) should be done downstream.

Based on these constraints, we basically have adopted the following.

1) Once the users (e.g. us) have defined a use case for the data, the Data Engineers/Data Platform team will index and partition the data properly. This typically results in a huge performance increase. Recently we had one view that went from 5+ hours for a basic query to less than 5 minutes.

2) All aggregations and data transformations are done in Databricks/PySpark.

3) Typically, after 2), if we convert the Spark DataFrame to a pandas DataFrame, the dataset is small enough to run anywhere.
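The effect of step 1) can be sketched on a small scale. This is only an illustration: sqlite3 stands in for the production database, the `events` table and index name are hypothetical, and partitioning is not shown. The query plan switches from a full table scan to an index lookup once the filter column is indexed.

```python
import sqlite3

# sqlite3 stands in for the production database; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, account_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(f"2021-{1 + i % 12:02d}-01", i, float(i)) for i in range(10_000)],
)

QUERY = "SELECT account_id, amount FROM events WHERE event_date = '2021-03-01'"


def plan(conn, sql):
    """Return the human-readable steps SQLite reports for a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


before = plan(conn, QUERY)   # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_events_date ON events (event_date)")
after = plan(conn, QUERY)    # the index on event_date is now used

print("before:", before)
print("after: ", after)
```

The exact plan wording varies by SQLite version, and other databases have their own equivalents of `EXPLAIN`, but the before/after comparison is the same idea.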

One of the things to keep in mind is the people supporting it after your contract is over. I think Databricks/PySpark and Dask are fairly common and well known in the data community. Snowflake can probably help speed up the SQL queries once the data is moved into that, but I don't think it can cover some of the analytical things you can do in Spark.

Arrow/VoltronData doesn't seem like a fit for the use case you are describing unless you have a bunch of developers trying to develop their own data engine with Arrow behind it.
