| I am contracted by a major financial services firm to refactor an analytical model used for revenue forecasting to PySpark executing on a AWS EMR cluster. The project's current status is documented[0]. The client's team responsible for operationalization was successful in refactoring another analytical model into Python/pandas. The current model execution time for a 5 year scenario is ~17 hours. Most of that time is
spent executing poorly crafted Oracle SQL queries drawing millions of rows into the analytical run-time for, sorting, aggregation, discarding, merging, and spliting tasks. In order to constrain this exeuction time, final input is a sample of ~1.8M rows from a loan portfolio of ~81M records. The client is concerned about performance and believes PySpark is the preferred target language. I have been on this project for just one month, but I contracted previously at the same firm on a six months to refactor another model into Python/pandas. That project was successful, mainly due to the team leader's rigor for meeting milestones and ability to remove blockers for the team. I recently discussed these projects with @Travis Oliphant who had some interesting ideas on Python-based frameworks to overcome issues for processing out-of-core dataframes. We discussed the frameworks Dask[1], Coiled.io, commercial Dask support[2], Ray[3], Modin, commercial support for Ray[4]. Others discussed were, Databricks[5], bodo.ai[6], Voltron Data[7], and AtScale[8]. On Reddit, the
commentary for Snowflake was very positive[9]. Easing maintenence burdens to keep the model in production and devising new scenarios (e.g. Covid-19 effects on forebarance requests) are requirements. Its shelf-life is years, making maintainability a major consideration. What have others experienced in scaling out for teams familiar with Python/pandas for feature engineering tasks? Is PySpark a dead-end libray in the Python ecosystem? [0] https://www.pythonforsasusers.com/project_summary/current_project_status.html [1] https://dask.org/ [2] https://coiled.io/ [3] https://docs.ray.io/en/ray-0.4.0/pandas_on_ray.html [4] https://modin.readthedocs.io/en/stable/ [5] https://docs.databricks.com/languages/pandas-spark.html (which points to Apache's Pandas API on Spark) [6] https://bodo.ai/ [7] https://wesmckinney.com/blog/from-ursa-to-voltrondata/ [8] https://www.atscale.com/autonomous-data-engineering/ [9] https://www.reddit.com/r/dataengineering/comments/r893rw/why_is_snowflake_so_popular/ |
Start by looking here. As much as we love to bag on Oracle, it is at its core a really fast and capable database. I don't know what you are doing, but doing anything with only 81M records shouldn't take 17 hours. Profile you SQL, rewrite it, if necessary bring in a Oracle SQL expert, and I'm pretty sure you will find some easy wins just here. Perhaps even enough to solve your performance problems. If you're doing relational database type work, it's hard to beat a relational database.