Hacker News
by sixhobbits 2179 days ago
Oh, I wrote this :) I submitted it last week but it didn't get much attention then.

Happy to answer questions as far as possible. I have used Pandas extensively but I don't have deep experience with all of these libraries so I learnt a lot while summarising them.

If you know more than I do and I made any mistakes, let me know and I'll get them corrected.

4 comments

What are your thoughts on AWS Glue/Spark? We're starting to have problems with data frames that won't fit into memory anymore on 32GB clusters, and upgrading to the next option, a 64GB cluster, is expensive. We plan to migrate to Glue as a long-term solution, but I think we need a short-term fix for the issue while the migration takes place.

Thanks for the article. Before it, I only knew of Dask as a real alternative.

P.S. I just remembered that I wanted to try Pandarallel as well. Do you have any insight on this library? Thanks!

Not the OP, but moving to sparse matrices is probably going to give you the most bang for your buck. I would strongly suspect that those huge dataframes could be encoded sparsely in a much more efficient format.
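
To make the sparse suggestion concrete, here is a minimal sketch using pandas' own sparse dtype. The data is made up (a frame where roughly 95% of the cells are zero), and whether your frames are anywhere near that sparse is an assumption you would have to check first.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a mostly-zero frame, e.g. one-hot or count features.
rng = np.random.default_rng(0)
values = rng.random((10_000, 20))
values[values < 0.95] = 0.0  # roughly 95% of the cells become zero
dense = pd.DataFrame(values, columns=[f"f{i}" for i in range(20)])

# Store only the non-zero cells; zero becomes the implicit fill value.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

dense_mb = dense.memory_usage(deep=True).sum() / 1e6
sparse_mb = sparse.memory_usage(deep=True).sum() / 1e6
print(f"dense:  {dense_mb:.2f} MB")
print(f"sparse: {sparse_mb:.2f} MB (density {sparse.sparse.density:.2%})")
```

The saving scales with how many cells equal the fill value, so a frame that is mostly distinct non-zero numbers will not shrink this way. Not every pandas operation keeps the sparse dtype either, so it is worth measuring memory again after each transformation.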

To be fair, that's one of the reasons that Spark ML stuff works quite well. Be warned though, estimating how long a Spark job will take/how much resources it will need is a dark, dark art.

Be very, very sure that you understand how expensive Glue can get, especially with suboptimal code. I have seen bills 10x those of the same code running on EMR Spark clusters.
With Ray not having released a 1.0.0 version yet, does that give you any pause about adopting it for a professional project? In the article, you've given it an A for maturity, but the criteria didn't include versioning.

I've worked professionally with data scientists, and we've used both Dask and Ray with some success. Scaling pandas will be an issue for a long time to come with a lot of data science code being written in Python with Pandas.

That's such a tricky one, because the version number doesn't necessarily indicate the maturity.

For example, as of the beginning of this year, the latest version of Pandas was 0.25. (It jumped to 1.0 in late January.) This despite it having been a core part of the standard professional toolset for years and years now.

Do you also have experience working with SQL databases? If so, how do they compare to Pandas in terms of performance? (with or without these extensions)
It Depends (tm). I think SQL is one of the most underrated and underused languages and can often significantly out-perform Python for basic operations such as filtering and pivoting data.

That said, it's hard to keep SQL readable when doing more complicated data analysis, and you'll probably want the flexibility of Python the moment you start to do anything more custom.
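
As a rough illustration of the "filtering and pivoting" point, here is a small sketch that does both in one SQL query, using Python's built-in sqlite3 module. The `sales` table, its columns, and the values are all hypothetical, and it makes no claim about relative performance.

```python
import sqlite3

# Hypothetical sales table, kept in memory for the sake of the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "Q1", 100.0),
        ("north", "Q2", 150.0),
        ("south", "Q1", 80.0),
        ("south", "Q2", 0.5),
        ("south", "Q2", 120.0),
    ],
)

# Filter out tiny rows, then pivot quarters into columns with conditional sums.
query = """
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    WHERE amount >= 1
    GROUP BY region
    ORDER BY region
"""
rows = conn.execute(query).fetchall()
for region, q1, q2 in rows:
    print(region, q1, q2)
conn.close()
```

The database does the filtering and aggregation here, so only the small pivoted result has to be loaded into Python. The readability cost is also visible: every new quarter needs another `CASE` column.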

Therefore SQLAlchemy.
I work with SQLAlchemy a _lot_. Love it. But it's not intuitive at all, especially for the majority of folks with a DS background.
What do you think about Spark and PySpark?
I'm going to give you my slightly biased and annoyed answer. It seems like people who use Python tend to look down on Spark as "too complicated" because it's written in Scala. I come from a Scala background and now feel forced into using Python for my data work because of the momentum it has. I am still amazed at how quickly simple requests, like using a different image or attaching some jars, can make Python people go "whoa, that's complicated, how can anyone like Spark?" Personally I love Spark (for all its quirks), and I think the Spark DataFrame is much more mature than pandas in many ways, not least for the sanity that type-driven programming brings to the table. I'm kind of sad that I'm probably going to have to use Python for the rest of my career, because it causes so many fires and there is a real tendency to kick things down the road. The community just generally strikes me as very impatient.
Spark is really, really good. For a lot of data scientists, though, it's a massive leap from the Python/R model of "play around with a data.frame till I have a model, then wrap it up in a script", which causes problems.

Spark is ace because it has a SQL API available across languages, which makes ETL much more effective, and it ships ML models (though I've always been sort of suspicious about their maturity).

tl;dr - demonstrate the speed of running regressions in Spark, and many (most) data scientists will invest the time in learning the tool.