by sixhobbits 2179 days ago

Oh, I wrote this :) I submitted it last week but it didn't get much attention then.

Happy to answer questions as far as possible. I have used Pandas extensively but I don't have deep experience with all of these libraries so I learnt a lot while summarising them.

If you know more than I do and I made any mistakes, let me know and I'll get them corrected.

4 comments

iwebdevfromhome 2179 days ago

What are your thoughts on AWS Glue/Spark? We’re starting to have problems with data frames that won’t fit into memory anymore on 32 GB clusters, and upgrading to the next option, a 64 GB cluster, is expensive. We plan to migrate to Glue as a long-term solution, but I think we need a short-term solution to the issue while the migration takes place.

Thanks for the article, before it I only knew of Dask as a real alternative.

P.S. I just remembered that I wanted to try Pandarallel as well. Do you have any insight on this library? Thanks!

disgruntledphd2 2179 days ago

Not the OP, but moving to sparse matrices is probably going to give you the most bang for your buck. I would strongly suspect that those huge dataframes could be encoded sparsely in a much more efficient format.

To be fair, that's one of the reasons that Spark ML stuff works quite well. Be warned though, estimating how long a Spark job will take or how many resources it will need is a dark, dark art.
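As a rough illustration of the sparse-encoding suggestion above, here is a minimal sketch using pandas' own sparse dtype. The data is made up (mostly zeros), and how much you save depends entirely on how sparse your real frames are, so treat this as a way to measure rather than a promise.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows x 20 columns where roughly 99% of values are zero.
rng = np.random.default_rng(0)
values = rng.random((100_000, 20))
values[rng.random(values.shape) > 0.01] = 0.0
dense = pd.DataFrame(values, columns=[f"f{i}" for i in range(20)])

# Store only the non-zero entries; zero becomes the implicit fill value.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# Compare the memory footprint of the two representations.
dense_mb = dense.memory_usage(deep=True).sum() / 1e6
sparse_mb = sparse.memory_usage(deep=True).sum() / 1e6
density = sparse.sparse.density  # fraction of values actually stored

print(f"dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.1f} MB, density: {density:.3f}")
```

Not every pandas operation preserves the sparse dtype, so it is worth checking the dtypes again after each transformation step.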

ramraj07 2179 days ago

Be very, very sure that you understand how expensive Glue can get, especially with suboptimal code. I have seen bills 10x those of the same code running on EMR Spark clusters.

laactech 2179 days ago

With Ray not having released a 1.0.0 version yet, does that give you any pause about adopting it for a professional project? In the article, you've given it an A for maturity, but the criteria didn't include versioning.

I've worked professionally with data scientists, and we've used both Dask and Ray with some success. Scaling pandas will be an issue for a long time to come with a lot of data science code being written in Python with Pandas.

mumblemumble 2179 days ago

That's such a tricky one, because the version number doesn't necessarily indicate the maturity.

For example, as of the beginning of this year, the latest version of Pandas was 0.25. (It jumped to 1.0 in late January.) This despite it having been a core part of the standard professional toolset for years and years now.

erezsh 2179 days ago

Do you also have experience working with SQL databases? If so, how do they compare to Pandas in terms of performance? (with or without these extensions)

sixhobbits 2179 days ago

It Depends (tm). I think SQL is one of the most underrated and underused languages and can often significantly out-perform Python for basic operations such as filtering and pivoting data.

That said, it's hard to keep SQL readable when doing more complicated data analysis, and you'll probably want the flexibility of Python the moment you start to do anything more custom.
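To make the filtering and pivoting point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the rows are all hypothetical, and SQLite stands in for whatever database you actually use; it shows the shape of the query, not a benchmark.

```python
import sqlite3

# Hypothetical sales data; table and column names are made up for illustration.
rows = [
    ("north", "2020-01", 120.0),
    ("north", "2020-02", 80.0),
    ("south", "2020-01", 200.0),
    ("south", "2020-02", 50.0),
    ("south", "2020-02", 25.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Filter and pivot in one statement: one row per region, one column per month.
query = """
    SELECT region,
           SUM(CASE WHEN month = '2020-01' THEN amount ELSE 0 END) AS jan,
           SUM(CASE WHEN month = '2020-02' THEN amount ELSE 0 END) AS feb
    FROM sales
    WHERE amount >= 50
    GROUP BY region
    ORDER BY region
"""
pivot = conn.execute(query).fetchall()
conn.close()

print(pivot)
```

The hard-coded `CASE` column per month is also a small example of the readability problem: it is fine for two months and gets unwieldy quickly, which is roughly where Python starts to look attractive again.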

ris 2179 days ago

Therefore SQLAlchemy.

ramraj07 2179 days ago

I work with SQLAlchemy a _lot_. Love it. But it's not intuitive at all, especially for the majority of folks with a DS background.

vasili111 2179 days ago

What do you think about Spark and PySpark?

nautilus12 2179 days ago

I'm going to give you my slightly biased and annoyed answer. It seems like people who use Python tend to look down on Spark as "too complicated" because it is written in Scala. I come from a Scala background and now feel forced into using Python for my data work because of the momentum it has. I am still amazed at how quickly simple requests, like using a different image or attaching some jars, can make Python people say "whoa, that's complicated, how can anyone like Spark?" Personally I love Spark (for all its quirks), and I think the Spark dataframe is much more mature in many ways than pandas, not least because of the sanity that type-driven programming brings to the table. I'm kind of sad that I'm probably going to have to use Python for the rest of my career, because it causes so many fires and comes with a real strong tendency to kick things down the road. The community just generally strikes me as very impatient.

disgruntledphd2 2179 days ago

Spark is really, really good. For a lot of data scientists, though, it's a massive leap from the Python/R model of "play around with a data.frame till I have a model, then wrap it up in a script", which causes problems.

Spark is ace because it has an SQL API that is available across languages, which makes ETL much more effective, and it has ML models too (though I've always been sort of suspicious about their maturity).

tl;dr - demonstrate the speed of running regressions in Spark, and many (most) data scientists will invest the time in learning the tool.
