Hacker News
by iwebdevfromhome 2179 days ago
What are your thoughts on AWS Glue/Spark? We're starting to have problems with data frames that no longer fit into memory on 32 GB clusters, and upgrading to the next option, a 64 GB cluster, is expensive. We plan to migrate to Glue as a long-term solution, but I think we need a short-term fix for the issue while the migration takes place.

Thanks for the article. Before reading it, I only knew of Dask as a real alternative.

P.S. I just remembered that I wanted to try Pandarallel as well. Do you have any insight on this library? Thanks!

2 comments

Not the OP, but moving to sparse matrices is probably going to give you the most bang for your buck. I would strongly suspect that those huge dataframes could be encoded sparsely in a much more efficient format.
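For what it's worth, here is a rough sketch of what that could look like in pandas. The data is made up (a hypothetical frame where about 1% of the cells are non-zero), and the savings depend entirely on how sparse your real data is, so treat the numbers it prints as illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows x 20 columns, roughly 1% non-zero cells.
rng = np.random.default_rng(0)
values = rng.random((100_000, 20))
mask = rng.random((100_000, 20)) < 0.01
dense = pd.DataFrame(np.where(mask, values, 0.0))

# Store only the non-zero cells; 0.0 becomes the implicit fill value.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# Compare the in-memory footprint of both representations.
dense_bytes = int(dense.memory_usage(deep=True).sum())
sparse_bytes = int(sparse.memory_usage(deep=True).sum())

print(f"dense:   {dense_bytes / 1e6:.1f} MB")
print(f"sparse:  {sparse_bytes / 1e6:.1f} MB")
print(f"density: {sparse.sparse.density:.3f}")
```

If the density comes out close to 1.0, the sparse encoding will not help and can even cost more, since each stored value also carries an index.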

To be fair, that's one of the reasons the Spark ML stuff works quite well. Be warned, though: estimating how long a Spark job will take or how many resources it will need is a dark, dark art.

Be very, very sure that you understand how expensive Glue can get, especially with suboptimal code. I have seen bills 10x those of the same code running on EMR Spark clusters.