| HN Mirror



	by deshpand 1708 days ago
	Spark may be a mature solution for truly big data (1TB and more) handled in a SQL-like fashion. But I constantly see it being misused, even with datasets as small as 5GB. Maybe the valuation of the company reflects this 'growth' and 'adoption'. And data locality is a thing. You can't read terabytes from object storage (over HTTP). The batch-oriented MapReduce model is not conducive to the many ML algorithms where state needs to be passed around.

2 comments

dkarl 1708 days ago

> But I constantly see it being misused, even with datasets as small as 5GB.

I witness frequent desire from engineers to use it because they see it as a competency/expertise that will unlock jobs at bigger, more lucrative companies. Also, startups kind of beg for it because the business keeps asking, "Will this tech scale 100x?" If you ask for a solution that scales 100x, and your problems aren't well-defined yet, and by the way it would be nice if it does streaming, too, since we might need that someday, your engineers are going to err on the side of using a big, complete solution.


joelschw 1708 days ago

I'd take a lazy, typed data manipulation language over pandas all day


deshpand 1708 days ago

If you can stay away from Python/pandas completely and get all your work done in typed languages like Scala/Java, that's good. But a lot of scientists and non-CS folks are using Python/R. They need to avoid the mishmash of bringing in Spark and SQL for some bits and then going back to Python/R. Native Python, especially, offers mature ways to handle data in the hundreds of GB. Learning to incorporate Dask and Numba is going to be far easier than teaching all these folks distributed programming and spinning up Spark clusters, which is unnecessary in many cases.
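
To make the "no cluster needed" point concrete, here is a minimal sketch of the out-of-core idea: stream the data in fixed-size chunks and keep only running aggregates in memory. It uses just the standard library rather than Dask's actual API, and the names `iter_chunks`, `mean_by_key`, and the sample `sensor`/`reading` columns are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of up to chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def mean_by_key(rows, key, value, chunk_size=2):
    """Compute per-key means while holding only one chunk in memory."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chunk in iter_chunks(rows, chunk_size):
        for row in chunk:
            sums[row[key]] += float(row[value])
            counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical sample data standing in for a file too large for memory.
SAMPLE = "sensor,reading\na,1.0\nb,4.0\na,3.0\nb,6.0\na,5.0\n"

result = mean_by_key(csv.DictReader(io.StringIO(SAMPLE)), "sensor", "reading")
print(result)
```

With a real file you would pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample; memory use then depends on `chunk_size` and the number of distinct keys, not on the file size.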


secondaryacct 1708 days ago

Depends on what for. I'm a Java guy at heart, but honestly, for quick little analyses pandas is way faster, and I can barely code my way out of a paper bag in Python.

I even hate Python and would never use it ... but I can't find anything better than pandas for my crazy large time series and the always bespoke questions from the biz.
