by wooly_bully 2418 days ago
The "spin up a big data cluster" bit at the beginning seems like either a straw man or just oddly out of touch. Who, when deciding how to process a file on the order of 100GB, would even consider something like that?

Also, SQLite is a first-class citizen in this space. Most if not all languages can easily load data into it, virtually any language used for analysis can easily read from it, and it's file-based, so there's no need to spin up a server. Finding 100GB on disk is much easier than finding 100GB of RAM.
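
For illustration, here is a minimal sketch of that load step using only Python's standard library. It assumes a CSV input with a header row; the function name `load_csv_to_sqlite`, the table name `records`, and the batch size are made up for the example, not taken from the article. Rows are streamed in batches, so memory use stays small regardless of file size.

```python
import csv
import itertools
import sqlite3


def load_csv_to_sqlite(csv_path, db_path, table="records", batch_size=50_000):
    """Stream a CSV file into a SQLite table without loading it all into RAM.

    Returns the number of rows inserted. Every column is stored as TEXT here
    (an assumption for simplicity); a real pipeline would declare proper types.
    """
    conn = sqlite3.connect(db_path)
    try:
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Quote identifiers so unusual column names do not break the SQL.
            cols = ", ".join('"{}" TEXT'.format(c.replace('"', '""')) for c in header)
            conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, cols))
            placeholders = ", ".join("?" for _ in header)
            insert_sql = 'INSERT INTO "{}" VALUES ({})'.format(table, placeholders)
            total = 0
            while True:
                # Pull at most batch_size rows from the file at a time.
                batch = list(itertools.islice(reader, batch_size))
                if not batch:
                    break
                with conn:  # commit one transaction per batch
                    conn.executemany(insert_sql, batch)
                total += len(batch)
        return total
    finally:
        conn.close()
```

The result is a single `.db` file on disk that any SQLite client can open, with no server process involved.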

4 comments

Divide that file size by 10, and you're still in the range where I've had to argue that rolling out Spark is overkill.
People will disbelieve that, but it's absolutely true... I'll never forget the interview I did with an engineer who described an elaborate Hadoop-based solution to some past problem. When I asked him what type of data he was working with, he said, "Here, I'll show you," then whipped out his laptop and showed me a spreadsheet. It wasn't an extract of the data. It was literally a spreadsheet, manageable on a laptop, that he somehow decided needed a Hadoop cluster to process. (Also, who shows data from their current employer to a prospective new employer? Weird but true.)
I had an interesting experience a while back where it came to light that I was working on the same problem as another team in the org (this was a huge multinational), so a meeting was arranged so we could compare notes. The other team was slightly shocked to see that I could train a model in a minute or two, where it took them an hour or two using essentially the same algorithm.

They insisted that shouldn't be, because I was doing it on my laptop and they were using a high performance computing cluster. They of course wanted to know how my implementation could be so much faster despite running on only a single machine. I didn't have the heart to suggest that maybe it was because, not despite.

Ironically, I also got the implementation done in a lot fewer person-hours. I just did a straight code-up of the algorithm in the paper, where they had to do a bunch of extra work to figure out how to adapt it to scale-out.

This isn't to say that big data doesn't happen. Just that it's a bit like sex in high school: People talk about it a lot more than they actually have it, perhaps because everyone's afraid their friends will find out they don't have it.

What reasons could he have had for wanting a cluster to process that spreadsheet? Just curious what problem they had.
Big ups for SQLite too. It's extremely easy to plug in (esp. if you're using Python) and rather performant. An added bonus is that you get simple SQL queries for free, and they're just as easy as awk. If you do small or medium-sized data processing (say, less than 100GB), SQLite is a must-learn tool.
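
To make the awk comparison concrete, here is a small sketch of a group-and-sum query run from Python. The table `payments`, its columns, and the sample rows are hypothetical, and an in-memory database stands in for a real `.db` file.

```python
import sqlite3

# Hypothetical sample data; in practice this would already live in a .db file.
rows = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("bob", 1.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)

# Roughly the SQL counterpart of the awk one-liner
#   awk -F, '{ s[$1] += $2 } END { for (k in s) print k, s[k] }'
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM payments GROUP BY customer ORDER BY customer"
).fetchall()

for customer, total in totals:
    print(customer, total)

conn.close()
```

Unlike the awk version, the same query keeps working once you need a join or a filter on another column.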
Many, many people have proposed exactly that, especially just a few years ago, when (not really) "big" data was a big fad and a 100GB PCIe SSD cost more than the coins in the couch cushions.
I usually just use local Spark for large files that pandas can't handle well. It's not perfect, but it works for me.