by anovikov 3231 days ago
Big data: they nailed it! Every single application of 'big data' I ever saw was hype-driven, and much more easily solved by boring, traditional things that were already there in the K&R era: hash maps, Berkeley DB, and memory-mapped files. Sometimes taking 100x fewer resources, like literally doing on a single computer what took a huge Hadoop cluster.

Maybe at Google's scale of data, that doesn't work as easily. Maybe when you have a billion-dollar infrastructure bill, big data works better. But that leaves out the 99% of companies whose data isn't that big.
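
For anyone wondering what the "boring" version looks like, here is a minimal sketch in Python, assuming the job is a group-by count over a delimited log file. The function names (`count_keys`, `demo`) and the file layout are made up for illustration; this is not from any particular project. The file is memory-mapped, so the OS pages it in on demand, and the aggregation is a plain hash map.

```python
import mmap
import os
import tempfile
from collections import Counter


def count_keys(path, sep=b","):
    """Count how often each first field appears in a delimited file."""
    counts = Counter()
    # mmap cannot map an empty file, so handle that case up front.
    if os.path.getsize(path) == 0:
        return counts
    with open(path, "rb") as f:
        # Read-only mapping: the OS pages data in as it is touched,
        # so the file may be larger than available RAM.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # readline() returns b"" once the end of the map is reached.
            for line in iter(mm.readline, b""):
                key = line.split(sep, 1)[0].strip()
                if key:
                    counts[key] += 1
    return counts


def demo():
    """Write a tiny sample file and count its keys (illustration only)."""
    with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as f:
        f.write(b"alice,10\nbob,5\nalice,7\n")
        path = f.name
    try:
        return count_keys(path)
    finally:
        os.remove(path)
```

The hash map has to fit in memory, but only the distinct keys live there, not the raw data. That is the case where a single machine tends to be enough.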

2 comments

Shut your mouth--I can't keep charging $250/hr. unless people BELIEVE!

But seriously, I just came off a project last year that used Hadoop "because" and for no other discernible reason. I personally did a ton of studying on Data Science and then...crickets. Couldn't find anyone who really wanted that kind of work done. Maybe in time.

I'd be curious if anyone else has had the same experience re: Data Science.

From what I've seen so far it seems like the super big tech companies pay _all the money_ for ML/DS people; outside of that the pickings seem to get slim quick.

In all fairness, the super big companies (not only tech) are the only ones with any big data.
I used Hadoop well on an important fraud system. When we first deployed, we only dealt with 59 GB of data at rest per day. What I knew would happen is that more models would run over the data with each release. I assumed the models would become more complex over time and would eventually need to perform calculations over years' worth of data.

Hadoop provided a data-centric approach to parallel computing. Using Cascading, a high level pipe/filter library for Hadoop, we could make complex, locally testable models. Using eventing we could plug those models into a self-managing workflow. Adding a new model meant starting a JVM for that model that hooked into the event system and ran Hadoop jobs as needed. If any one model failed, we could rerun it without affecting the rest of the system.
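
To make the shape of that concrete, here is a rough sketch in plain Python. It is not Cascading's API and does not touch Hadoop; `pipeline`, `where`, and `Workflow` are hypothetical names invented for this example. It only illustrates the two ideas: models built from small pipe/filter stages that can be tested locally, and a workflow that runs each model on a "new data" event and lets a failed model be rerun on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = dict
Pipe = Callable[[Iterable[Record]], Iterable[Record]]


def pipeline(*stages: Pipe) -> Pipe:
    """Compose stages left to right into a single pipe."""
    def run(records: Iterable[Record]) -> Iterable[Record]:
        for stage in stages:
            records = stage(records)
        return records
    return run


def where(pred: Callable[[Record], bool]) -> Pipe:
    """Filter stage: keep only records matching the predicate."""
    return lambda records: (r for r in records if pred(r))


@dataclass
class Workflow:
    """Hypothetical event-driven runner; each model fails independently."""
    models: Dict[str, Pipe] = field(default_factory=dict)
    results: Dict[str, List[Record]] = field(default_factory=dict)
    failed: Dict[str, Exception] = field(default_factory=dict)

    def register(self, name: str, model: Pipe) -> None:
        self.models[name] = model

    def on_new_data(self, records: List[Record]) -> None:
        # "Event": new data arrived, so every registered model runs.
        for name in self.models:
            self.run_model(name, records)

    def run_model(self, name: str, records: List[Record]) -> None:
        try:
            # list() forces the lazy pipe, so errors surface here.
            self.results[name] = list(self.models[name](records))
            self.failed.pop(name, None)
        except Exception as exc:
            # Record the failure; other models are unaffected.
            self.failed[name] = exc
```

A failed model can be fixed, re-registered, and rerun with `run_model` while the results of the other models stay as they were.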

This scaled to 60-some-odd fraud models that looked over up to 5 years' worth of data (5 TB). Some were quick since they only looked at a day's worth of data. Some took several hours. In the end, Hadoop made the entire process easier to handle mentally, testable, and scalable.

Why do I feel a sudden urge to pay you $250/hr?