| HN Mirror


by virmundi 3230 days ago

I used Hadoop to good effect on an important fraud system. When we first deployed, we only dealt with 59 GB of data at rest per day. I knew that more models would run over the data with each release, and I assumed that the models would become more complex over time. Eventually they would need to perform calculations over years' worth of data.

Hadoop provided a data-centric approach to parallel computing. Using Cascading, a high-level pipe/filter library for Hadoop, we could build complex, locally testable models. Using eventing, we could plug those models into a self-managing workflow. Adding a new model meant starting a JVM for that model, which hooked into the event system and ran Hadoop jobs as needed. If any one model failed, we could rerun it without affecting the rest of the system.
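
To give a feel for the shape of that workflow, here is a rough sketch in plain Java. It is not the actual system and it uses neither Cascading nor Hadoop: every name in it (`DataReady`, `FraudModel`, `EventBus`, `StubModel`) is hypothetical, the models run in one process rather than one JVM each, and the "job" is a stub. It only illustrates the two properties described above: models subscribe to an event, and a failed model can be rerun on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FraudWorkflowSketch {

    /** Hypothetical event: one day's data is ready to be scored. */
    public static final class DataReady {
        public final String day;
        public DataReady(String day) { this.day = day; }
    }

    /** Hypothetical model contract; a real model would launch Hadoop jobs in run(). */
    public interface FraudModel {
        String name();
        void run(DataReady event) throws Exception;
    }

    /** Minimal in-process event bus that isolates failures per model. */
    public static final class EventBus {
        private final Map<String, FraudModel> models = new LinkedHashMap<>();

        /** Adding a model is just a registration; existing models are untouched. */
        public void register(FraudModel model) { models.put(model.name(), model); }

        /** Delivers the event to every model and returns the names of those that failed. */
        public List<String> publish(DataReady event) {
            List<String> failed = new ArrayList<>();
            for (FraudModel model : models.values()) {
                if (!runOne(model, event)) failed.add(model.name());
            }
            return failed;
        }

        /** Reruns a single model by name without touching the others. */
        public boolean rerun(String name, DataReady event) {
            FraudModel model = models.get(name);
            return model != null && runOne(model, event);
        }

        private static boolean runOne(FraudModel model, DataReady event) {
            try {
                model.run(event);
                return true;
            } catch (Exception e) {
                return false; // one model's failure never propagates to the rest
            }
        }
    }

    /** Stand-in model that fails a fixed number of times before succeeding. */
    public static final class StubModel implements FraudModel {
        private final String name;
        private final int failuresBeforeSuccess;
        private int runs = 0;

        public StubModel(String name, int failuresBeforeSuccess) {
            this.name = name;
            this.failuresBeforeSuccess = failuresBeforeSuccess;
        }

        public String name() { return name; }
        public int runs() { return runs; }

        public void run(DataReady event) throws Exception {
            runs++;
            if (runs <= failuresBeforeSuccess) {
                throw new Exception("simulated job failure for " + event.day);
            }
        }
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        bus.register(new StubModel("stable-model", 0));
        bus.register(new StubModel("flaky-model", 1));

        DataReady event = new DataReady("day-1");
        List<String> failed = bus.publish(event);
        System.out.println("failed on first pass: " + failed);

        // Rerun only the models that failed.
        for (String name : failed) {
            System.out.println("rerun " + name + " ok: " + bus.rerun(name, event));
        }
    }
}
```

In the real setup the bus would be an external event system and each `run` would submit jobs to the cluster, but the isolation idea is the same: a failure is recorded per model, and only that model is retried.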

This scaled to 60-odd fraud models that looked over up to 5 years' worth of data (5 TB). Some were quick, since they only looked at a day's worth of data. Others took several hours. In the end, Hadoop made the entire process testable, scalable, and easier to reason about.

1 comment

alexeiz 3230 days ago

Why do I feel a sudden urge to pay you $250/hr?
