

by dwhitena 3312 days ago

Thanks for sharing this story. My experience has been that data scientists and analysts aren't able to use Hadoop/Spark efficiently even in cases where it is warranted. These individuals generally don't like working with Java/Scala and/or haven't spent time understanding the underlying structures and mechanisms (e.g., RDDs, caching). As a result, they either don't put their sophisticated models or analyses into production, or they hand off their application to other engineers to reimplement for production-size "big data." This produces all sorts of problems and inefficiencies, not the least of which is that the engineers don't understand the analyses and the data scientists don't understand the implementation.

My (biased, as I work for them) opinion is that something like Pachyderm (http://pachyderm.io/) will ease some of these struggles. The philosophy of those who work on this open source project is that data people should be able to use the tooling and frameworks they like/need and be able to push their analyses to production pipelines without rewriting them, without lots of friction, and without worrying about things like data sharding and parallelism.

For example, in Pachyderm you can create a nice, simple Python/R script that is single-threaded and runs nicely on your laptop. You can then put the exact same script into Pachyderm and run it in a distributed way across many workers on a cluster. This keeps your code simple and approachable, while still allowing people to push things into infrastructure and create value.
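
To make that concrete, here is a minimal sketch of the kind of single-threaded script I have in mind. Everything in it is illustrative: `count_words` is a stand-in for a real analysis, and `process_dir` and the command-line arguments are hypothetical names rather than anything Pachyderm requires. The only assumption is that the script reads files from one directory and writes results to another.

```python
import sys
from pathlib import Path


def count_words(text: str) -> int:
    """Stand-in for the real analysis: count whitespace-separated tokens."""
    return len(text.split())


def process_dir(input_dir: Path, output_dir: Path) -> None:
    """Run the analysis on every file in input_dir, writing one result file per input."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(input_dir.iterdir()):
        if path.is_file():
            result = count_words(path.read_text())
            (output_dir / f"{path.name}.count").write_text(f"{result}\n")


if __name__ == "__main__":
    # Hypothetical usage: python analyze.py <input_dir> <output_dir>
    process_dir(Path(sys.argv[1]), Path(sys.argv[2]))
```

Note that the script knows nothing about sharding or workers. Because each input file is handled independently, a pipeline system is free to give different subsets of the files to different workers and run the same code on each subset.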