| HN Mirror


by jdoliner 3698 days ago

Hi, I'm one of the founders of Pachyderm, but in this comment I'm actually drawing more on my experience as an Airbnb employee. We used Hadoop at Airbnb and had a lot of problems with collaboration on data science.

The biggest difference between data science tutorials and data science at a company is that you have to work with people, lots of people, all of whom have different backgrounds, skill sets, etc. In a setting like this it gets very easy for people to step on each other's toes. For example, Alice is doing some analysis on a data set and she realizes that there's a format for this data that would greatly simplify her work. So she reformats the data, but unbeknownst to her this same data was being fed into our fraud model every night... which is now broken. So reformatting data that you don't own is a bad idea. Alice learns her lesson, and the next time she's in this situation she makes a copy of the data first so that she won't mess things up for others. This is a pretty standard thing to do with a Hadoop-based data lake, but it leads to much more subtle problems down the road.

By making a copy, Alice is setting up her own little data silo. A month later, when she goes to report her findings, she'll discover that they suddenly contradict other data scientists' work. Why? Because each data scientist has their own completely separate copy of the data. We had cases of side-by-side reports which greatly disagreed about how many users we had. People stop trusting data really quickly in an organization when there are glaring flaws like that.

In Pachyderm this workflow makes a lot more sense. If Alice wants to modify data, Pachyderm will give her a clean room to work in that won't affect any of her coworkers' analyses. As data outside the clean room changes, it can be automatically pulled in and processed using the same code she wrote before, and when she's ready to present her results she can share them publicly.
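To make the clean-room idea concrete, here's a rough toy sketch in Python. To be clear, this is not Pachyderm's API: `ToyRepo`, `derive`, `reformat`, the branch names and the sample rows are all made up for illustration. It only models the shape of the workflow: Alice's reformatted view lives in its own branch, the shared data is never modified, and rerunning the same code picks up upstream changes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass
class ToyRepo:
    """Hypothetical in-memory stand-in for a versioned data store (not Pachyderm's API)."""
    branches: Dict[str, Rows] = field(default_factory=dict)

    def commit(self, branch: str, rows: Rows) -> None:
        # Store copies so later mutations by the caller cannot leak into the branch.
        self.branches[branch] = [dict(r) for r in rows]

    def read(self, branch: str) -> Rows:
        # Hand out copies so readers cannot modify shared data in place.
        return [dict(r) for r in self.branches[branch]]

    def derive(self, source: str, target: str, transform: Callable[[Rows], Rows]) -> None:
        # Recompute the target branch from the current state of the source branch.
        self.commit(target, transform(self.read(source)))


def reformat(rows: Rows) -> Rows:
    # Alice's preferred format (assumed): one "name" field instead of first/last.
    return [{"name": f"{r['first']} {r['last']}"} for r in rows]


repo = ToyRepo()
repo.commit("master", [{"first": "Ada", "last": "Lovelace"}])

# Alice works in her own branch; "master" (read by the fraud model) is untouched.
repo.derive("master", "alice", reformat)

# New data lands upstream; rerunning the same code refreshes Alice's view.
repo.commit("master", repo.read("master") + [{"first": "Alan", "last": "Turing"}])
repo.derive("master", "alice", reformat)

print(repo.read("alice"))
print(repo.read("master"))
```

The point of the sketch is that there is only one source of truth ("master" here), and Alice's branch is always a function of it rather than a copy that drifts.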