by hcatlin 3697 days ago
I'm not really sure what the use cases are for collaboration on this kind of thing or what that means. Are there real-world use cases I'm not thinking of? "Collaboration" is a great buzzword, but what does it have to do with a data store?
2 comments

(full disclosure - pachyderm employee)

Good question. It's funny how much collaboration is overlooked. And you're right - it's not obvious how a data store can enable collaboration.

In the software engineering world, collaboration by means of git is so prevalent it's like breathing air. There's no such thing today for data scientists! That's crazy, because doing data science involves more variables than writing software alone.

Pachyderm stores your data in a git-like manner. We store the deltas and version the data so that it's consistently reproducible. We also give you some nice tools to run any code alongside your data.

This enables some very basic workflows:

1 - You're trying to develop your analysis - so work on your code & lock your data

2 - You're trying to vet new data - develop and version your feature extraction and data together

3 - You're trying to work on some analysis with colleagues - fork the data + analysis to do your work ... then merge to make sure your work is compatible before deploying

There are many more ... but hopefully that makes it a bit more concrete
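
To make the workflows above a bit more tangible, here is a toy sketch in Python. It is not Pachyderm's API: `ToyRepo`, `commit`, `fork`, and `read` are made-up names for illustration, and it stores full snapshots rather than deltas to stay short. Merging is left out.

```python
import hashlib
import json


class ToyRepo:
    """Toy git-like data store (hypothetical, not Pachyderm's API)."""

    def __init__(self):
        self.commits = {}                  # commit id -> {"parent": ..., "files": ...}
        self.branches = {"master": None}   # branch name -> head commit id

    def commit(self, branch, changes):
        """Apply changes (path -> content, None deletes) on top of the branch head."""
        parent = self.branches[branch]
        files = dict(self.commits[parent]["files"]) if parent else {}
        for path, content in changes.items():
            if content is None:
                files.pop(path, None)
            else:
                files[path] = content
        payload = json.dumps({"parent": parent, "files": files}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_id] = {"parent": parent, "files": files}
        self.branches[branch] = commit_id
        return commit_id

    def fork(self, source, new_branch):
        """Start a new branch at the current head of an existing one."""
        self.branches[new_branch] = self.branches[source]

    def read(self, ref):
        """Return the files at a branch name or at a pinned commit id."""
        commit_id = self.branches.get(ref, ref)
        if commit_id is None:
            return {}
        return dict(self.commits[commit_id]["files"])


repo = ToyRepo()
v1 = repo.commit("master", {"users.csv": "id,name\n1,alice\n"})

# Workflow 1: "lock" the data by reading from a pinned commit id.
locked = repo.read(v1)

# Workflow 3: fork, change the data on your own branch, leave master alone.
repo.fork("master", "alice-analysis")
repo.commit("alice-analysis", {"users.csv": "id,name,signup\n1,alice,2015\n"})
```

Reading `master` still returns the original file after the change on `alice-analysis`, and `locked` keeps pointing at the same content no matter what is committed later.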

And I should add we talk about Collaboration and other design goals more here: https://pachyderm.io/dsbor.html

Hi, I'm one of the founders of Pachyderm, but I'm actually drawing more on my experience as an Airbnb employee in this comment. We used Hadoop at Airbnb and had a lot of problems with collaboration on data science.

The biggest difference between data science tutorials and data science at a company is that you have to work with people: lots of people, all of whom have different backgrounds, skill sets, etc. In a setting like this it gets very easy for people to step on each other's toes. For example, Alice is doing some analysis on a data set and she realizes that there's a format for this data that would greatly simplify her work. So she reformats the data, but unbeknownst to her this same data was being fed into our fraud model every night... which is now broken. Reformatting data that you don't own is a bad idea, so Alice learns her lesson, and the next time she's in this situation she makes a copy of the data first so that she won't mess things up for others. This is a pretty standard thing to do with a Hadoop-based data lake, but it leads to much more subtle problems down the road.

By making a copy, Alice is setting up her own little data silo. A month later, when she goes to report her findings, she'll discover that they suddenly contradict other data scientists' work. Why? Because each data scientist has their own completely separate copy of the data. We had cases of side-by-side reports which greatly disagreed about how many users we had. People stop trusting data really quickly in an organization when there are glaring flaws like that.

In Pachyderm this workflow makes a lot more sense. If Alice wants to modify data, Pachyderm will give her a clean room to work in that won't affect any of her coworkers' analysis. As data outside the clean room changes, it can be automatically pulled in and processed using the same code she wrote before, and when she's ready to present it to others she can share it publicly.
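
As a rough illustration of the clean room idea (again a toy sketch, not how Pachyderm actually implements it), the snippet below re-runs the same transformation on a private snapshot every time the shared data changes. `CleanRoom`, `refresh`, and `reformat` are hypothetical names, and the sample rows are invented.

```python
import copy


def reformat(rows):
    """Alice's hypothetical reformatting step: 'id,country' strings -> dicts."""
    return [dict(zip(("id", "country"), row.split(","))) for row in rows]


class CleanRoom:
    """Toy clean room: the transform only ever sees a private copy of the data."""

    def __init__(self, upstream, transform):
        self.upstream = upstream      # shared data, owned by someone else
        self.transform = transform    # the code Alice wrote once
        self.output = None

    def refresh(self):
        """Pull in the current upstream data and re-run the same code on a copy."""
        snapshot = copy.deepcopy(self.upstream)
        self.output = self.transform(snapshot)
        return self.output


shared = ["1,US", "2,FR"]             # what the nightly fraud model reads
room = CleanRoom(shared, reformat)
room.refresh()

shared.append("3,JP")                 # new data arrives outside the clean room
room.refresh()                        # same code, fresh input
```

The shared list keeps its original format throughout, so anything else reading it is unaffected, while `room.output` always reflects the latest upstream data in Alice's preferred format.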