| HN Mirror


by mrmrcoleman 2404 days ago

I know that it is popular in data science, where teams need to track more things. The reasoning:

1. All collaboration requires that collaborators are able to recreate a shared version of reality

2. This means version controlling all the things

3. For 'normal' software teams it's often ok to do this for just code and environment, hence git + docker

4. But data science teams need to worry about more variables: code, environment, training + test data, hyper-parameters, summary statistics...

Git LFS lets teams track training and test data (up to 2GB unless you run your own server, IIRC), which removes a lot of the headaches of building tooling to tie all these variables together with, for example, Git + Docker + S3.
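To make "tie all these variables together" concrete, here is a rough Python sketch of a run manifest that records the variables from the list above in one place. It is not how Git LFS or any particular tool does it. The names `build_manifest` and `manifest_id` and the manifest layout are made up for illustration, and only the standard library is used.

```python
import hashlib
import json


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(code_version, environment, data_paths, hyperparams, summary_stats):
    """Collect the variables of one run into a dict (hypothetical schema)."""
    return {
        "code_version": code_version,    # e.g. a git commit hash
        "environment": environment,      # e.g. a Docker image tag
        "data": {str(p): file_sha256(p) for p in data_paths},
        "hyperparams": hyperparams,
        "summary_stats": summary_stats,
    }


def manifest_id(manifest):
    """Hash the manifest's canonical JSON so identical runs share an ID."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two collaborators who compute the same `manifest_id` were working from the same code version, environment, data contents, and hyper-parameters. A differing ID means at least one of those variables changed.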

Dotscience.com is a good example of a project trying to solve this neatly.

Disclaimer: I used to work there.