by barneso 3707 days ago
Most teams I have seen use either template scripts or boilerplate code to generate datasets, and share both the generated data and the scripts through the usual channels for sharing data and code: disk, S3, GitHub, emailed notebooks, etc.

It requires a fair amount of set-up, but works surprisingly well once a core team and a set of problems are established.
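
To make the "template script" idea concrete, here is a minimal sketch of what such a script might look like. Everything in it is an assumption for illustration: the function name `generate_dataset`, the columns `x` and `y`, and the synthetic data itself are hypothetical, not something any particular team uses. The point is only that a fixed seed plus a checksum lets people verify they are working from the same generated file, however it gets shared.

```python
import csv
import hashlib
import random
from pathlib import Path


def generate_dataset(path, n_rows=100, seed=0):
    """Write a small synthetic CSV and return its SHA-256 hex digest.

    Hypothetical template: swap the row-generation logic for your own.
    """
    rng = random.Random(seed)  # fixed seed so reruns produce identical rows
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x", "y"])  # hypothetical column names
        for _ in range(n_rows):
            x = rng.random()
            y = 2 * x + rng.gauss(0, 0.1)  # noisy linear relationship
            writer.writerow([f"{x:.6f}", f"{y:.6f}"])
    # The digest lets a teammate confirm they have the same file.
    return hashlib.sha256(path.read_bytes()).hexdigest()


if __name__ == "__main__":
    print(generate_dataset("train.csv", n_rows=1000, seed=42))
```

The CSV and the script can then go wherever the team already shares things (disk, S3, GitHub), and anyone who reruns the script with the same seed can compare digests.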

We are building mldb.ai to help bring the data and the algorithms for ML together in a less ad-hoc manner, and to help move things out of research and into prod once they are ready. Many of the hosted ML solutions (Azure ML, Amazon ML, Google Data Lab, etc.) and other toolkits (e.g. GraphLab) are working on similar ML workflow and organizational structure problems.

1 comment

Which projects do you know of that use "disk, S3, github,..." to share their datasets? I'm curious what you think, because I haven't read about any ML projects actually using hosted ML solutions like Amazon ML+S3. I've only seen Amazon itself recommend Amazon ML.
S3 is a good way to share files.