by roystonvassey 2800 days ago
Great explanation and I love the fact that the entire presentation is a Jupyter Notebook!

A non-academic observation - the 'real-world' challenge of ML pipelines is what I call the 'last-mile' problem of ML - operationalizing your model. You begin to run into problems of:

1. How often do you 'score' live data? How will this affect latency, data ingestion etc?

2. How often do you have to update your weights, if you want your model's performance to be consistent?

3. Integration with source systems

4. If you build your final scoring model on library-dependent languages like Python, how do you ensure no breakages? (Docker solves this to a large extent though)
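As a rough illustration of point 4 (a complement to Docker, not a replacement for it): a scoring service can record the library versions used at training time and compare them with the runtime before loading the model. This is a minimal sketch using only the standard library. The names `installed_versions` and `check_versions`, and the manifest contents, are hypothetical and not taken from the notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for the current environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return versions


def check_versions(expected):
    """Compare versions recorded at training time with the runtime.

    Returns a list of mismatch descriptions (empty if everything matches).
    """
    actual = installed_versions(expected)
    return [
        f"{name}: trained with {want}, runtime has {actual[name]}"
        for name, want in expected.items()
        if actual[name] != want
    ]


if __name__ == "__main__":
    # Hypothetical manifest saved next to the model artifact at training time.
    manifest = {"numpy": "1.26.4", "scikit-learn": "1.4.2"}
    for problem in check_versions(manifest):
        print("version mismatch:", problem)
```

Whether a mismatch should block scoring or only log a warning is a judgment call that depends on how sensitive the model's serialization format is to library changes.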


Seconding this. I have run a data science and machine learning team for the last couple of years. By far the most challenging part of our work has been convincing our data management team that we aren't just another front-end widget factory, and convincing our development/operations staff that we aren't choosing "non-standard" tech to deliver model results into production. Model maintenance is difficult too, due to poor data management practices, but for my team it's less challenging than the other items.
What have you found to work best when coordinating with your data management and development/operations staff?
Every organization and team is different. I've often found two approaches work best. The first is going around the roadblocks and managing everything end to end, then getting buy-in for data and ops to own it properly after the fact (playing up the political angle of owning more stuff after we do the heavy lifting). The second is the brute-force method of meeting after meeting to educate people about the differences in use cases and deployment for ML products.
Same question here. I'm keen to understand this, especially around responsibilities, OKRs and KRAs.
That’s it, really. Any good reference to keep up to date with the last-mile best practices for the average ML practitioner? Thanks!
I link these resources often, but they are often relevant! See "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" [1] and the Rules of Machine Learning [2]. Another classic: "Machine Learning: The High Interest Credit Card of Technical Debt" [3], and recently added: Responsible AI Practices [4].

[1] https://ai.google/research/pubs/pub46555

[2] https://developers.google.com/machine-learning/rules-of-ml/

[3] https://ai.google/research/pubs/pub43146

[4] https://ai.google/education/responsible-ai-practices

It's a theme I like to read about (I mean "practical issues around ML in production settings"). I find company blogs and some research publications are great resources. Examples:

- https://eng.uber.com/
- https://code.fb.com/

and many more. Google also publishes papers on various engineering practices obviously, some ML-related, but I can't find a blog where they focus on that specifically.

Also it's not "to keep up to date", but there's a great paper (from Google) that's often cited:

Machine Learning: The High Interest Credit Card of Technical Debt https://ai.google/research/pubs/pub43146

It talks about issues you face over the long run (I've experienced some of those). It also provides interesting pointers for further reading, e.g. about "pipeline jungles".

If others have pointers, I'm curious to hear about them as well.

Could that be because of using Jupyter notebooks themselves? I like Jupyter for data and machine learning 'journalism', but I don't see it as a proper medium to address the 'last mile'. The insights derived from Jupyter, in my opinion, are not actionable or well integrated enough. It is becoming a de facto medium that reminds me of shared Excel files.
Could be. Using Jupyter for ML development or even prototyping (as opposed to presentations / demonstration / teaching like the OP — that's where Jupyter really shines) is a red flag.

I see a similar pattern with Pandas: some people use Pandas not because it's the right tool for the job (Pandas has many strengths), but because they're scared of writing comprehension loops and basic data structures. To avoid the CS-y stuff. But without the CS-y stuff, the result ends up a mess of lambdas, weird reindexing and buggy copy/view semantics.

And then "the next guy", the one whose job it is to clean up and productionize the maverick's output, ends up having to reinvent and fix the entire solution. Basically doing both jobs.
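To make the "basic data structures" point concrete: a group-and-sum that people often reach for Pandas to do can be written with a dictionary and a loop. This is only a sketch. The `orders` data and the `total_by_customer` name are made up for illustration, and it isn't a claim that Pandas is the wrong tool in general.

```python
from collections import defaultdict


def total_by_customer(orders):
    """Sum order amounts per customer using a plain dict, no DataFrame."""
    totals = defaultdict(float)
    for customer, amount in orders:
        totals[customer] += amount
    return dict(totals)


# Made-up example data: (customer, amount) pairs.
orders = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
print(total_by_customer(orders))  # {'alice': 12.5, 'bob': 5.0}
```

There is no index to realign and no copy/view ambiguity here, which is the kind of thing the next person can read and test quickly.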

How do you suggest prototyping without Jupyter? (in case prototyping means researching an approach)
Yes, Jupyter is for initial exploration. Then you write solid normal production code. Then you might write further notebooks that import that production code and run/visualize metrics and reporting for your client (probably non-technical people).

I had a "data scientist" submit notebooks to us as if we could ship any of that in production. (We fired him.) It's for hacking and blogging, not for production work.
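A small sketch of the workflow described above (explore in a notebook, move the logic into ordinary code, then import that code back into reporting notebooks). The module name `metrics.py` and the `accuracy` function are hypothetical, chosen only to show the shape of the split.

```python
# metrics.py -- hypothetical production module, kept under version control and tested.


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels (both are sequences)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# In a reporting notebook cell, one would import the function rather than
# redefine it, so the notebook and production share one implementation:
#     from metrics import accuracy
#     accuracy(labels, predictions)
```

The notebook then contains only calls and plots, and anything that needs to ship lives in a module that can be reviewed and unit tested.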