by ramesh31 1195 days ago
>90% of the work in these teams is not core ML and is more mundane work supporting these models, such as data piping, cleaning, feature generation, experimentation, and real-time serving. You'll get plenty of experience in working directly with ML systems.

MLOps is what you're describing, and it's probably the number one field I'd recommend someone to go down right now as a backend dev.

3 comments

This is the #1 thing VC backed startups are trying to automate away
They've been trying to automate away the "grunge" work of building and managing complex software systems since the 80s. Trust me, we'll be fine
To be fair, it's not as funny as automating data cleaning, on the principle that data scientists don't want to do it.

And yeah, lots of people dislike it, but you can't build models without an understanding of the data, so even if automated data cleaning became possible (unlikely) you'd still need to spend a load of time doing work on the dataset before building anything useful.

Some people seem to think they will be able to type "clean my data" into ChatGPT or similar and get a beautiful clean dataset. They are probably descendants of the people who said "COBOL means we don't need programmers any more".

Data cleaning requires a lot of judgement and domain knowledge. Imagine if an AI did clean your dataset. Are you just going to trust it (Hell no!)? Or are you going to spend ages trying to work out what it did, which doesn't seem much of an improvement.

I write data cleaning/ETL software and I'm confident that the need for my product is going to go up between now and when I retire.
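To make the trust question above concrete, here is a minimal sketch of a cleaning step that records every cell it changes, so a human can review what was done instead of reverse-engineering it. The names `Change`, `apply_rule`, and the example `country` column are hypothetical and only for illustration; this is not taken from any particular ETL product.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass
class Change:
    """One modified cell: where it was, what it was, what it became, and why."""
    row_index: int
    column: str
    before: Any
    after: Any
    rule: str


def apply_rule(
    rows: List[Row],
    column: str,
    rule_name: str,
    fix: Callable[[Any], Any],
) -> Tuple[List[Row], List[Change]]:
    """Apply `fix` to one column; return a cleaned copy plus a log of changes.

    The input rows are never mutated, so the original data stays available
    for comparison.
    """
    cleaned: List[Row] = []
    log: List[Change] = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # shallow copy so the original row is untouched
        before = row.get(column)
        after = fix(before)
        if after != before:
            new_row[column] = after
            log.append(Change(i, column, before, after, rule_name))
        cleaned.append(new_row)
    return cleaned, log


def normalise_country(value: Any) -> Any:
    """Hypothetical rule: trim whitespace and upper-case string values."""
    return value.strip().upper() if isinstance(value, str) else value


rows = [{"country": " uk "}, {"country": "UK"}, {"country": None}]
cleaned, log = apply_rule(rows, "country", "normalise_country", normalise_country)
for change in log:
    print(change)
```

The log only says what changed. Deciding whether " uk " really should become "UK" is still the judgement and domain knowledge part.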

Why is that your number one recommendation?
Because 90% of the industry work is MLOps. The pipeline usually goes:

1. Make a POC inside a Jupyter Notebook with some scrappy data and an off-the-shelf model, define metrics, and train a baseline to see if the whole ML endeavour might even be worth it.
2. Do error analysis, find better data, tune parameters, and re-train to see if you can improve upon the baseline.
3. Make the first deployment and set up data collection.
4. Automate 2 as much as possible, because data is ever changing and you want to try many more off-the-shelf models.
5. Deploy new models and collect ever more feedback.

4 and 5 are basically a while loop that never ends, and that's mostly MLOps. It still requires proper ML expertise, especially when things break tho.
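A rough sketch of that loop, assuming a toy threshold classifier in place of a real model and a finite list of data batches in place of an endless stream. `Model`, `fit_best`, and `retrain_loop` are hypothetical names for illustration, not any MLOps framework's API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Example = Tuple[float, int]  # (feature, label)


@dataclass
class Model:
    """Toy stand-in for an off-the-shelf model: predicts 1 at or above a threshold."""
    name: str
    threshold: float

    def predict(self, x: float) -> int:
        return int(x >= self.threshold)


def accuracy(model: Model, data: List[Example]) -> float:
    """The metric defined back in step 1."""
    return sum(model.predict(x) == y for x, y in data) / len(data)


def fit_best(train: List[Example]) -> Model:
    """Step 4: try several cheap candidates and keep the best on the training data."""
    candidates = [Model(f"threshold={t}", t) for t in (0.25, 0.5, 0.75)]
    return max(candidates, key=lambda m: accuracy(m, train))


def retrain_loop(
    deployed: Model,
    batches: Iterable[Tuple[List[Example], List[Example]]],
) -> Model:
    """Steps 4 and 5: retrain on each new batch, deploy only if the challenger wins.

    In production this would be `while True` over freshly collected data.
    """
    for train, holdout in batches:
        challenger = fit_best(train)
        if accuracy(challenger, holdout) > accuracy(deployed, holdout):
            deployed = challenger  # step 5: swap in the new model
    return deployed


# Two batches; the second one simulates the data drifting.
batch1 = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
batch2 = [(0.1, 0), (0.3, 1), (0.6, 1), (0.9, 1)]
final = retrain_loop(Model("baseline", 0.75), [(batch1, batch1), (batch2, batch2)])
print(final)
```

Reusing the same data as both train and holdout keeps the example short; a real setup would hold out separate data. The "deploy only if it beats what's live" check is the part that needs ML expertise when it starts behaving strangely.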

We've got 4 and 5 pretty automated... the real issue is (as you are likely alluding to) that #1/2/3 keep drawing in new data that is completely infeasible to get at scale, and then someone wants you to re-train daily, 2x a day, every 4 hours, continuously. Oh, and your costs go through the roof and likely aren't worth the returns anymore when you're chasing that .001%.
Do you literally search for “MLops”?
Great starting point, with lots of info, podcasts, and whatnot.

https://mlops.community