Building Thousands of Reproducible ML Models with pipe


	Building Thousands of Reproducible ML Models with pipe (data.blog)
	49 points by datadem 2721 days ago

2 comments

dinedal 2721 days ago

No source link? No access to this "magical pipe"? Is this a showcase of proprietary software?


datadem 2720 days ago

It is meant to be a showcase of what data scientists work on in companies whose businesses do not rely on the accuracy of ML models.

For example, data science work at companies like LinkedIn or Facebook can end up requiring concentrated focus on model performance as they have highly developed ML capabilities.

Data scientists at smaller or less data-driven companies end up running ad-hoc queries for marketing or product teams. We are somewhere in between at Automattic, and we find value in sharing our day-to-day work so other data scientists and companies know what to expect when they hire or get hired as someone to do data science work - which is now such a broad definition that it barely means anything.

I also wanted to share our learnings about going from ad-hoc queries, one-off models and solutions to general frameworks, and make a case for custom ML pipelines by showing how this kind of work is so closely coupled with internal data.

It really isn't very magical, though. It is mainly data and software engineering work as mentioned in the post and we are still in the early stages.

I think there is some magic in the applications of ML models to actual business questions which is also a topic that I want to post about on the same blog.

It also wasn't meant as a way to recruit people. As far as I know, we are not explicitly looking to hire more data scientists right now (though, of course, we always welcome applications). I posted this to HN because people ask many questions about what exactly data scientists do in industry!


achernik 2721 days ago

looks like they are "introducing" some internal component https://mobile.twitter.com/automattic/status/106436688085984...


nerdponx 2721 days ago

Then what's the point? Low-key flex to try and attract talent?


newaccoutnas 2721 days ago

I came here to post the same thing, had a look around the blog and no link


kelvin0 2721 days ago

I started taking the Coursera ML class. Reading this article, something jumped out at me:

https://datadotblog.files.wordpress.com/2018/12/Screen-Shot-...

It mentions how it's 'impossible' to separate the data points in Cartesian coordinates. Isn't this exactly the use case for logistic regression? Wouldn't that make the transformation irrelevant?

Anyone with ML experience have an opinion on this?


nerdponx 2721 days ago

No, linear regression does not imply separation.

Yes, this is why we use regression, soft-margin SVM, etc. instead of hard-margin SVM. Because perfect linear separation is unrealistic.


kelvin0 2721 days ago

Please note I wrote 'Logistic Regression' and not 'Linear Regression' (as you seem to think).

Logistic Regression based classification (with quadratic theta parameters) would certainly seem able to handle the Cartesian case (without having to resort to converting to polar coordinates).
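
One way to read "quadratic theta parameters" is a logistic regression hypothesis with squared features added. The notation below (features x_1 and x_2, parameters theta_j, and a class-separating radius R) is assumed for illustration and is not taken from the post:

```latex
h_\theta(x) = \sigma\!\left(\theta_0 + \theta_1 x_1 + \theta_2 x_2
              + \theta_3 x_1^2 + \theta_4 x_2^2\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

% The decision boundary is where h_\theta(x) = 1/2, i.e. where the argument is 0.
% Choosing \theta_1 = \theta_2 = 0, \theta_3 = \theta_4 = 1, \theta_0 = -R^2 gives
x_1^2 + x_2^2 = R^2
\quad\Longleftrightarrow\quad
r = R \ \text{in polar coordinates}
```

Under that reading the boundary is a circle in the original plane, but the model is still linear in the features it is given. The squared terms are themselves a transformation of the data, playing the same role as the radius in polar coordinates.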


nerdponx 2721 days ago

I meant to write "logistic", but it's worth noting that logistic regression is a linear model from which you derive a linear decision boundary.

And yes, it can handle it, by finding an "optimal" boundary according to a criterion other than "is it separated or not?". But that's not the point. The data remains inseparable.

And yes, while logistic regression can technically handle this case (by returning a solution and not blowing up), it will perform poorly unless you transform the data, because the decision boundary is still linear.
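
A minimal sketch of that claim, under stated assumptions: the data is a made-up toy set (an inner blob surrounded by a ring, not the data from the post), and the logistic regression is a hand-rolled batch gradient descent rather than any particular library. The helper names `make_rings`, `standardize`, `fit_logistic`, and `accuracy` are hypothetical. The same model is fit once on raw Cartesian coordinates and once on the polar radius alone.

```python
import math
import random


def make_rings(n_per_class=200, seed=0):
    """Toy data (an assumption): class 0 has radius in [0, 1], class 1 in [2.5, 3.5]."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (r_lo, r_hi) in enumerate([(0.0, 1.0), (2.5, 3.5)]):
        for _ in range(n_per_class):
            r = rng.uniform(r_lo, r_hi)
            angle = rng.uniform(0.0, 2.0 * math.pi)
            points.append((r * math.cos(angle), r * math.sin(angle)))
            labels.append(label)
    return points, labels


def standardize(features):
    """Scale each column to zero mean and unit variance so gradient descent behaves."""
    n = len(features)
    cols = list(zip(*features))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in features]


def sigmoid(z):
    # Numerically safe for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(features, labels, lr=0.5, epochs=300):
    """Batch gradient descent on the log loss; returns weights with the bias last."""
    n, d = len(features), len(features[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, y in zip(features, labels):
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            err = sigmoid(z) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def accuracy(w, features, labels):
    """Fraction of points on the correct side of the linear boundary w.x + b = 0."""
    correct = 0
    for x, y in zip(features, labels):
        z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


points, labels = make_rings()
cartesian = standardize(points)                                    # features: (x, y)
radius_only = standardize([(math.hypot(x, y),) for x, y in points])  # feature: r

acc_cartesian = accuracy(fit_logistic(cartesian, labels), cartesian, labels)
acc_polar = accuracy(fit_logistic(radius_only, labels), radius_only, labels)
print(f"cartesian: {acc_cartesian:.2f}  polar radius: {acc_polar:.2f}")
```

On this toy data no straight line through the plane can get much beyond roughly 70% accuracy, because any line that keeps the inner blob on one side also cuts off a large part of the ring. A single threshold on the radius separates the two classes cleanly, which is the effect of the polar transformation.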


kelvin0 2721 days ago

Really appreciate your feedback, I'll certainly look into your claims in the next few days.

What's your background? Have you 'been' in ML long? Feel free to give me as much detail as you feel comfortable with.

Thanks!
