| HN Mirror



	by dweinus 407 days ago
	I don't want to hate, what you built is really cool and should save time in a data scientist's workflow, but... we did this. It won't "automate most of the ML lifecycle." Back in ~2018 "autoML" was all the rage. It failed because creating boilerplate and training models are not the hard parts of ML. The hard parts are evaluating data quality, seeking out new data, designing features, making appropriate choices to prevent leakage, designing evaluation appropriate to the business problem, and knowing how this will all interact with the model design choices.

2 comments

impresburger 407 days ago

Hey, one of the authors here! I completely agree with your comment. Training ML models on a clean dataset is the "easy" and fun part of an ML engineer's job.

While we do think our approach might have some advantages compared to "2018-style" AutoML (more flexibility, easier to use, potentially more intelligent exploration of the solution space), we know it suffers from the issue you highlighted. For the time being, this is aimed primarily at engineers who don't have ML expertise: someone who understands the business context, knows how to build data processing pipelines and web services, but might not know how to build the models.

Our next focus area is applying the same agentic approach to the "data exploration" and "feature ETL engineering" parts of the ML project lifecycle. Think a "data analyst agent" or "data engineering agent", with the ability to run and deploy feature processing jobs. I know it's a grand vision, and it won't happen overnight, but it's what we'd like to accomplish!

Would love to hear your thoughts :)


lamename 406 days ago

> this is aimed primarily at engineers who don't have ML expertise: someone who understands the business context, knows how to build data processing pipelines and web services, but might not know how to build the models.

I respect software engineers a lot; however, ANYONE who "doesn't know how to build models" also doesn't know what data leakage is or how to evaluate a model more deeply than simple metrics/loss, and can easily trick themselves into building a "great" model that ends up falling on its face in prod. So apologies if I'm highly skeptical of the admittedly very, very cool thing you have built. I'd love to hear your thoughts.
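
For readers unfamiliar with the term, here is a minimal sketch of one common form of data leakage: computing preprocessing statistics on train and test data together. The function names and toy values are hypothetical, chosen only for illustration, and have nothing to do with Plexe's internals.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Return (mean, std) computed from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero on constant data
    return mu, sigma

def standardize(values, mu, sigma):
    """Scale values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

# Hypothetical toy feature: the held-out point is an outlier.
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]

# Leaky: the statistics "see" the test point before evaluation.
leaky_mu, leaky_sigma = fit_standardizer(train + test)

# Leak-free: fit on the training split only, then reuse on the test split.
mu, sigma = fit_standardizer(train)
train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)
```

In the leaky version the mean is pulled toward the held-out point, so the evaluation no longer reflects data the model has never seen.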


impresburger 406 days ago

I think you're probably right. As an example of this challenge, I've noticed that engineers who don't have a background in ML often lack the "mental models" to understand how to think about testing ML models (i.e. statistical testing as opposed to the kind of pass/fail test cases that are used to test code).
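
To make the contrast concrete, here is a minimal sketch of what a statistical "test case" might look like, as opposed to an exact-match assertion. It assumes a classifier scored by accuracy and uses a percentile bootstrap; the function names, the baseline, and the toy labels are all hypothetical.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap interval."""
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    stats = []
    for _ in range(n_resamples):
        # Resample the per-example outcomes with replacement.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper

def model_passes(y_true, y_pred, baseline_accuracy):
    """Pass only if the interval's lower bound clears the baseline."""
    _, lower, _ = bootstrap_accuracy_ci(y_true, y_pred)
    return lower > baseline_accuracy

# Hypothetical toy labels: 100 examples, 90 predicted correctly.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 45 + [0] * 5 + [0] * 45 + [1] * 5
acc, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
```

The point of the design is that the "assertion" is about an interval rather than a single number, so a model that scores well on a small test set by luck is less likely to pass.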

The way I look at this is that Plexe can be useful even if it doesn't solve this fundamental problem. When a team doesn't have ML expertise, their choices are A) don't use ML, B) acquire ML expertise, or C) use ChatGPT as your predictor. Option C suffers from the same problem you mentioned, in addition to latency/scalability/cost and the model not being trained on your data, etc. So something like Plexe could be an improvement on option C by at least addressing the latter pain points.

Plus: we can keep throwing more compute at the agentic model building process, doing more analysis, more planning, more evaluation, more testing, etc. It still won't solve the problem you bring up, but hopefully it gets us closer to the point of "good enough to not matter" :)

Would love to hear your thoughts on this.


janalsncm 406 days ago

Just a thought, but maybe a good angle would be to interview data analysts and ask them what the most annoying parts of their jobs are, to figure out how to automate the drudge work. If you can make their lives easier, they’ll sell the product for you.


vaibhavdubey97 406 days ago

Absolutely! When we started building this out, we knew that we had to build an agent to perform data cleaning and feature transformations. After speaking to data analysts, PMs and engineers over the last few weeks, we've received strong feedback about adding this capability to Plexe and we're actively working on it. We've already added some features related to this and hopefully will roll out the whole agent very soon!


janalsncm 406 days ago

Yes, this is the issue. In any reasonably-sized enterprise you’re not going to have a clean CSV to plug into a model generator. You’re either going to have 1) 50 different Excel spreadsheets to wrangle and combine somehow or 2) 50+ terabytes of messy logs to process.

Creating something that can grok MNIST is certainly cool, but it’s a bit like saying LeetCode is equivalent to software engineering.

Second, and more practically speaking, you are automating (what I think of as) the most fun part of ML: the creativity of framing a problem and designing a model to solve that problem.


vaibhavdubey97 406 days ago

Agree completely. We built Plexe with that first scenario in mind - the messy spreadsheet problem that's so common in enterprise. You can connect multiple data sources, and Plexe will identify what it needs based on the problem description. We're also gradually developing support for handling terabyte-scale data, though we're not there yet. We started by validating our approach on well-defined problems with clean datasets, but we've been systematically adding capabilities to handle increasingly complex scenarios.

On your second point about automating the "fun part", we see Plexe as amplifying that creativity rather than automating it. We're trying to make it easier to design experiments and evaluate results. But would love to hear your feedback on this!
