by zatel 2078 days ago
This is so cool!

I know the answer is to just write what I'm describing myself, but does anyone know of an existing way to find the best scikit-learn algorithm for a particular problem? For example, if I want to find the best regression fit, is there a way to just pass in the data and have it trained and tested on all of the regression algorithms in scikit-learn? My current workflow is to pick a handful of algorithms that sound like they should be good for the problem at hand and try each one of them manually. Igel seems like a step towards making this sort of thing possible if another tool doesn't exist already.

3 comments

Hi, we should be careful with the feature you are describing. The results from machine learning algorithms can be very misleading, and some models will probably overfit the data.

If you throw some data at all the machine learning models, fit them, and then compare their performance, you will probably get misleading values, since different models require different tuning approaches. It's not as easy as it sounds: you can't just feed data to models and expect to get the best model out (it also depends on the data).

One approach I can think of is to integrate cross-validation and hyperparameter tuning with your suggestion. However, I can imagine this being computationally expensive. I will take it into consideration as an enhancement for the tool. Thanks for your feedback.
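A rough sketch of what "compare models with cross-validation and a small hyperparameter grid" could look like is below. It is not igel's implementation: it uses only the Python standard library on a toy 1-D regression problem, and every name in it (`kfold_indices`, `cv_mse`, `compare`, the three toy models) is made up for illustration. In scikit-learn itself, `cross_val_score` and `GridSearchCV` cover the same ground.

```python
import random
from statistics import mean

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def make_knn(k):
    """k-nearest-neighbours regressor; k is the hyperparameter being tuned."""
    def fit(xs, ys):
        pairs = list(zip(xs, ys))
        def predict(x):
            nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
            return mean(y for _, y in nearest)
        return predict
    return fit

def cv_mse(fit, xs, ys, folds):
    """Mean squared error averaged over the held-out folds."""
    scores = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mean((model(xs[i]) - ys[i]) ** 2 for i in test))
    return mean(scores)

def compare(candidates, xs, ys, k=5):
    """Score every candidate on the same folds; return the best name and all scores."""
    folds = kfold_indices(len(xs), k)
    results = {name: cv_mse(fit, xs, ys, folds) for name, fit in candidates.items()}
    best = min(results, key=results.get)
    return best, results

# Toy data (an assumption for the demo): a noisy straight line.
rng = random.Random(42)
xs = [i / 10 for i in range(60)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.3) for x in xs]

candidates = {"mean": fit_mean, "linear": fit_linear}
candidates.update({f"knn(k={k})": make_knn(k) for k in (1, 3, 7)})

best, results = compare(candidates, xs, ys)
for name, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} cv_mse={score:.4f}")
```

Every candidate is scored on the same folds, so the numbers are comparable. The winning score is still optimistic, because the winner was chosen using those same folds; a final check on data that was never used for selection is the usual guard against the overfitting concern raised above.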

Thank you for explaining this in more depth. I should have been more specific in my original comment: I did intend cross-validation and hyperparameter tuning to be included in the automatic feature I was describing.

These operations certainly are computationally expensive. A recent hyperparameter tuning run locked up my laptop for 3 days, but this seems to be the case for any similar operation. The only approaches I've come across so far to overcome it are reducing the data to a smaller size (which seems outside the scope of this tool) and some way to batch the data so that the run can be "paused" and resumed as needed. Thank you again for creating Igel.
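One way to get the "pause and resume" behaviour is to checkpoint the search itself rather than batch the data: write each finished hyperparameter combination to disk, and skip anything already recorded on the next run. The sketch below assumes that reading of the idea. It is standard-library Python only, and `resumable_search`, `param_grid`, `toy_score`, and the `max_new` parameter are hypothetical names for illustration, not part of igel or scikit-learn.

```python
import itertools
import json
import os
import tempfile

def param_grid(grid):
    """Yield every combination of the listed hyperparameter values."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def resumable_search(grid, evaluate, checkpoint_path, max_new=None):
    """Evaluate grid points, saving after each one so a later call can resume.

    max_new caps how many new points this call evaluates (a stand-in for
    the user interrupting the run).
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    new = 0
    for params in param_grid(grid):
        key = json.dumps(params, sort_keys=True)
        if key in done:
            continue  # already scored in an earlier run
        if max_new is not None and new >= max_new:
            break
        done[key] = evaluate(params)
        new += 1
        # Write to a temp file and rename, so an interruption mid-write
        # cannot leave a truncated checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(done, f)
        os.replace(tmp, checkpoint_path)
    return done

# Toy scoring function standing in for an expensive cross-validated fit.
calls = []
def toy_score(params):
    calls.append(params)
    return (params["alpha"] - 0.1) ** 2 + params["depth"]

grid = {"alpha": [0.01, 0.1, 1.0], "depth": [1, 2]}
path = os.path.join(tempfile.mkdtemp(), "search.json")

first = resumable_search(grid, toy_score, path, max_new=2)  # "pause" after 2 points
second = resumable_search(grid, toy_score, path)            # resume the remainder
best_key = min(second, key=second.get)                      # lower score is better
print(best_key, second[best_key])
```

The second call only evaluates the four combinations the first call did not reach, so no work is repeated. This does nothing to reduce the total cost; it only makes the cost interruptible.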

Hey, I really appreciate your answer to this question. As I was reading the question, red flags started popping up in my mind about the risk of overfitting when using the ensemble approach, and I think your response was spot on for how an ML researcher would go about it! Most ML professionals I've talked to have been really against making a user-friendly ML suite because of how easy it is to misuse these algorithms.

I think you’re looking for something like AutoML by H2O[0]. There are a few similar offerings out there if you search around ‘automl’.

[0] https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

Thank you, this is the keyword I think I was missing in my searches, and now I feel silly for not thinking of it.

Triage is built for this: training and evaluating giant grids of models & hyperparameters using cross-validation. Similar to igel, it abstracts ML to config files and a CLI.

It's designed for use in a public policy context, so it works best with:
- binary classification problems (e.g., the evaluation module is designed for binary classification metrics)
- problems that have a temporal component (the cross-validation system makes some assumptions about this)

https://dssg.github.io/triage/

Thank you for sharing this.