by pplonski86 1950 days ago
What type of ML algorithms do you support? Do you have benchmarks with performance?

Regarding benchmarks, we have three main dataset collections we focus on currently:

1. Datasets from customers, but obviously those can’t be made public.

2. The OpenML benchmark, which is fairly limited because it’s mainly binary categories, but which is good because it’s third-party and therefore unbiased. We have some intermediate results here (https://docs.google.com/spreadsheets/d/1oAgzzDyBqgmSNC6g9CFO...); they are middle-of-the-road. However, I think the benchmark is pretty limited, i.e. it doesn’t cover most of the kinds of inputs and almost none of the outputs we support.

3. An internal benchmark suite which currently has 59 datasets, mainly focused around classification and regression tasks with many inputs, timeseries problems and text. Some part of it is public, but opening the rest up is a bit difficult due to licensing issues. I’m hoping that in the next year it will grow and 90%+ of it can be made public. We benchmark against older versions of mindsdb, against hand-made models we try to adapt to the task, against the state-of-the-art accuracy for the dataset (if we can find it) and against a few other auto ML frameworks (well, 1, but I hope to extend that list) [see this repo for the ones we made public: https://github.com/mindsdb/benchmarks, but I'm afraid it's a bit outdated].

That being said, benchmarking for us is still WIP, since as far as I can tell nobody is trying to build open source models that are as broad as what we're currently doing (for better or worse), and the closed source services offered by various IaaS providers don't really come with public benchmark results outside of marketing.

The benchmarking challenges you are facing are pretty common in the AutoML community. My colleagues and I at Google Research are trying to solve this with https://github.com/google/nitroml. It's still super early days (no CI yet), but I think it could help your team benchmark on a set of open standard benchmark tasks as we open source more of the system.
Looks quite interesting, already pinned this in the relevant slack channel :)

To be honest I'm rather happy with how the internal benchmark suite is turning out, but to some extent you are inviting bias by creating them yourself. On top of that, it doesn't hurt to have more benchmarks.

At the end of the day it's a combination of:

* How much work it is to integrate (easy to measure)
* How visible it is, i.e. if we actually find something interesting, will it be visible and legible to others (iffy to measure; citations, stars, etc. are some indication)
* How useful it is to "improve" the library (hard to measure, and what we aim to be good at is a moving target)

So realistically that's the equation I have to judge when it comes to adding a new benchmark suite, and it's very annoying because you'll note the most important things are the hardest to measure.

Would you want people to integrate with this now, or would you rather wait a few weeks/months/years until it matures more? If the former, can you give a few details regarding where to start (the README is fairly barren); if the latter, please ping me (george.hosu@mindsdb.com) when you think it could be ready to try.

Anyway, any open benchmark library is a step in the right direction, thanks for working on this :)

Thanks for your feedback! Based on the description of how you already do things, I'd say you're ahead of the curve as far as rigorous model quality benchmarking goes. You should absolutely hold off on using nitroml for a few months until it's more mature. It's very much pre-prerelease in a build-in-the-open sense. :) I'll shoot you an email once it's ready for anyone to try out. When the time comes, we'll have a blog post to announce it, and will include proper documentation.

And, congrats on the launch!

The design is modular, such that it can support anything under the hood.

Essentially you have encoders for all of the columns, which then get piped into a mixer and then into decoders to predict the final output(s). These encoders and decoders can be any type of ML model, but our current focus is on neural networks.

So, e.g., if you have a text like "A cute cat" and the number 5, and your target is an image (let's assume you have a training set such that the model would learn to generate one with 5 cute cats), then you have:

1. A text encoder generates an embedding for "A cute cat" and a numerical encoder normalizes "5".
2. A mixer (which can be e.g. an FCNN or gradient booster) generates an intermediate representation.
3. A decoder that is trained to generate images takes that representation and generates an image.
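
To make the encoder -> mixer -> decoder flow concrete, here is a minimal toy sketch in plain Python. It is not lightwood's actual code or API: every name (`encode_text`, `make_numeric_encoder`, `mix`, `decode_category`, `predict`) is hypothetical, the weights are hand-picked rather than learned, and it decodes to a category instead of an image to keep things short.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def encode_text(text: str, dim: int = 4) -> Vector:
    """Toy text encoder: hash each word into a fixed-size bag-of-words vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(ch) for ch in word) % dim] += 1.0
    return vec


def make_numeric_encoder(mean: float, std: float) -> Callable[[float], Vector]:
    """Toy numerical encoder: normalize with statistics taken from training data."""
    def encode(value: float) -> Vector:
        return [(value - mean) / std]
    return encode


def mix(encoded_columns: Sequence[Vector], weights: Sequence[Sequence[float]]) -> Vector:
    """Toy mixer: concatenate all column encodings, then one linear layer + ReLU."""
    features = [v for column in encoded_columns for v in column]
    return [max(0.0, sum(w * f for w, f in zip(row, features))) for row in weights]


def decode_category(hidden: Vector, labels: Sequence[str]) -> str:
    """Toy decoder: pick the label whose hidden unit is largest."""
    best = max(range(len(labels)), key=lambda i: hidden[i])
    return labels[best]


def predict(text: str, number: float,
            weights: Sequence[Sequence[float]], labels: Sequence[str]) -> str:
    # The mean/std here are made-up stand-ins for values fitted on a training set.
    encode_number = make_numeric_encoder(mean=3.0, std=2.0)
    encoded = [encode_text(text), encode_number(number)]
    return decode_category(mix(encoded, weights), labels)


# Hand-picked weights: 2 hidden units over 4 text features + 1 numeric feature.
WEIGHTS = [[0.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0, 1.0]]
print(predict("A cute cat", 5, WEIGHTS, ["few", "many"]))
```

The point of the split is that each stage can be swapped independently: a different encoder per column type, a different mixer, and a different decoder per target type.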

Note: the above is a good illustrative example; in practice, we're good at outputting dates, numbers, categories, tags and time-series (i.e. predicting 20 steps ahead). We haven't put much work into image/text/audio/video outputs.

You should be able to find more details about how we do this in the docs. Most of the heavy lifting happens in the lightwood repo, and the code for that is fairly readable, I hope: https://github.com/mindsdb/lightwood

Also worth mentioning: Mindsdb can take input columns of any of the following types (numerical, categorical, text, images), and it's getting pretty good at timeseries problems, for which we support a variety of techniques, including novel approaches to sequential data (RNNs, Transformers, CNN tiling, ...). Given the nature of data in databases, where there is often a chronological order of transactions, we put a lot of focus on offering capabilities that make the models time-aware.
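
As a rough illustration of what "time-aware" can mean for chronologically ordered rows, one common generic approach is to slice a series into (history, future) pairs that a model is then trained on. The sketch below is an assumption for illustration only, not a description of how Mindsdb does it; `make_windows`, `lookback` and `horizon` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[float], lookback: int,
                 horizon: int) -> List[Tuple[List[float], List[float]]]:
    """Turn a chronologically sorted series into (history, future) training pairs.

    Each pair holds `lookback` past values as input and the next `horizon`
    values as the target, e.g. horizon=20 for predicting 20 steps ahead.
    """
    if lookback <= 0 or horizon <= 0:
        raise ValueError("lookback and horizon must be positive")
    pairs = []
    # A series shorter than lookback + horizon yields no pairs.
    for start in range(len(series) - lookback - horizon + 1):
        split = start + lookback
        pairs.append((list(series[start:split]),
                      list(series[split:split + horizon])))
    return pairs


# Rows must already be ordered by timestamp before windowing.
for history, future in make_windows([10, 11, 12, 13, 14, 15], lookback=3, horizon=2):
    print(history, "->", future)
```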