by digitalzombie 2753 days ago

More data is better.

You can reduce it via PCA, one of the many techniques in multivariate statistics.

You can do anova to select your predictors.

In general you can use a subset of it using the tools that statistics has provided.
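As a rough sketch of the ANOVA suggestion: one common way to screen predictors for a categorical response is to compute a one-way ANOVA F statistic per predictor and rank by it. The function name `one_way_f` and the toy data are made up for illustration; this is not a claim about how the commenter does it.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own class mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical toy data: each predictor's values, split by the three
# classes of the response variable.
predictors = {
    "informative": [[1, 2, 3], [2, 3, 4], [5, 6, 7]],
    "uninformative": [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
}
scores = {name: one_way_f(groups) for name, groups in predictors.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)   # F is about 13 for "informative" and 0 for "uninformative"
print(ranked)
```

A predictor whose class means differ a lot relative to the within-class spread gets a large F and would be kept; one whose class means coincide gets an F near zero.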

Complaining about messy data... welcome to the real world. As for complaining about non-reproducible models, choose reproducible ones. I've mostly done statistical models and forest-based algorithms, and they're all reproducible.

All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?

> The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns

CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.

6 comments

YeGoblynQueenne 2753 days ago

>> All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?

The solution is to direct research effort towards learning algorithms that generalise well from few examples.

Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.

>> You can reduce it via PCA, one of the many techniques in multivariate statistics.

PCA is a dimensionality reduction technique. It reduces the number of features required to learn. It doesn't do anything about the number of examples that are needed to guarantee good performance. The article is addressing the need for more examples, not more features.
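A minimal sketch of that distinction, using a hand-rolled two-feature PCA so it needs only the standard library. The helper names and the synthetic data are assumptions for illustration. The point is only that the projection changes the width of each example, not the number of examples.

```python
import math
import random

def first_principal_axis(points):
    """Return the mean and the unit first principal axis of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form orientation of the major axis of a 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))

def project_to_one_feature(points):
    """Replace each 2-feature example with its score on the first component."""
    (mean_x, mean_y), (ux, uy) = first_principal_axis(points)
    return [(x - mean_x) * ux + (y - mean_y) * uy for x, y in points]

# Synthetic examples: the second feature is almost a multiple of the first.
rng = random.Random(0)
examples = []
for _ in range(200):
    x = rng.gauss(0, 1)
    examples.append((x, 2 * x + rng.gauss(0, 0.1)))

reduced = project_to_one_feature(examples)
print(len(examples), len(examples[0]))  # 200 examples, 2 features each
print(len(reduced))                     # still 200 examples, 1 number each
```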

nimithryn 2753 days ago

>>>Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.

This is only true for the Facebooks and Googles of the world. There are definitely small companies (like the one I work for) trying very hard to figure out how to build models that use less data because we don't have access to those large datasets.

The industry is larger than just the Big N.

YeGoblynQueenne 2753 days ago

Btw, if you have relational data and a few good people with strong computer science backgrounds rather than statisticians or mathematicians, have a look at Inductive Logic Programming. ILP is a set of machine learning techniques that learn logic programs from logic programs. The sample efficiency is in a class of its own and it generalises robustly from very little data[1].

I study ILP algorithms for my PhD. My research group has recently developed a new technique, Meta Interpretive Learning. Its canonical implementation is Metagol:

https://github.com/metagol/metagol

Please feel free to email me if you need more details. My address is in my profile.

___________________

[1] As a source of this claim I always quote this DeepMind paper where Metagol is compared to the authors' own system (which is itself an ILP system, but using a deep neural net):

https://arxiv.org/abs/1711.04574

> ILP has a number of appealing features. First, the learned program is an explicit symbolic structure that can be inspected, understood, and verified. Second, ILP systems tend to be impressively data-efficient, able to generalise well from a small handful of examples. The reason for this data-efficiency is that ILP imposes a strong language bias on the sorts of programs that can be learned: a short general program will be preferred to a program consisting of a large number of special-case ad-hoc rules that happen to cover the training data. Third, ILP systems support continual and transfer learning. The program learned in one training session, being declarative and free of side-effects, can be copied and pasted into the knowledge base before the next training session, providing an economical way of storing learned knowledge.

nimithryn 2753 days ago

Ah yes I am very familiar with ILP - thanks for sending these references!

YeGoblynQueenne 2753 days ago

You're welcome, and what a pleasant surprise, it's rare to find people who know about ILP in the industry :)

YeGoblynQueenne 2753 days ago

You're absolutely right and I appreciate that very much. On the other hand, there's an incredible amount of hype around Big Data and deep learning, exactly because the large corporations are doing it. So now everyone wants to do it, whether they have the data for it or not, whether it really adds anything to their products or not.

As to the Big N (good one), what I meant to say is that I don't see them trying very hard to undo their own advantage by spending much effort developing machine learning techniques that rely on, well, little data. That would truly democratise machine learning, much more so than the release of their tools for free, etc. But then, if everyone could do machine learning as well as Google and Facebook et al, where would that leave them?

spongepoc 2753 days ago

>CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.

Yes it does. It even implies it in the name 'limit'. In the limit of infinitely many samples, we approximate a normal distribution. This approximation has diminishing returns.
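One way to make "diminishing returns" concrete, assuming it refers to the usual square-root scaling that comes with the CLT: the spread of a sample mean shrinks like sigma / sqrt(n), so quadrupling the sample size only halves the error. The simulation below is a sketch with made-up names and settings (uniform draws, 2000 repetitions), not something taken from the article.

```python
import math
import random
import statistics

def spread_of_sample_means(n, trials=2000, seed=0):
    """Standard deviation of the mean of n uniform(0, 1) draws, by simulation."""
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    return statistics.stdev(means)

sigma = math.sqrt(1 / 12)  # standard deviation of a uniform(0, 1) variable

for n in (25, 100, 400):
    observed = spread_of_sample_means(n)
    predicted = sigma / math.sqrt(n)
    # Each 4x increase in n roughly halves both columns
    print(n, round(observed, 4), round(predicted, 4))
```

Going from 25 to 100 samples buys the same relative improvement as going from 100 to 400, at four times the cost.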

>All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?

It's fine to point out problems without giving solutions. You seem very aggravated.

b_tterc_p 2753 days ago

PCA has specific use cases. It’s not a catch-all dimensionality reduction technique. You can’t use it effectively, for example, if things are not linearly correlated. There are of course many tools for addressing many problems, but as the title states, this is often a grind. For any practical problem, exclusive of huge black box neural nets where you don’t need to understand the model, you are probably better off starting with a smaller set of reasonable sounding features and then slowly growing out your model to incorporate others.
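A small illustration of the linear-correlation point, on made-up data: points on a circle are fully described by one number (the angle), yet the two coordinates have essentially zero linear correlation, so PCA finds no direction worth dropping. The variable names are illustrative only.

```python
import math

# 360 evenly spaced points on the unit circle: one underlying degree of
# freedom (the angle), expressed as two features.
n = 360
points = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
          for k in range(n)]

mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

pearson_r = cov_xy / math.sqrt(var_x * var_y)

# Eigenvalues of the 2x2 covariance matrix in closed form
half_trace = (var_x + var_y) / 2
radius = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
largest, smallest = half_trace + radius, half_trace - radius
explained_by_first = largest / (largest + smallest)

print(pearson_r)           # essentially zero
print(explained_by_first)  # about 0.5: neither component can be dropped
```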

Also if you meant random forest by forests... those aren’t especially reproducible. Understanding what’s going on is not always easy, and most people seem to misinterpret the idea of “variable importance” when you have a mix of categorical and numeric features. Decision trees and linear regressions are nice and reproducible.
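If "reproducible" is read as "the same run gives the same model", the randomness in a random forest comes from steps like bootstrap resampling (and random feature subsets), and it can be pinned with a seed. The sketch below covers only the resampling step, with a hypothetical helper name; it does not build trees and says nothing about interpretability, which is a separate concern.

```python
import random

def bootstrap_indices(n_rows, n_trees, seed):
    """Draw one bootstrap sample of row indices per tree from a seeded RNG."""
    rng = random.Random(seed)
    return [[rng.randrange(n_rows) for _ in range(n_rows)]
            for _ in range(n_trees)]

run_a = bootstrap_indices(n_rows=10, n_trees=3, seed=42)
run_b = bootstrap_indices(n_rows=10, n_trees=3, seed=42)
run_c = bootstrap_indices(n_rows=10, n_trees=3, seed=7)

print(run_a == run_b)  # True: same seed, same resamples, same ensemble inputs
print(run_a == run_c)  # a different seed gives different resamples
```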

apercu 2753 days ago

> Complaining about messy data... welcome to the real world.

I mean, that's the crux: if you have bad data you will have bad results. Data cleanup/transformation is key for anything (reporting, etc.), not just for ML because it's sexy these days.

Breza 2748 days ago

Nice to see a statistician weighing in on this post

iagovar 2753 days ago

Thank you, I'm not crazy. I was reading HN very confused.
