| HN Mirror


	by graedus 2325 days ago
	It's scary not because this specific dataset was used to train Teslas that are on the road today. Rather, because it makes us aware of an entire class of errors that most of us probably hadn't thought about before. I guess you are absolutely certain that training data used in production cars will be free of these issues, but it's not clear why.

tomnipotent 2325 days ago

It does not make us aware of a new class of errors. Labeling issues are nothing new, but plenty of systems trained on them continue to work just fine. This is FUD.

jwandborg 2324 days ago

Is there any statistical/mathematical tool to completely eradicate or greatly diminish the effects of bad labeling? Is there any reason - other than the combination of pure circumstance and gut feeling of the Data Scientist in charge of saying that it's good enough to deploy - that ~33% insanity in training doesn't become ~33% insanity in the system?

neurobro 2324 days ago

Not sure about bad labels, but semi-supervised learning is the term for training on data with a lot of missing labels. Essentially the algorithm makes predictions on the unlabeled data and uses its highest confidence predictions as additional training data. Generative models can also "dream up" entirely new training examples. There is a risk of amplifying the confidence in bad predictions, but it works well overall (better than using only the labeled portion of the data).
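A minimal sketch of the self-training loop described above, assuming a toy 1-D nearest-centroid classifier and made-up data points (the function names, the `per_round` parameter, and the confidence measure are all illustrative choices, not anything from the thread):

```python
def centroids(points, labels):
    """Mean of the points in each class (classes are 0 and 1)."""
    means = {}
    for cls in (0, 1):
        members = [p for p, y in zip(points, labels) if y == cls]
        means[cls] = sum(members) / len(members)
    return means


def self_train(labeled, unlabeled, per_round=2):
    """Grow the labeled set with the model's most confident guesses."""
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    while pool:
        means = centroids(points, labels)
        scored = []
        for p in pool:
            d0, d1 = abs(p - means[0]), abs(p - means[1])
            # Confidence: how much closer the point is to one centroid.
            scored.append((abs(d0 - d1), p, 0 if d0 < d1 else 1))
        scored.sort(reverse=True)
        # Adopt only the highest-confidence predictions as new labels.
        for _, p, guess in scored[:per_round]:
            points.append(p)
            labels.append(guess)
            pool.remove(p)
    return centroids(points, labels)


# Hypothetical data: four labeled points, seven unlabeled ones.
labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5, 4.0]
final = self_train(labeled, unlabeled)
print(final)
```

The ambiguous point (4.0) is labeled last, after the easy points have refined the centroids. If an early guess is wrong, later rounds build on it, which is the confidence-amplification risk mentioned above.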

tomnipotent 2324 days ago

> Is there any statistical/mathematical tool to completely eradicate or greatly diminish the effects of bad labeling

Yes, it's called statistics and probability theory.

jwandborg 2324 days ago

> Yes it's called statistics and probability theory.

My understanding of statistics is:

- I can halve the % insanity by adding another 100% of good labels.

- If I want to reduce the insanity of labels to 1/33rd of ~33% I need to add another 3200% of good labels.

- If I want to reduce the insanity to 0% I need to balance the bad labels with an infinite amount of good labels.
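The arithmetic in the three bullets above can be checked with a short sketch. It assumes the commenter's pure dilution model, where the only remedy is adding clean labels and the bad fraction becomes b / (1 + k) after adding k times the original dataset size; `good_labels_needed` is a hypothetical helper name:

```python
from fractions import Fraction


def good_labels_needed(bad_fraction, target_fraction):
    """Clean labels to add, as a multiple of the original dataset size.

    Dilution model: new_bad_fraction = bad_fraction / (1 + k),
    so k = bad_fraction / target_fraction - 1.
    """
    if target_fraction <= 0:
        return float("inf")  # 0% bad labels needs unbounded clean data
    return max(Fraction(0), bad_fraction / target_fraction - 1)


start = Fraction(33, 100)
print(good_labels_needed(start, start / 2))          # 1  -> another 100%
print(good_labels_needed(start, Fraction(1, 100)))   # 32 -> another 3200%
print(good_labels_needed(start, Fraction(0)))        # inf
```

This only measures the bad-label fraction of the training set. It says nothing about how a trained model's error relates to that fraction, which is the point the replies dispute.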

Is there anything I'm missing entirely except probability theory? Is probability theory the answer or is there something else?

tomnipotent 2324 days ago

You don't reach 0%, that's a straw man. The goal is to do better than human drivers, who account for the 35,000+ vehicle-related fatalities that happen in the U.S. each year.

perl4ever 2324 days ago

There's a disconnect here.

People who talk about the danger of humans driving cars always seem to talk about the raw numbers, because humans drive cars a lot and the raw numbers are rather large.

But when we talk about automated driving, it's in percentages, because it's not being done on the same scale.

So to compare apples to apples, you'd have to convert the number of fatalities to an accuracy percentage. Have you considered trying? There is certainly more than one way to do it, but it would greatly contribute to the discussion if you made some attempt.

jwandborg 2322 days ago

It's hard to reach 0% bad labels because:

1. You can't have an infinite amount of good labels.

2. Humans are in charge of labeling too.

The question is if you can reliably overcome the number of bad labels in your training set, so that 33% of bad labels equates to <33% "insanity" in the system.
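One way to see how 33% bad labels can yield far less than 33% error is a toy simulation. This sketch assumes the label errors are random flips that are independent of the input, and uses a per-bin majority vote as a stand-in for a learner; the bin layout, sample counts, and `simulate` name are all made up for illustration:

```python
import random


def simulate(noise_rate=0.33, bins=10, per_bin=200, seed=0):
    """Return (fraction of flipped training labels, model error rate)."""
    rng = random.Random(seed)
    flipped = 0
    wrong_bins = 0
    for b in range(bins):
        # Ground truth: the upper half of the bins is class 1.
        truth = 1 if b >= bins // 2 else 0
        votes_for_one = 0
        for _ in range(per_bin):
            label = truth
            if rng.random() < noise_rate:
                label = 1 - truth  # a randomly mislabeled example
                flipped += 1
            votes_for_one += label
        # The "model": predict whatever most training labels in the bin say.
        prediction = 1 if 2 * votes_for_one > per_bin else 0
        if prediction != truth:
            wrong_bins += 1
    return flipped / (bins * per_bin), wrong_bins / bins


label_noise, model_error = simulate()
print(f"label noise: {label_noise:.3f}, model error: {model_error:.3f}")
```

With roughly a third of the labels flipped, the correct label still wins the vote in each bin, so the model's error against the true labels ends up well below the label noise. This relies on the noise being random: labels that are wrong in a consistent way for some kind of input would not average out like this.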

diffeomorphism 2324 days ago

Your understanding is wrong for anything nonlinear. The whole reason machine learning is useful is because it is nonlinear.

jwandborg 2322 days ago

How nonlinear are we talking? My understanding is probably closer to the truth than to the opposite of the truth. I'm looking for an estimate of how far from the truth I am.

How would a system reliably discredit missing labels while still learning from good labels? The simplest solution would be that the system is able to spot the bad/missing labels itself with some certainty, but that seems like a catch-22.

jwandborg 2324 days ago

That's correct. I know what goes in and what comes out, not what happens in the middle. How does ~33% insanity in become < ~33% insanity out?

Edit: Parent was edited, was previously (paraphrased)

> I'm guessing you have no technical understanding of how this works

tomnipotent 2324 days ago

How does making up something ridiculous like "33% insanity" give you anything that resembles a subject that we can discuss? Hyperbole in, hyperbole out.

jwandborg 2322 days ago

I'm 33% insane myself. I believe that's part of what makes me human.

jasonwatkinspdx 2324 days ago

We have a lot more than the gut feeling you're assuming: https://arxiv.org/abs/1611.03530

GordonS 2324 days ago

Maybe we could use deep learning? Oh, wait...
