by bGl2YW5j 336 days ago
If the model was able to generalise, you’d expect it to output something like “[silence]” or “…”, in response to silence.

Instead, it reverted to what it has seen before (in the training data), hence the overfit.

4 comments

Right, maybe my definition of overfitting was wrong, I always understood it more as trying to optimize for a specific benchmark / use case, and then it starts failing in other areas.

But the way you phrase it, that it’s just “the model is not properly able to generalize”, i.e. it doesn’t understand the concept of silence, also makes sense.

But couldn’t you then argue that any type of mistake / unknown could be explained as “overfitting” ? Where do you draw the line ?

I don't think so. Overfitting = the model was too closely aligned to the training data and can't generalize towards *unseen* data. I think it saw "silence" before, so it's not overfitting but just garbage in, garbage out.

Your definition is one, but the one the OP is using is overfitting to training data.

That’s exactly my point: by that definition any incorrect answer can be explained by “overfitting to training data”.

Where do you draw the line between “overfitting to training data” and “incorrect data” ?

> That’s exactly my point: by that definition any incorrect answer can be explained by “overfitting to training data”.

Not really. Getting 94381294*123=... wrong, but close to the actual answer, cannot be overfitting, since it wasn't in the training data.

> [By] that definition any incorrect answer can be explained by “overfitting to training data”.

No, it can't. For instance, some errors would be caused by underfitting. The data could also be correct while your hyperparameters (such as the learning rate or dropout rate) cause your model to overfit.

> Where do you draw the line between “overfitting to training data” and “incorrect data” ?

There's no need to draw a line between two explanations that aren't mutually exclusive. They can (as in this case) both be true. Overfitting is the symptom; dirty data is the cause.
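To make the underfitting/overfitting distinction concrete, here is a minimal sketch. It uses k-nearest-neighbour regression on made-up toy data as a stand-in (nothing to do with the speech model in question); `knn_predict`, `mse` and `make_data` are names I invented for the illustration. The data stays the same and only the hyperparameter `k` changes: `k=1` memorises the training set, while `k=len(train)` predicts the global mean everywhere.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN predictor on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

def make_data(rng, n):
    # Toy target: y = 3x plus Gaussian noise (an assumption for the demo)
    return [(x, 3 * x + rng.gauss(0, 0.3))
            for x in (rng.uniform(0, 1) for _ in range(n))]

rng = random.Random(0)
train = make_data(rng, 40)
test = make_data(rng, 40)

# k=1: zero training error but nonzero test error (overfit).
# k=len(train): same prediction everywhere, poor on both sets (underfit).
for k in (1, 5, len(train)):
    print(k, round(mse(train, train, k), 3), round(mse(train, test, k), 3))
```

With `k=1` the training error is exactly zero because every point is its own nearest neighbour, yet the test error is not. That gap between seen and unseen data is what I'd call overfitting; a model that is bad on both is underfitting.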

I think it's a classification issue.

Silence is never put in the subtitles of a film, since it isn't necessary. The viewers can tell that nothing is being said if there are actors on the screen. And in situations where there are no actors, then there will be a subtitle to indicate what is going on, like "[rock music plays]".

Subtitle authors use this silence to fit in meta information and have done so since the closed captions era.

Proper data cleaning would strip this metadata from any subtitle sources. Since this wasn't done, this is fundamentally a classification issue. It may also be an overfitting issue, but that is secondary to the classification problem.
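As a rough sketch of what that cleaning step could look like: the snippet below drops cues that look like credits and keeps everything else, including sound descriptions. The `CREDIT_HINT` pattern, the `(start, end, text)` cue layout and the function names are all assumptions made up for this example; a real pipeline would need a far longer, language-aware list.

```python
import re

# Hypothetical credit markers, for illustration only.
CREDIT_HINT = re.compile(
    r"\b(subtitles? by|captions? by|translated by|copyright)\b|©",
    re.IGNORECASE,
)

def is_metadata(text):
    """True if the cue text looks like a credit line rather than audio content."""
    return bool(CREDIT_HINT.search(text))

def clean_cues(cues):
    """Keep only cues whose text does not look like metadata.

    Each cue is assumed to be a (start_seconds, end_seconds, text) tuple.
    """
    return [cue for cue in cues if not is_metadata(cue[2])]

cues = [
    (0.0, 2.0, "Hello there."),
    (2.0, 5.0, "[rock music plays]"),
    (5.0, 9.0, "Subtitles by SomeGroup"),
    (9.0, 12.0, "© 2020 Example Studio"),
]
cleaned = clean_cues(cues)
print([text for _, _, text in cleaned])
```

Note that "[rock music plays]" survives on purpose: it describes the audio, whereas the credit lines describe nothing that can be heard.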

I think it's a data quality problem first, which might lead to a sort of overfitting as a consequence.

How would the AI know that a series of zero-amplitude audio samples should generate the string "[silence]"?

It can only know that if the vast majority of silent audio segments in the training set are consistently labelled with that string. But that doesn't seem to be the case: silence is either not labelled at all, or labelled with all kinds of different markers, or labelled with unrelated things, like copyright credits.

So even if the model successfully learns a generalized representation of the concept of "silence", it's not clear at all which of all the different labels it should use for that concept.

So what might happen is that the model then starts to overfit on the tiny variations of the individual silence segments, in a desperate attempt to devise some kind of system behind all the different "silence" labels - which will of course go wrong spectacularly, as such a system doesn't exist. (Or if it does, it is entirely accidental and not something that should be learned.)
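A small sketch of why inconsistent labels are a problem in themselves. If the inputs are effectively identical (silence) but the labels disagree, the best any model can do without memorising noise is to pick the most common label, and its accuracy is capped at that label's share. The label counts below are invented for the illustration, not measured from any dataset, and `best_constant_label` is a hypothetical helper.

```python
from collections import Counter

def best_constant_label(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

# Made-up labels attached to ten silent segments (illustrative only).
silence_labels = (
    [""] * 5                 # not labelled at all
    + ["[silence]"] * 2      # one kind of marker
    + ["..."] * 1            # another marker
    + ["Subtitles by X"] * 2 # unrelated credits
)

label, accuracy = best_constant_label(silence_labels)
print(repr(label), accuracy)
```

Anything above that ceiling on the training set has to come from fitting accidental differences between the silent segments, which is exactly the overfitting described above.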

It's actually because it is incapable of recognising when it does not know the answer. It will give you the nearest match, even if that is completely incorrect.
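By analogy only (this is not how the speech model works internally), here is a sketch of a nearest-match classifier with and without the ability to say "I don't know". The `classify` function, the `max_distance` parameter and the one-dimensional toy examples are all assumptions for the illustration.

```python
def nearest(query, examples):
    """Return the (value, label) example closest to the query."""
    return min(examples, key=lambda e: abs(e[0] - query))

def classify(query, examples, max_distance=None):
    """Label of the nearest example, or None if it is too far away.

    With max_distance=None the classifier always answers, however
    poor the match is.
    """
    value, label = nearest(query, examples)
    if max_distance is not None and abs(value - query) > max_distance:
        return None  # abstain instead of guessing
    return label

examples = [(1.0, "speech"), (2.0, "music")]

print(classify(50.0, examples))                    # always answers
print(classify(50.0, examples, max_distance=1.0))  # abstains
print(classify(1.1, examples, max_distance=1.0))   # close enough to answer
```

Without the threshold, a query that looks nothing like the training examples still gets a confident-looking label, which is the failure mode described here.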