Hacker News
by fc417fc802 104 days ago
> you get the gist of reading the meaning of something when the occasional word is missing,

I think it's more subtle than that. IIUC the tokens were all present for the purpose of computing the output and the score is based on the output. It's only the weight update where some of the tokens get ignored. So the learning is lossy but the inference driving the learning is not.
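To make that concrete, here is a rough sketch of the idea as I understand it, not the actual implementation. Everything in it is an assumption for illustration: the `keep_in_loss` hash rule, the 25% drop fraction, the 3-token window, and the stand-in logits are all made up. The point is only that the logits exist for every position (the model "saw" everything), while the masked positions contribute exactly zero to the loss and to the gradient.

```python
import hashlib
import math

def keep_in_loss(context, drop_fraction=0.25):
    """Hypothetical rule: deterministically decide whether a position counts
    toward the loss, based on a hash of the tokens ending at that position."""
    digest = hashlib.sha256(" ".join(context).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket >= drop_fraction

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_loss_and_grads(logits_per_pos, targets, mask):
    """Cross-entropy averaged over kept positions only.
    Returns (loss, gradient of the loss w.r.t. each position's logits)."""
    kept = sum(mask) or 1  # avoid dividing by zero if everything is masked
    loss = 0.0
    grads = []
    for logits, target, keep in zip(logits_per_pos, targets, mask):
        probs = softmax(logits)  # computed for every position
        if keep:
            loss += -math.log(probs[target]) / kept
            grads.append([(p - (i == target)) / kept for i, p in enumerate(probs)])
        else:
            # Masked position: no loss term, so no learning signal from it.
            grads.append([0.0] * len(logits))
    return loss, grads

# Toy run. The logits are stand-ins for what a real model would output
# after reading all of the tokens; they are not produced by any model here.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
mask = [keep_in_loss(tokens[max(0, i - 2): i + 1]) for i in range(len(tokens))]
logits = [[0.1 * (i + j) for j in range(3)] for i in range(len(tokens))]
targets = [i % 3 for i in range(len(tokens))]
loss, grads = masked_loss_and_grads(logits, targets, mask)
print(mask, round(loss, 4))
```

In a real transformer the masked tokens would still influence the gradient indirectly, since later kept positions attend to them. The sketch only shows that their own prediction error is never scored.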

Rather than a book that's missing words, it's more like a person with a minor learning disability that prevents him from recalling anything perfectly.

However it occurs to me that data augmentation could easily break the scheme if care isn't taken.
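A quick sketch of the failure mode I have in mind, assuming (and this is purely my assumption, I don't know the real keying) that the mask is derived from a hash of a short trailing window of tokens. `dropped_positions`, the window size, and the drop fraction are all hypothetical. If augmentation swaps a single word, every window containing that word hashes differently, so the drop decision for those positions is effectively re-rolled and text that was excluded in one variant may be scored in another.

```python
import hashlib

def dropped_positions(tokens, window=3, drop_fraction=0.25):
    """Hypothetical rule: position i is excluded from the loss when the hash
    of its trailing `window` tokens lands in the lowest `drop_fraction`."""
    dropped = set()
    for i in range(len(tokens)):
        ctx = " ".join(tokens[max(0, i - window + 1): i + 1])
        digest = hashlib.sha256(ctx.encode("utf-8")).digest()
        if int.from_bytes(digest[:8], "big") / 2**64 < drop_fraction:
            dropped.add(i)
    return dropped

def rehashed_positions(a, b, window=3):
    """Positions whose hash input differs between two equal-length variants."""
    return [
        i for i in range(len(a))
        if a[max(0, i - window + 1): i + 1] != b[max(0, i - window + 1): i + 1]
    ]

original = "the quick brown fox jumps over the lazy dog".split()
augmented = "the quick brown fox leaps over the lazy dog".split()  # one synonym swap

print("dropped in original: ", sorted(dropped_positions(original)))
print("dropped in augmented:", sorted(dropped_positions(augmented)))
print("positions re-hashed: ", rehashed_positions(original, augmented))
```

Positions outside the re-hashed range keep the same decision in both variants. The ones inside it get an independent draw, which is where the scheme could leak if augmented copies are trained on.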

1 comment

Yeah, it's a bit hard to describe what is happening, because the process doesn't really have a human analogue.

People have a difficult enough time dealing with how loss-reduction learning is or isn't 'seeing' the data. Selectively removing things from the loss while still feeding it all the data takes the non-intuitive situation one layer deeper.

That's partially why I described the hash & masking process. I understand it at the level of the formulas, but I don't really feel like I have a good handle on what is happening semantically. It's like thinking in 5D: you can do the calculations, but it still feels like your brain is not equipped to deal with what it means.