by machinelearning 613 days ago
This is a good problem to solve but the approach is wrong imo.

It has to be done hierarchically, so that the correction step knows both what you attended to and the full context.

If the differential vector is being computed from the same input as the attention vector, how do you know how to modify the attention vector correctly?

1 comment

Doesn't everything just get tweaked in whatever direction the back-propagation derivative says, and in proportion to that "slope"? In other words, simply by having a back-propagation system in effect, there's never any question about which way to adjust the weights, right?
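
To make that concrete, here is a minimal sketch of the update rule I have in mind. It is plain Python with a made-up one-parameter quadratic loss; the names `loss`, `grad`, `sgd_step` and the learning rate `lr` are my own illustrations, not anything from the approach being discussed. The weight moves against the derivative, and the size of the move scales with the slope.

```python
# Hypothetical toy example: one weight, loss minimized at w = 3.0.

def loss(w):
    """Squared distance from the (assumed) target value 3.0."""
    return (w - 3.0) ** 2

def grad(w):
    """Analytic derivative of loss with respect to w."""
    return 2.0 * (w - 3.0)

def sgd_step(w, lr=0.1):
    """One gradient-descent step: move against the slope,
    by an amount proportional to the slope."""
    return w - lr * grad(w)

# Repeated steps from an arbitrary starting point.
w = 0.0
for _ in range(50):
    w = sgd_step(w)

print(w, loss(w))
```

Starting below the minimum, the derivative is negative, so the weight goes up; starting above it, the weight goes down. Either way the sign and magnitude of the adjustment come straight from the derivative, at least for a loss as simple as this one.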