
This sounds reasonable, but I think in practice T1 won't have enough capacity to capture long patterns, and the F2 sequence is supposed to help T2 restore the information about longer patterns that T1 loses. The idea is to make T1 really good at capturing short patterns, like speech in pop music, while T2 is responsible for the background music with its longer patterns.

Don't we already do this with text translation? Why not let one model read printed text pixel by pixel and have another model produce the translation, also pixel by pixel? Instead, we split the printed text into small chunks (which we call words), give every chunk a "word vector" (as word2vec models do), and produce the output text one word at a time as well.
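
To make the chunking point concrete, here is a rough sketch of what "one vector per word" looks like. The vectors are just seeded random numbers standing in for trained word2vec embeddings, and `build_embeddings` and `encode` are names I made up for illustration, not any library's API.

```python
import random

def build_embeddings(words, dim=4, seed=0):
    """Give each distinct word a fixed random vector.

    Assumption: random values stand in for trained word2vec vectors.
    """
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for w in sorted(set(words))}

def encode(text, embeddings):
    """Split text into word chunks and look up one vector per chunk."""
    return [embeddings[w] for w in text.lower().split()]

sentence = "the cat sat on the mat"
embeddings = build_embeddings(sentence.lower().split())
vectors = encode(sentence, embeddings)

# The model sees a handful of word vectors instead of every character
# (or every pixel of the printed page).
print(len(sentence), "characters ->", len(vectors), "word vectors")
```

Both occurrences of "the" map to the same vector, which is the whole point: the short-range pattern (the letters of a word) is collapsed into one symbol, so the downstream model only has to deal with the longer-range structure.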