by davmre 518 days ago
They're not proposing to apply tensor decomposition to an existing collection of weights. It's an architecture in which the K, V, and Q tensors are constructed as a product of factors. The model works with the factors directly, and you just need to compute their product on the forward pass (and the adjoints on the backward pass), so no decomposition step is ever needed.
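
To make that concrete, here is a rough NumPy sketch of the general idea, not the paper's actual construction. It assumes the simplest possible case, a rank-`rank` product of two matrices per projection; the sizes, the names `init_factors`, `project`, and `project_backward`, and the two-factor form are all made up for illustration. The only parameters are the factors, the forward pass multiplies through them, and the backward pass is just the adjoint of those two matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only; none of these come from the thread or the paper.
d_model, d_head, rank, seq_len = 16, 8, 2, 5

def init_factors():
    # The trainable parameters are the two factors themselves.
    # No full (d_model, d_head) matrix is ever stored or decomposed.
    a = rng.normal(size=(d_model, rank)) / np.sqrt(d_model)
    b = rng.normal(size=(rank, d_head)) / np.sqrt(rank)
    return a, b

def project(x, factors):
    # Forward pass: apply the product of the factors, one factor at a time.
    a, b = factors
    return (x @ a) @ b

def project_backward(x, factors, grad_out):
    # Backward pass: adjoints of the two matrix products above,
    # giving gradients with respect to each factor.
    a, b = factors
    grad_a = x.T @ (grad_out @ b.T)
    grad_b = (x @ a).T @ grad_out
    return grad_a, grad_b

def attention(x, q_f, k_f, v_f):
    # Plain single-head softmax attention on the factored projections.
    q, k, v = project(x, q_f), project(x, k_f), project(x, v_f)
    scores = q @ k.T / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.normal(size=(seq_len, d_model))
q_f, k_f, v_f = init_factors(), init_factors(), init_factors()
out = attention(x, q_f, k_f, v_f)
grad_a, grad_b = project_backward(x, q_f, np.ones((seq_len, d_head)))
```

Nothing here calls an SVD or any other factorization routine: the low-rank structure is baked into the parameterization, and training would update `a` and `b` directly using gradients like the ones `project_backward` returns.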