by aik 920 days ago
Do you have an example of where these methods still produce good summaries? E.g., if you change how self-attention is recomputed between token generations in autoregressive decoding, so that significantly less computation is needed?