	by jerpint 1489 days ago
	The basic idea is to keep a q, k, v cache of all previously seen tokens that gets updated over time. The transformer can either do ordinary self-attention (ignoring the cache) or attend to entries in the cache, which lets it look back at previously seen tokens. They mainly apply this to long documents; I'd be very curious to see a follow-up on time-dependent tasks like video.
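
To make the idea concrete, here is a rough NumPy sketch of how I picture it, not the paper's actual code. Everything named here (`KVCache`, `cached_attention`, the scalar `gate`, `max_size`) is my own invention for illustration. I assume the cache only needs keys and values (queries come from the current segment), that the oldest entries are dropped when the cache is full, and that a fixed scalar stands in for whatever learned mechanism lets the model choose between the two paths. Causal masking is left out to keep it short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Plain scaled dot-product attention (no causal mask, for brevity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

class KVCache:
    """Hypothetical cache of keys/values from previously seen tokens."""

    def __init__(self, dim, max_size):
        self.keys = np.zeros((0, dim))
        self.values = np.zeros((0, dim))
        self.max_size = max_size

    def update(self, k, v):
        # Append the new entries and keep only the most recent max_size rows.
        self.keys = np.concatenate([self.keys, k])[-self.max_size:]
        self.values = np.concatenate([self.values, v])[-self.max_size:]

def cached_attention(q, k, v, cache, gate):
    # gate = 0 -> pure self-attention on the current segment (cache ignored)
    # gate = 1 -> attend only to the cache
    # In a real model the gate would be learned; here it is a fixed scalar.
    local = attend(q, k, v)
    if len(cache.keys) == 0:
        out = local  # nothing cached yet, so fall back to self-attention
    else:
        memory = attend(q, cache.keys, cache.values)
        out = gate * memory + (1.0 - gate) * local
    cache.update(k, v)  # the current segment becomes memory for the next one
    return out
```

You would call `cached_attention` once per segment of a long document, reusing the same `KVCache` across calls, so each segment can look back at earlier ones without recomputing them.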