| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by DARSHANFOFADIYA 155 days ago
	As we scale to 1MN context length (inference) the biggest bottleneck is memory and to tackle that at scale we pay the price of communication overhead. Now fortunately the gpus are smartly fetching data for the next step while the previous step is computing thus masking the communication overhead and keeping responses at such scale appear realistic. The quality degradation as context length increaes is a whole another science problem