by DARSHANFOFADIYA 108 days ago
As we scale inference to 1M-token context lengths, the biggest bottleneck is memory, and tackling that at scale means paying a price in communication overhead. Fortunately, GPUs can fetch the data for the next step while the current step is still computing, which masks much of that communication overhead and keeps response times at this scale practical.

The quality degradation as context length increases is a whole other science problem.