| HN Mirror

Batch processing scales quadratically with the context size (assuming OpenAI is still using standard transformer architecture) but the batch processing of the prompt is also fast compared to generating tokens because it's batched (parallel). So I wouldn't expect effective response times to go up quadratically. At most linearly, depending on the details of how they implement inference.