
I think the work on multi-token prediction[0] could make byte-level models a lot more practical. The model is trained to predict several future tokens at once from a shared trunk, and at inference those extra predictions can be used to emit more than one token per forward pass, which goes straight at the efficiency concerns raised about byte-level models.

Emitting several tokens per step could cut inference time substantially, especially for outputs that are very long at the byte level (like images). That could help with the bottleneck mentioned in the parent comment about generating a 1000x1000 image.
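To put rough numbers on it, here is a back-of-the-envelope sketch. It assumes a raw 1000x1000 RGB image at one byte per channel, and the best case where every extra predicted byte is accepted. Both are my assumptions, not figures from the paper, and `decoding_steps` is just a name made up for illustration.

```python
import math


def decoding_steps(total_tokens: int, tokens_per_step: int) -> int:
    """Best-case count of sequential forward passes.

    Assumes every one of the `tokens_per_step` predictions is accepted,
    which real decoding will not always achieve.
    """
    if total_tokens < 0 or tokens_per_step < 1:
        raise ValueError("need total_tokens >= 0 and tokens_per_step >= 1")
    return math.ceil(total_tokens / tokens_per_step)


# Assumption: uncompressed RGB, one byte per channel.
image_bytes = 1000 * 1000 * 3

for k in (1, 4, 8):
    print(f"{k} byte(s) per step: {decoding_steps(image_bytes, k):,} forward passes")
# 1 byte(s) per step: 3,000,000 forward passes
# 4 byte(s) per step: 750,000 forward passes
# 8 byte(s) per step: 375,000 forward passes
```

So even in the best case it is still hundreds of thousands of sequential steps, but a constant-factor cut like that is the kind of thing that moves byte-level generation closer to usable.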

[0] https://ar5iv.labs.arxiv.org/html/2404.19737