by nodja 698 days ago
Somewhat inefficient for text, very inefficient for images, especially if you work in pixel space. The largest context window a model has been trained with today is 1M tokens, and that takes up a lot of memory. Even if context were not an issue, generating a 1000x1000 image at one token per pixel would take ~3 hours at 100 tokens/s inference.
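
A quick back-of-the-envelope check of the ~3 hour figure, as a minimal sketch. The function name and the `tokens_per_pixel` parameter are hypothetical, introduced only for illustration; the ~3 hour estimate matches one token per pixel, and the three-tokens-per-pixel case (one byte per RGB channel) is an added assumption to show how the number scales.

```python
def generation_time_hours(width, height, tokens_per_pixel, tokens_per_second):
    """Hours needed to emit every token of an image sequentially."""
    total_tokens = width * height * tokens_per_pixel
    return total_tokens / tokens_per_second / 3600


# One token per pixel: the case that matches the ~3 hour estimate.
one_token = generation_time_hours(1000, 1000, 1, 100)

# Assumption: one byte-token per RGB channel, so three tokens per pixel.
rgb_bytes = generation_time_hours(1000, 1000, 3, 100)

print(f"1 token/pixel:  {one_token:.2f} h")   # 2.78 h
print(f"3 tokens/pixel: {rgb_bytes:.2f} h")   # 8.33 h
```

So the estimate holds at one token per pixel, and a raw RGB byte stream would be roughly three times slower.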

Google has trained a byte-level encoder/decoder LLM called ByT5[1].

[1] https://huggingface.co/google/byt5-xxl

1 comment

I think the work on multi-token prediction[0] could be a significant development that makes byte-level models more practical. The approach lets the model predict several future tokens in a single forward pass, which could address the efficiency concerns raised about byte-level models.

By predicting multiple tokens simultaneously, it could significantly speed up inference, especially for tasks that generate large amounts of data (like images). That could ease the bottleneck the parent comment describes for generating a 1000x1000 image.
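
To put rough numbers on that, here is a minimal sketch of the best case. It assumes every token predicted in a step is kept and that the number of decoding steps per second stays at 100 regardless of how many tokens each step emits; both are optimistic assumptions of mine, not results from the paper, and `decode_hours` is a hypothetical helper.

```python
import math


def decode_hours(total_tokens, tokens_per_step, steps_per_second):
    """Idealized decode time when each step emits tokens_per_step tokens."""
    steps = math.ceil(total_tokens / tokens_per_step)
    return steps / steps_per_second / 3600


total = 1000 * 1000  # 1000x1000 image, one token per pixel

for k in (1, 2, 4, 8):
    # k = 1 is ordinary next-token decoding.
    print(f"{k} token(s)/step: {decode_hours(total, k, 100):.2f} h")
```

Under those assumptions the time falls linearly with the number of tokens per step, so it is an upper bound on the speedup rather than an expected value.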

[0] https://ar5iv.labs.arxiv.org/html/2404.19737