| HN Mirror



	by visarga 1269 days ago
	It's not complicated to explain. The model can handle 4000 tokens at once, so all you can do is work within the limits of this window. You can use part of it to quote the previous interactions and part of it for the response. If your content is too large, you need to summarise it; there are AIs for that too. If the output is too large, you need to split it across multiple rounds. It is pretty hard to work around this limitation, for example to write a whole novel. I think we need LLMs capable of reading a whole book at once, about 100K tokens. I hope they can improve the memory system of transformers a bit; there are ideas and papers, but they don't seem to get attention.
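	To make the window budgeting above concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: `count_tokens` counts whitespace-separated words as a stand-in for a real tokenizer, and `build_prompt`, `window`, and `reply_budget` are hypothetical names, not any particular API. It reserves part of the window for the response and fills the rest with the most recent turns; whatever gets dropped is what you would have to summarise.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    # A real tokenizer would give different (usually larger) counts.
    return len(text.split())


def build_prompt(history, new_message, window=4000, reply_budget=1000):
    """Keep the newest turns that fit once room for the reply is reserved.

    `window` and `reply_budget` are hypothetical parameters: the total
    context size and the share of it held back for the model's response.
    """
    budget = window - reply_budget - count_tokens(new_message)
    kept = []
    # Walk backwards so the most recent turns are kept first.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break  # older turns no longer fit; they would need summarising
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept + [new_message]
```

	With a tiny window of 10 "tokens" and 3 reserved for the reply, only the newest turn survives alongside the new message; with the default 4000-token window a short history is kept whole.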

2 comments

dpflan 1269 days ago

Is there a "law of tokens" growth for LLMs, à la Moore's Law, but for LLM capabilities based on token capacity?


visarga 1269 days ago

Complexity is quadratic in sequence length. For 512 tokens that is 262K pairwise attention scores, but for 4000 tokens it becomes 16M and goes OOM on a single GPU. We need about 100K-1M tokens to load whole books at once.

Since 2017 there have been hundreds of attempts to bring O(N^2) down to O(N), but none of them has replaced vanilla attention in large models yet. They lose on accuracy. Maybe FlashAttention has a shot (https://arxiv.org/abs/2205.14135).
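
The quadratic growth quoted above can be checked with a few lines of Python. This is a rough sketch, not a memory model of any real system: it counts the entries of a single N x N attention score matrix, and the 2 bytes per entry (fp16) is an assumption. Actual memory use also depends on the number of heads and layers, the batch size, and the attention implementation. The function names are made up for illustration.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


def score_matrix_bytes(seq_len: int, bytes_per_entry: int = 2) -> int:
    # Assumption: fp16 storage (2 bytes per entry), one head, one layer.
    return attention_score_entries(seq_len) * bytes_per_entry


if __name__ == "__main__":
    for n in (512, 4000, 100_000):
        entries = attention_score_entries(n)
        print(f"{n:>7} tokens: {entries:>14,} entries, "
              f"{score_matrix_bytes(n):>14,} bytes per head per layer")
```

Going from 512 to 4000 tokens multiplies the matrix size by about 61, and a 100K-token book would need 10 billion entries per head per layer if the matrix were materialised in full.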


Workaccount2 1269 days ago

Sure, that is ChatGPT in late 2022.

What about Open.ChatGPT in mid 2024?
