HN Mirror


by lappa 1024 days ago

"For example, HyperAttention makes the inference time of ChatGLM2 50% faster on 32k context length while perplexity increases from 5.6 to 6.3."

"when half of all attention layers are patched (i.e., 14 layers), we verify that most of the tasks do not degrade more than 13%."

According to the paper, it reduces benchmark scores substantially on most tasks, perhaps to the point where a smaller model would give both faster inference and higher benchmark scores.

However, summarization benchmarks see almost no degradation, great!


janalsncm 1023 days ago

Smaller models will likely not have 32k context windows.


brucethemoose2 1023 days ago

Why is that?


keonix 1023 days ago

I assume it's because such a large context takes a lot of memory, so you might as well use a smarter model if you're not going to fit in a small amount of VRAM anyway.
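To put rough numbers on the memory point, here is a back-of-the-envelope sketch of KV-cache size for a single sequence. The model shapes are assumptions for illustration, not figures from the thread or the paper: a 7B-class model with 32 layers, head dimension 128, fp16 cache entries, and either 32 KV heads (full multi-head attention) or 8 KV heads (grouped-query attention). `kv_cache_bytes` is a hypothetical helper, and the estimate ignores weights, activations, and any sliding-window or paging tricks a real backend might use.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate bytes needed to cache keys and values for one sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 2**30

# Assumed, illustrative shapes (not taken from the thread).
# Full multi-head attention, 32k-token context, fp16 cache.
full_mha_32k = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32 * 1024)
# Grouped-query attention with 8 KV heads, 8k-token context, fp16 cache.
gqa_8k = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8 * 1024)

print(f"{full_mha_32k / GIB:.1f} GiB")  # 16.0 GiB
print(f"{gqa_8k / GIB:.1f} GiB")        # 1.0 GiB
```

Under these assumptions the cache grows linearly with context length, so a 32k window on a full-attention 7B-class model can cost more memory than the fp16 weights themselves, while fewer KV heads and a shorter window keep it modest.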


brucethemoose2 1023 days ago

Personally, I have found that Mistral 7B (with its native 8K context, and decent results when stretched out even further) performs much better than Llama 13B tunes for storytelling, where that long context is really important.

And I think the optimized backends should implement that sliding 16k context soon...

Anyway, the point is that a huge context really helps certain types of queries, and VRAM usage is reasonable with a 7B model.
