by lappa
976 days ago
"For example, HyperAttention makes the inference time of ChatGLM2 50% faster on 32k context length while perplexity increases from 5.6 to 6.3."

"when half of all attention layers are patched (i.e., 14 layers), we verify that most of the tasks do not degrade more than 13%."

According to the paper, it reduces benchmark scores substantially on most tasks, perhaps to the point where a smaller model would give both better inference time and higher benchmark scores. Summarization benchmarks, however, see almost no degradation, which is great!