Hacker News
by
edg5000
81 days ago
So limiting the max context length also reduces VRAM needs a bit? If the KV cache is 20% of total memory, capping context at 1/10th of the original length would shrink the cache to 2% and mean an 18% reduction in total memory.
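The arithmetic above can be checked with a quick sketch. It assumes the KV cache scales linearly with context length and that everything else (weights, activations) stays fixed; the function name and parameters are made up for illustration.

```python
def total_memory_reduction(cache_fraction: float, context_scale: float) -> float:
    """Fraction of total memory saved when context is capped.

    cache_fraction: share of total memory used by the KV cache (0..1).
    context_scale:  new max context divided by the old one (0..1).
    Assumes the cache shrinks linearly with context and nothing else changes.
    """
    if not 0.0 <= cache_fraction <= 1.0:
        raise ValueError("cache_fraction must be between 0 and 1")
    if not 0.0 <= context_scale <= 1.0:
        raise ValueError("context_scale must be between 0 and 1")
    return cache_fraction * (1.0 - context_scale)


# Cache is 20% of total, context capped at 1/10th of the original length.
saving = total_memory_reduction(0.20, 0.10)
print(f"{saving:.0%} of total memory saved")
```

So the saving is bounded by whatever share the cache had to begin with: with a 20% cache you can never save more than 20%, no matter how short the context.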
valine
81 days ago
Yup exactly, in principle it helps both with inference speed, by reducing memory bandwidth usage, and with the memory footprint of your kvcache.
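To put rough numbers on the footprint part, here is a minimal sketch of the usual KV cache size estimate for a standard transformer decoder: one key and one value vector per layer, per KV head, per token. The model shape below (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical example, not any particular model from this thread.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_elem: int = 2,  # fp16/bf16; assumption for this example
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size: a key and a value per layer, KV head, and token."""
    keys_and_values = 2
    return (
        keys_and_values
        * num_layers
        * num_kv_heads
        * head_dim
        * context_len
        * bytes_per_elem
        * batch_size
    )


# Hypothetical model shape, chosen only to make the numbers round.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, context_len=8192)
capped = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, context_len=1024)

print(f"8192 tokens: {full / 2**20:.0f} MiB")
print(f"1024 tokens: {capped / 2**20:.0f} MiB")
```

The size is linear in context length, so capping context at 1/8th cuts the cache to 1/8th. During decoding each new token attends over the cached keys and values, so a smaller cache also means fewer bytes read per step, which is where the bandwidth saving comes from.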