by simonw
505 days ago
Huh! I had incorrectly assumed that was for output, not input. Thanks! YES, that was it:

  files-to-prompt \
    ~/Dropbox/Development/llm \
    -e py -c | \
    llm -m q1m 'describe this codebase in detail' \
    -o num_ctx 80000

I was watching my memory usage and it quickly maxed out my 64GB, so I hit Ctrl+C before my Mac crashed.

1M tokens will definitely require a lot of KV cache memory. One way to reduce the memory footprint is KV cache quantization, which was recently added to Ollama behind a flag [3]. With 4-bit KV cache quantization the cache shrinks to roughly a quarter of its f16 size (OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve).

[1] https://arxiv.org/pdf/2309.06180
[2] https://github.com/microsoft/vattention
[3] https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...
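
To put rough numbers on the "quarter of the footprint" point, here is a back-of-the-envelope sketch of KV cache size at the 80,000-token context used above. The model shape (layers, KV heads, head dimension) is a hypothetical placeholder, not the actual configuration of the model in this thread, and the bits-per-element figures assume the usual ggml block layouts (32 values plus one f16 scale per block). It only counts the cache itself, not weights or compute buffers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_element):
    """Approximate KV cache size in bytes.

    One key vector and one value vector (hence the factor of 2) are stored
    per layer, per KV head, per token position.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * bits_per_element / 8


# Hypothetical model shape, chosen only for illustration.
N_LAYERS = 28
N_KV_HEADS = 4
HEAD_DIM = 128
CONTEXT_LEN = 80_000  # matches the num_ctx value in the command above

# Effective bits per element, assuming ggml-style blocks of 32 values
# with one f16 scale each: q8_0 = 34 bytes/block, q4_0 = 18 bytes/block.
BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

sizes = {
    name: kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN, bits)
    for name, bits in BITS.items()
}

for name, size in sizes.items():
    ratio = size / sizes["f16"]
    print(f"{name}: {size / 2**30:.2f} GiB ({ratio:.0%} of f16)")
```

Under these assumptions q4_0 comes out a little above a quarter of f16, because each block also stores a scale. Swap in the real layer and head counts from the model's config to get a figure for a specific model.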