|
|
|
|
|
by moffkalast
680 days ago
|
|
Well that and you also need a fair bit more space for the KV cache which can be a bit unpredictable. Models without GQA, flash attention or 4 bit cache support are really terrible in that regard, plus it depends on context length. Haven't found a good rule of thumb for that yet. |
|