|
|
|
|
|
by sqreept
846 days ago
|
|
Tried inference with the 7B model and without flash attention this is soooooo slow. With flash attention the fine-tunning requires A100 or H100.
Also the inference doesn't always stop generating resulting in garbage being added to the response. |
|
That sounds like a chat format misconfiguration.
This could partially be Google's fault, as they used yet another novel prompting format.
Also, for sane inference speed on H100s, you'll have to wait for architecture support from the optimized frameworks. Vanilla transformers is beyond awful even with FA2.