by brucethemoose2
852 days ago
> Also the inference doesn't always stop generating resulting in garbage being added to the response.

That sounds like a chat format misconfiguration. This could partially be Google's fault, as they used yet another novel prompting format.

Also, for sane inference speed on H100s, you'll have to wait for architecture support from the optimized frameworks. Vanilla transformers is beyond awful even with FA2.
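To illustrate the kind of misconfiguration I mean: if the runtime doesn't know the model's end-of-turn marker, it keeps sampling past the real answer. Here is a rough sketch of a client-side stopgap that cuts the output at the first stop marker. The marker strings and the function name `truncate_at_stop` are made up for illustration, not the actual tokens of any particular model; the proper fix is still configuring the correct chat template and stop token in the inference stack.

```python
from typing import Iterable


def truncate_at_stop(text: str, stop_markers: Iterable[str]) -> str:
    """Return `text` cut at the earliest occurrence of any stop marker.

    `stop_markers` holds hypothetical placeholders; substitute whatever
    end-of-turn strings your model's chat format actually defines.
    """
    cut = len(text)
    for marker in stop_markers:
        if not marker:
            # An empty marker would match at index 0 and erase everything.
            continue
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


if __name__ == "__main__":
    # "<END>" is an assumed placeholder marker, not a real model token.
    raw = "The capital of France is Paris.<END>user\nrandom trailing garbage"
    print(truncate_at_stop(raw, ["<END>"]))
```

This only hides the symptom: the model still spends time generating the garbage, so passing the right stop token to the generator is what actually saves the compute.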