Hacker News new | ask | show | jobs
by brucethemoose2 876 days ago
I've had less luck with Mixtral, but I run Yi 34B finetunes for general personal use, including quick queries for work.

Its kinda like GPT 3.5, with no internet access and slightly less reliable responses, but unrestrained, much faster and with a huge (up to 75K on my Nvidia 3090) usable context.

Mixtral is extremely fast though, at least at a batch size of 1.

1 comments

Which Yi 34B finetunes are you using that have a 75,000 token length?
All of the Yi 200K finetunes should support it, but you have to be careful because some degrade the base model's quite excellent long context performance more than others. The very strong Bagel 34B DPO model, for instance, basically doesn't work at long context.

Nous Capybara is a popular one. I personally use my own merge of many models, and you can look through the constituent models to see if any interest you: https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megame...

You can't really use llama.cpp for super long context btw, its just too slow and vram inefficient at the moment.