Hacker News
by malux85 880 days ago
I love ollama. The engine underneath is llama.cpp, and they have the first version of self-extend about to be merged into main, so with any luck it will be available in ollama soon too!

A lot of the new models coming out are long context anyway. Check out Yi, InternLM and Mixtral.

Also, you really want to wait until flash attention is merged before using very long contexts with llama.cpp. The 8-bit KV cache would be ideal too.
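
To put rough numbers on why the KV cache matters at long context, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128) and the 32k context are assumptions picked for illustration, not the config of any specific model mentioned above, and the 8-bit figure ignores the small per-block scale overhead a real quantized cache would carry.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Approximate KV cache size for a standard transformer decoder."""
    # Factor of 2: each layer stores one K tensor and one V tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
ctx = 32768  # assumed context length in tokens

fp16 = kv_cache_bytes(**cfg, context_len=ctx, bytes_per_elem=2)
int8 = kv_cache_bytes(**cfg, context_len=ctx, bytes_per_elem=1)

GIB = 2**30
print(f"fp16 KV cache:  {fp16 / GIB:.1f} GiB")
print(f"8-bit KV cache: {int8 / GIB:.1f} GiB")
```

Under those assumptions the cache halves when going from 16-bit to 8-bit elements, and it grows linearly with context length either way, which is why it starts to dominate memory once you push the context out.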