Hacker News
Show HN: oLLM – LLM Inference for large-context tasks on consumer GPUs (github.com)
3 points by anuarsh 295 days ago
4 comments

20 minutes is a huge turnoff, unless you let it run overnight... just to get the hint in the morning that you should exercise self-care, when you gave it a legal paper and asked the AI to check it for flaws.
We are talking about a 100k context here. A 20k context would be much faster, but you wouldn't need KVCache offloading for it.
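A rough back-of-the-envelope sketch of why context length decides whether KV cache offloading is needed. The model dimensions below (32 layers, 8 KV heads, head size 128, fp16 values) are hypothetical defaults for an 8B-class model with grouped-query attention, not figures from the oLLM project.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size for one sequence.

    All default dimensions are assumptions for illustration only.
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


for tokens in (20_000, 100_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumed dimensions the cache is about 2.4 GiB at 20k tokens and about 12.2 GiB at 100k tokens, which is the difference between fitting next to the weights on a small consumer GPU and having to spill the cache elsewhere.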
It's better to have software strip all private details from the text and swap in placeholders, have a cloud AI check the redacted version, and then replace the placeholders with the original details locally on your own hard drive.
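A minimal sketch of that redact-then-restore idea, assuming only two kinds of private detail (email addresses and phone numbers) found with simple regular expressions. The patterns, the placeholder format and the function names are all hypothetical; a real legal document would need far broader detection (names, addresses, case numbers), and the cloud step itself is left out.

```python
import re

# Hypothetical patterns for illustration; real redaction needs much more coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact(text):
    """Replace private details with placeholders; return (redacted_text, mapping)."""
    mapping = {}  # placeholder -> original value, kept only on the local machine

    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            value = match.group(0)
            # Reuse the same placeholder when a value appears more than once.
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder
            placeholder = f"[{label}_{len(mapping) + 1}]"
            mapping[placeholder] = value
            return placeholder

        text = pattern.sub(substitute, text)
    return text, mapping


def restore(text, mapping):
    """Put the original values back into text returned by the remote service."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


original = "Contact jane.doe@example.com or +1 555-123-4567."
redacted, mapping = redact(original)
# Only `redacted` would be sent to the cloud; `mapping` never leaves the machine.
assert restore(redacted, mapping) == original
```

The mapping is the sensitive part here, so it stays on local disk. This approach only works if the remote model returns the placeholders unchanged, which is an assumption worth verifying before relying on it.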
"~20 min for the first token" might turn off some people. But it is totally worth it to get such a large context size on puny systems!
Absolutely, there are tons of cases where an interactive experience is not required, but the ability to process a large context to get insights is.
It would be interesting to see some benchmarks of this vs, for example, Ollama running locally with no timeout.
Hi everyone, any comments or questions are appreciated