by psytrx 858 days ago
In addition to the initial loading time noted by the other posters:

You may want to use the same inference engine or even the same LLM for multiple purposes in multiple applications.

Another huge factor, in my opinion, is getting your machine, environment and OS into a state that can run the models efficiently. That wasn't trivial for me. Putting all this complexity inside a container (and therefore a "server") helps tremendously, both a) in setting everything up initially and b) in keeping up with the constant stream of improvements and updates.