Hacker News new | ask | show | jobs
by deckar01 1022 days ago
The docker container fails installing flash-attn… but honestly a giant API container on top of a custom model generation framework loses all the benefits of Torch’s standard interfaces. It doesn’t really matter how optimized your model runtime is if it’s cemented into a synchronous monolith. The metric that should be optimized is time to first decoded token, because that is how speed is perceived by humans reading the output.
1 comments

Can you share details of the build failure on the github? We'll try to help.

The inference code is shared as a proof of concept, it is not meant to be a production ready deploy. Also worth noting that not all LLMs are used to produce text which is read by humans.

https://github.com/persimmon-ai-labs/adept-inference/issues/...

It’s funny you say production, because all of the errors I ran into suggest the container is expecting your production architecture.

My advice is stream first then make synchronous convenience wrappers on top of that. Also, lean on community standards for PoC. I’m guessing your investors are interested in making this scale as cheaply as possible, but that is probably the least important feature for people evaluating your model’s quality locally.