by kanyesrthaker 1751 days ago

Hi, Kanyes here from Ferret. I'll start the discussion by sharing an unsolved technical hurdle that may be of interest. Early in development, we decided to run all inference on CPU, to avoid high production costs and the inefficiency of processing single inputs instead of batches.

Sequence-to-sequence models like T5 tend to be large (over 300 MB), and we observed high latency of approximately 8 s per inference. We've masked this latency on the frontend, mainly by sending concurrent requests with async code (4 at a time) and preloading content early. However, this is kind of hacky, and ideally we'd want to reduce the inference time itself.
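For anyone curious what "4 at a time" can look like, here is a minimal sketch of capping concurrent requests with a semaphore. It is written in Python purely for illustration and is not our actual frontend code; `fake_inference`, `run_bounded`, and `MAX_CONCURRENT` are hypothetical names, and the sleep stands in for the real model call.

```python
import asyncio

MAX_CONCURRENT = 4  # assumption: mirrors the "4 at a time" limit mentioned above


async def fake_inference(text: str) -> str:
    """Hypothetical stand-in for a slow inference request."""
    await asyncio.sleep(0.01)  # placeholder for the real network/model latency
    return text.upper()


async def run_bounded(inputs, limit=MAX_CONCURRENT):
    """Run one request per input, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def worker(item):
        async with sem:  # blocks while `limit` requests are already running
            return await fake_inference(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(worker(i) for i in inputs))


results = asyncio.run(run_bounded([f"doc {n}" for n in range(10)]))
```

This only hides latency behind parallelism; each individual request still takes just as long.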

To this end, we've demonstrated a roughly 1.7x speedup by converting our PyTorch model weights to a quantized ONNX graph. However, we've found a lot of friction in trying to deploy ONNX graphs to AWS. We understand there are a variety of potential solutions (training smaller distilled models, deploying ONNX, revisiting our rationale for using CPU, etc.), so we're looking for suggestions on the best way to make inference faster!
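To put the numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes the 1.7x speedup applies uniformly to the 8 s baseline and that 4 concurrent requests are served fully in parallel with no slowdown, which may well not hold on a shared CPU; `wall_time` is a hypothetical helper, not something from our codebase.

```python
import math

BASELINE_LATENCY_S = 8.0  # approximate per-inference latency quoted above
SPEEDUP = 1.7             # rough speedup from the quantized ONNX graph
CONCURRENCY = 4           # concurrent requests sent by the frontend


def wall_time(n_requests: int, latency_s: float, concurrency: int) -> float:
    """Idealized total time: requests complete in waves of `concurrency`."""
    waves = math.ceil(n_requests / concurrency)
    return waves * latency_s


# Estimated per-inference latency after quantization (about 4.7 s)
quantized_latency_s = BASELINE_LATENCY_S / SPEEDUP

# Idealized time to serve 8 requests, before and after quantization
before = wall_time(8, BASELINE_LATENCY_S, CONCURRENCY)
after = wall_time(8, quantized_latency_s, CONCURRENCY)
```

Even in this idealized case, a single request still takes several seconds, which is why we're interested in further reductions.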