by yeldarb 942 days ago
Pretty neat! We've been using Lambda to serve low-volume CV models at Roboflow for a couple of years (and my understanding is that AWS's SageMaker Serverless is a Lambda wrapper), and it is really good for low-volume and bursty use cases. The latency is surprisingly not bad. It gets really expensive relative to GPUs under high load, though (especially predictable high load like monitoring security cameras 24/7), so we end up with our biggest enterprise customers running things in a Kubernetes cluster.

There are a few serverless GPU companies like Banana.dev and Modal; I really want to give them a shot. Anyone have experience using them in prod?


We've been building with Modal over the past few months (though no prod-scale tests yet) and were slightly disappointed by very long (10-20 second) cold start times. In the long term we're more interested in inference servers that use compiled/optimized models instead of running plain old PyTorch (which adds another few seconds to cold start on its own).
We are adding support for inference servers to Pipeless. We started with the ONNX Runtime and its OpenVINO, CoreML, CUDA and TensorRT execution providers. Some people suggested that I also integrate with the Triton server, but I still need to dig into that and check its license. The good part is that there is no cold start right now, at the cost of having some resources allocated from node start.
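
For anyone wondering what picking an ONNX Runtime execution provider can look like in practice, here is a minimal sketch. It is not Pipeless's actual code: the preference order, `pick_providers`, and `load_session` are hypothetical names made up for illustration. The only ONNX Runtime calls used are `get_available_providers()` and `InferenceSession(..., providers=...)`.

```python
# Assumed preference order (fastest accelerator first); adjust for your hardware.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "OpenVINOExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available, preferred=PREFERRED):
    """Return the preferred providers that are actually available, in order.

    Falls back to the CPU provider when none of the preferred ones are present.
    """
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


def load_session(model_path):
    """Create the session once at startup so requests never pay the load cost."""
    import onnxruntime as ort  # imported lazily so pick_providers works without it

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Calling something like `load_session` when the node starts is what trades memory held up front for the absence of a cold start on the first request.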