by nl 2661 days ago
This claim (3x faster than TF serving) and the metrics on the site (~500 predictions per second vs ~200 for TF serving) seem more a function of scaling than any technology.

Given that you can horizontally scale model prediction infinitely, the only sensible way to compare is to include price.

I agree that this looks compelling while it is free! But will it be price competitive later?

And if price competitiveness is claimed, then how is it possible? Yes, you can do the whole spot instance thing, but that is difficult to make reliable enough at scale.
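
To make the price point concrete, here is a rough sketch of normalizing throughput by machine cost. The 500 and 200 predictions per second are the figures quoted on the site; the hourly prices are made-up placeholders, and `cost_per_million` is just an illustrative helper name.

```python
def cost_per_million(predictions_per_second: float, dollars_per_hour: float) -> float:
    """Dollar cost of serving one million predictions on a single machine."""
    if predictions_per_second <= 0:
        raise ValueError("throughput must be positive")
    predictions_per_hour = predictions_per_second * 3600
    return dollars_per_hour / predictions_per_hour * 1_000_000


# Throughputs come from the thread; the hourly prices are hypothetical.
panini = cost_per_million(500, dollars_per_hour=0.30)
tf_serving = cost_per_million(200, dollars_per_hour=0.10)

print(f"Panini:     ${panini:.4f} per million predictions")
print(f"TF Serving: ${tf_serving:.4f} per million predictions")
```

With these placeholder prices the slower server comes out cheaper per prediction, which is why raw predictions per second alone doesn't settle the comparison.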


Hey, the predictions for both TF Serving and Panini were run in a single thread on machines with the same specification. We used a simple image classification model on the CIFAR dataset. Roughly 500 predictions per second were made with Panini and 200 with TF Serving.

You can always install all of Panini on your own private server and not pay anything, e.g. use Helm to install it in your own Kubernetes cluster, or get it from DockerHub. For now, we're making it free for models under 2GB. Our main goal is to make it usable and we don't want cost to be a factor.

So you claim that TF Serving (written in C++ I believe) has over double the overhead compared to Panini?

This seems surprising. What makes it so much faster?

Edit: Unless of course you are hitting the cache for a lot of the predictions?

An optimized TF Serving would perform similarly to Panini; however, it's really hard to find good documentation on optimizing TF Serving's compilation parameters. Panini automatically finds the batch size that maximizes throughput and adapts it as conditions change. We also have a technique to reduce and bound tail latency. I would love for you to try it and give me some feedback. Thanks
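
For readers wondering what adaptive batch sizing can look like, here is a minimal sketch of one common approach: additive increase, multiplicative decrease (AIMD) against a latency target. This is not Panini's actual algorithm, which isn't described in this thread; `AimdBatchSizer`, `fake_latency_ms`, and all the numbers are assumptions for illustration only.

```python
class AimdBatchSizer:
    """Grow the batch size while latency stays under a target, back off when it doesn't."""

    def __init__(self, latency_target_ms: int, start: int = 1, step: int = 1,
                 backoff: float = 0.5, max_size: int = 256):
        self.latency_target_ms = latency_target_ms
        self.size = start
        self.step = step
        self.backoff = backoff
        self.max_size = max_size

    def update(self, observed_latency_ms: int) -> int:
        """Adjust the batch size from the latency of the last batch and return it."""
        if observed_latency_ms <= self.latency_target_ms:
            # Under the target: probe a slightly larger batch.
            self.size = min(self.size + self.step, self.max_size)
        else:
            # Over the target: cut the batch size sharply, never below 1.
            self.size = max(1, int(self.size * self.backoff))
        return self.size


def fake_latency_ms(batch_size: int) -> int:
    """Hypothetical model: 2 ms fixed overhead plus 1 ms per item in the batch."""
    return 2 + batch_size


sizer = AimdBatchSizer(latency_target_ms=20)
history = []
for _ in range(50):
    latency = fake_latency_ms(sizer.size)  # stand-in for timing a real batch
    history.append(sizer.update(latency))

print(history)
```

In this toy setup the largest batch that meets the 20 ms target is 18 items, so the sizer climbs to 19, overshoots, halves to 9, and climbs again. A real server would measure latency from actual batches rather than a formula.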