Show HN: Panini AI – A platform to serve ML/DL models at low latency (panini.ai)
29 points by avin_regmi 2660 days ago
3 comments

This claim (3x faster than TF serving) and the metrics on the site (~500 predictions per second vs ~200 for TF serving) seem more a function of scaling than any technology.

Given that you can scale model prediction horizontally almost without limit, the only sensible way to compare is to include price.

I agree that this looks compelling while it is free! But will it be price competitive later?

And if price competitiveness is claimed, then how is it possible? Yes, you can do the whole spot instance thing, but that is difficult to make reliable enough at scale.

Hey, the predictions for both TF Serving and Panini were run in a single thread on a machine with the same specification. We used a simple image classification model on the CIFAR dataset. Panini made roughly 500 predictions per second and TF Serving roughly 200.

You can always download all of Panini to your own private server and not pay anything, e.g. use Helm to install it in your own Kubernetes cluster, or get it from DockerHub. For now, we're making it free for models under 2GB. Our main goal is to make it usable and we don't want cost to be a factor.

So you claim that TF Serving (written in C++ I believe) has over double the overhead compared to Panini?

This seems surprising. What makes it so much faster?

Edit: Unless of course you are hitting the cache for a lot of the predictions?

An optimized TF Serving would perform similarly to Panini; however, it's really hard to find good documentation on optimizing TF Serving's compilation parameters. Panini automatically finds the batch size that maximizes throughput and adjusts it adaptively. We also have a technique to bound tail latency. I would love for you to try it and give me some feedback. Thanks.
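
A minimal sketch of what "adaptively finding the batch size" could look like, assuming an additive-increase/multiplicative-decrease (AIMD) rule. The thread does not say which rule Panini actually uses; the function names, the latency budget, and the linear cost model below are all made up for illustration.

```python
def next_batch_size(current, observed_latency_ms, latency_budget_ms,
                    step=2, backoff=0.9, max_size=512):
    """Return the batch size to try next, using an AIMD rule (an assumption)."""
    if observed_latency_ms <= latency_budget_ms:
        # Under budget: probe a slightly larger batch (additive increase).
        return min(current + step, max_size)
    # Over budget: shrink the batch (multiplicative decrease), never below 1.
    return max(1, int(current * backoff))


def simulate(latency_budget_ms=20.0, rounds=50):
    """Run the rule against a hypothetical model: 5 ms overhead + 0.5 ms per item."""
    size = 1
    history = []
    for _ in range(rounds):
        latency_ms = 5.0 + 0.5 * size  # made-up cost model, not a measurement
        history.append(size)
        size = next_batch_size(size, latency_ms, latency_budget_ms)
    return history


if __name__ == "__main__":
    print(simulate()[-6:])
```

Under this toy cost model the batch size climbs until a batch exceeds the latency budget, backs off, and then hovers just around the largest size that still fits.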
Is it actually 3x faster, or does it just scale more?

In other words, if I had a model that previously took three seconds to respond, would this platform respond in one second?

Sorry, I should've been more clear. The predictions for both TF Serving and Panini were run in a single thread on a machine with the same specification. We used a simple image classification model on the CIFAR dataset. Panini made roughly 500 predictions per second and TF Serving roughly 200. The graph on the website shows throughput. I'm planning to write a Medium post about the benchmark soon.
We changed the title from "Show HN: 3x Faster Than Tensorflow Serving" to what the page says, which is less baity.

https://news.ycombinator.com/newsguidelines.html

That's definitely an improvement, but I'm hoping someone from the Panini team will step in and clarify regardless.
This is so fishy looking on many levels!

It is very hard to believe that deploying your models on GKE is going to be cost saving for anyone involved.

Hey, you don't have to deploy on GKE, and it's not GKE that makes it faster. We also give you the option to deploy to your own private Kubernetes via Helm or to a private server via DockerHub. GKE may not be the right option for you depending on your application. Your feedback would be very valuable to us. Please tell me why you think it's fishy. We're always trying to make it better.
First of all, my experience is that the bottleneck is pure NN computation on either CPU or GPU, and what the server is written in has a negligible effect on performance. The bottleneck is not the web server, it's the raw computation. So right away your claim that your backend is written in C++ and therefore is fast makes not a lot of sense.

Then, you have caching. I actually fail to see how any caching at all is useful on a CPU bound task when you have unique inputs each time. This is just not something that is cacheable!

Batching may be one thing that can be helpful --- but typically requires deep modification of the model itself to support it, and no mention is made of that. Furthermore batching may help throughput but may make latency WORSE as you need to wait for multiple inputs before firing off a batch of computation.
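
To put rough numbers on the throughput-versus-latency trade-off described here, a back-of-the-envelope sketch. It assumes requests arrive evenly spaced and that a batch costs a fixed overhead plus a per-item cost; the function name and all the figures are hypothetical, not measurements of Panini or TF Serving.

```python
def batch_stats(batch_size, arrival_rate_per_s, fixed_ms, per_item_ms):
    """Return (peak throughput in predictions/s, worst-case latency in ms)."""
    # Time to run one batch through the model.
    compute_ms = fixed_ms + per_item_ms * batch_size
    # Predictions per second if the model is kept busy with full batches.
    throughput = batch_size / (compute_ms / 1000.0)
    # The first request in a batch waits for the other batch_size - 1 arrivals.
    fill_wait_ms = (batch_size - 1) / arrival_rate_per_s * 1000.0
    worst_latency_ms = fill_wait_ms + compute_ms
    return throughput, worst_latency_ms


if __name__ == "__main__":
    for size in (1, 4, 16):
        tput, lat = batch_stats(size, arrival_rate_per_s=200,
                                fixed_ms=5.0, per_item_ms=0.5)
        print(f"batch={size:2d}  throughput={tput:7.1f}/s  worst latency={lat:5.1f} ms")
```

With these made-up numbers, a batch of 16 reaches about 1230 predictions per second versus about 182 at batch size 1, but the first request in the batch waits about 88 ms instead of 5.5 ms.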

Then you fail to specify whether your model will run on a GPU or CPU, and what type / core count thereof.

So, a lot of this just doesn't make much sense from a computer science perspective. Add in the free pricing with no limits and you've got an eyebrow-raising product!

1. Caching the input will save a lot of time. Inputs are not unique each time; in a production environment, many inputs are the same. Many platforms in fact do caching, such as Algorithmia, TF Serving, and SageMaker. If a lookup in a Redis database is faster than a forward pass, caching will reduce the time dramatically. Watch my YouTube video, where I give an example.

2. It's up to you whether you run it on GPU or CPU. The benchmark was done on a CPU, but you're free to download Panini via Helm and use a GPU in your private Kubernetes.

3. For now, during beta testing, we're offering free inference, with the limit that model size cannot exceed 2GB.

Hope this was helpful.
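
A minimal sketch of the caching idea in point 1: hash the raw input and skip the forward pass on a repeat. The comment mentions Redis; to stay self-contained this uses an in-process dict as a stand-in, and `CachedPredictor` and `predict_fn` are hypothetical names, not Panini's API.

```python
import hashlib


class CachedPredictor:
    """Wrap a predict function with an exact-match cache keyed on the input bytes."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn  # stands in for the model's forward pass
        self._cache = {}               # stand-in for an external store such as Redis
        self.hits = 0
        self.misses = 0

    def predict(self, raw_input: bytes):
        # Identical inputs hash to the same key, so repeats never reach the model.
        key = hashlib.sha256(raw_input).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._predict_fn(raw_input)
        self._cache[key] = result
        return result


if __name__ == "__main__":
    predictor = CachedPredictor(lambda raw: len(raw))  # toy "model"
    for payload in (b"cat.png", b"dog.png", b"cat.png"):
        predictor.predict(payload)
    print(predictor.hits, predictor.misses)
```

This only pays off when byte-identical inputs actually recur and the lookup is cheaper than the forward pass; for inputs that are always unique, every call is a miss.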

I don’t know what sort of production you’ve been exposed to, but the inputs to a Deep Net are almost never the same.

We have hundreds of models across many domains: real estate, energy prediction, time series crypto, video analytics, molecular modelling.

I would bet money that across the millions of predictions that we make weekly, over all of the models, no two inputs are the same.

That’s kind of the point of Deep Learning - high dimensional noisy input

Caching will not help you here

It really depends on the application. In content recommendation, for example, predictions for popular items are requested frequently. We maintain a prediction cache so we can serve frequent queries without passing them through the model. We also use the cache for selecting a model. To do this, we join the original prediction with the feedback it receives. Because feedback arrives soon after the prediction, even unique queries can benefit from the cache. Most prediction models are not deep learning these days; most companies are using classical machine learning. In our case, with an SVM trained in scikit-learn, we saw a 1.8x improvement in feedback throughput. The cache uses simple LRU eviction, a standard cache eviction algorithm.
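
For reference, a minimal sketch of LRU (least recently used) eviction for a prediction cache, built on Python's `collections.OrderedDict`. The class name and capacity are hypothetical; this illustrates the general technique, not Panini's implementation.

```python
from collections import OrderedDict


class LRUPredictionCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._entries:
            return None
        # A hit counts as a use, so move the entry to the "most recent" end.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self._capacity:
            # Drop the entry that has gone longest without being used.
            self._entries.popitem(last=False)


if __name__ == "__main__":
    cache = LRUPredictionCache(capacity=2)
    cache.put("item-1", 0.9)
    cache.put("item-2", 0.4)
    cache.get("item-1")       # item-1 is now the most recently used
    cache.put("item-3", 0.7)  # evicts item-2
    print(cache.get("item-2"), cache.get("item-1"))
```

Popular items keep getting touched and so stay in the cache, which is why this policy suits the content recommendation case described in the comment.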
It sounds super fishy, especially since you won't answer the question above confirming whether you're talking about latency or throughput.

Google has also poured a lot into TensorFlow Serving, so if you are outperforming it by that much it would be great to know how.

Sorry, I should've been more clear. The predictions for both TF Serving and Panini were run in a single thread on a machine with the same specification. We used a simple image classification model on the CIFAR dataset. Panini made roughly 500 predictions per second and TF Serving roughly 200. The graph on the website shows throughput. I'm planning to write a Medium post about the benchmark soon. There are many other projects getting higher throughput than TF Serving. I've heard TF Serving can be tuned to be more efficient, but how to do that is not documented properly. We're planning to make Panini open source if there is enough interest from the community!
What is your business model if your platform is free? Either that price has to change, or you plan on making money on the same thing all other free services run on: data.

The site isn't very upfront about it, which is the sketchy part. Other than that, it looks much more straightforward than other options (I did watch the YouTube tutorial). I like the idea, just question the motives.

Our platform is free for beta users to try, with a limit of 2GB per model. We are just starting and we haven't decided on our business model yet.

If a user downloads Panini to their private server and uses it, that will always be free, since there is no infrastructure cost for us. If you're deploying it through our website, we will charge you to cover the infrastructure cost.

Our main goal currently is to find out whether people find this product useful and whether it's worth our time to keep working on it. Thanks for watching the YouTube tutorial, and if you have further questions, please contact us. Thanks.

I'd definitely be willing to try this if it was open source.
What are you currently using to serve ML models?
I would love it if you tried it and gave me some feedback.