| HN Mirror


by etaioinshrdlu 2662 days ago

First of all, my experience is that the bottleneck is pure NN computation on either CPU or GPU, and what the server is written in has a negligible effect on performance. The bottleneck is not the web server, it's the raw computation. So right away, your claim that your backend is fast because it is written in C++ doesn't make much sense.

Then, you have caching. I fail to see how any caching at all is useful on a CPU-bound task when you have unique inputs each time. This is just not something that is cacheable!

Batching is one thing that can help, but it typically requires deep modification of the model itself to support it, and no mention is made of that. Furthermore, batching may help throughput but can make latency WORSE, since you need to wait for multiple inputs before firing off a batch of computation.
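To make the throughput/latency trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes requests arrive at a fixed interval and that a batch costs a fixed overhead plus a per-item cost; the function name and all the numbers are made up for illustration and are not measurements of any real server.

```python
def batch_stats(batch_size, arrival_interval_ms, fixed_ms, per_item_ms):
    """Return (worst-case latency in ms, peak throughput in items/s) for one batch.

    Toy model: evenly spaced arrivals, batch compute time = fixed_ms + per_item_ms * batch_size.
    """
    # The first request in a batch has to wait for the other batch_size - 1 arrivals.
    wait_ms = (batch_size - 1) * arrival_interval_ms
    compute_ms = fixed_ms + per_item_ms * batch_size
    latency_ms = wait_ms + compute_ms
    # Peak throughput if batches were computed back to back.
    throughput = batch_size / compute_ms * 1000.0
    return latency_ms, throughput


if __name__ == "__main__":
    for b in (1, 8, 32):
        latency, throughput = batch_stats(
            b, arrival_interval_ms=5.0, fixed_ms=20.0, per_item_ms=1.0
        )
        print(f"batch={b:>2}  worst-case latency={latency:6.1f} ms  "
              f"peak throughput={throughput:7.1f} items/s")
```

Under these assumed numbers, larger batches raise peak throughput while the first request in each batch waits longer, which is the effect described above.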

Then you fail to specify whether your model will run on a GPU or CPU, and what type / core count thereof.

So, a lot of this just doesn't make much sense from a computer science perspective. Add in the free pricing with no limits and you've got an eyebrow-raising product!


avin_regmi 2662 days ago

1. Caching the input will save a lot of time. Inputs are not unique each time: in a production environment, many inputs are the same. Many platforms in fact do caching, such as Algorithmia, TF Serving, and SageMaker. If a lookup in a Redis database is faster than a forward pass, caching will reduce latency dramatically. Watch my YouTube video, where I give an example.

2. It's up to you whether you run it on a GPU or a CPU. The benchmark was done on a CPU, but you're free to download panini via Helm and use a GPU in your private Kubernetes cluster.

3. For now, during beta testing, we're offering free inference, with the limit that the model size cannot exceed 2GB.

Hope this was helpful.
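The lookup-versus-forward-pass argument in point 1 can be written down as simple arithmetic. This is a sketch under the assumption that every request pays the cache lookup and only misses pay the forward pass; the function names and timings are illustrative, not benchmarks.

```python
def expected_latency_ms(hit_rate, lookup_ms, forward_ms):
    """Mean latency per request when a cache sits in front of the model."""
    # Every request pays the lookup; only misses also pay the forward pass.
    return lookup_ms + (1.0 - hit_rate) * forward_ms


def break_even_hit_rate(lookup_ms, forward_ms):
    """Hit rate above which the cache lowers mean latency.

    Derived from lookup_ms + (1 - h) * forward_ms < forward_ms.
    """
    return lookup_ms / forward_ms


if __name__ == "__main__":
    # Assumed timings: 1 ms cache lookup, 50 ms forward pass.
    for h in (0.0, 0.1, 0.5, 0.9):
        print(f"hit rate {h:.0%}: {expected_latency_ms(h, 1.0, 50.0):.1f} ms")
    print(f"break-even hit rate: {break_even_hit_rate(1.0, 50.0):.0%}")
```

With a hit rate of zero the cache only adds the lookup cost, so whether it pays off depends entirely on how often inputs actually repeat.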


malux85 2662 days ago

I don’t know what sort of production you’ve been exposed to, but the inputs to a Deep Net are almost never the same.

We have hundreds of models, across many domains, real estate, energy prediction, time series crypto, video analytics, molecular modelling.

I would bet money that across the millions of predictions that we make weekly, over all of the models, no two inputs are the same.

That’s kind of the point of Deep Learning - high dimensional noisy input

Caching will not help you here


avin_regmi 2661 days ago

It really depends on the application. In content recommendation, for example, predictions for popular items are requested frequently. We maintain a prediction cache so we can serve frequent queries without passing them through the model. We also use the cache for model selection: to do this, we join the original prediction with the feedback it receives. Since feedback arrives soon after the prediction, even a unique query can benefit from the cache. Most prediction models are not deep learning these days; most companies are using classical machine learning. In our case, we trained an SVM in scikit-learn and saw a feedback throughput improvement of 1.8x. We use a simple LRU eviction policy for the cache.
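For readers who want to see what an LRU prediction cache looks like, here is a minimal sketch. It is not panini's implementation: the `PredictionCache` class, the hashing scheme, and the in-process `OrderedDict` store (instead of Redis) are all assumptions made for illustration.

```python
import hashlib
from collections import OrderedDict


class PredictionCache:
    """Hypothetical LRU cache placed in front of a predict function."""

    def __init__(self, predict_fn, capacity=1024):
        self.predict_fn = predict_fn
        self.capacity = capacity
        self._store = OrderedDict()  # key -> cached prediction, oldest first
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(features):
        # Exact-match key: any difference in the input yields a different key.
        return hashlib.sha256(repr(tuple(features)).encode("utf-8")).hexdigest()

    def predict(self, features):
        key = self._key(features)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self.predict_fn(features)
        self._store[key] = result
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return result


if __name__ == "__main__":
    cache = PredictionCache(lambda features: sum(features), capacity=2)
    for query in ([1, 2], [1, 2], [3, 4], [5, 6], [1, 2]):
        cache.predict(query)
    print(f"hits={cache.hits} misses={cache.misses}")
```

The hit and miss counters are the useful part for this discussion: running real traffic through something like this shows directly whether inputs repeat often enough for a cache to matter.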


etaioinshrdlu 2661 days ago

The point we are trying to make is that we don't think caching adds much value to the product. It is very easy to implement and doesn't help much.


avin_regmi 2661 days ago

I'll take your feedback into consideration :)
