by etaioinshrdlu
2662 days ago
First of all, my experience is that the bottleneck is the raw NN computation on either CPU or GPU, not the web server, and what the server is written in has a negligible effect on performance. So right away your claim that your backend is fast because it is written in C++ doesn't make much sense. Then, you have caching. I fail to see how any caching at all is useful on a CPU-bound task when you have unique inputs each time. This is just not something that is cacheable! Batching is one thing that can be helpful, but it typically requires deep modification of the model itself to support it, and no mention is made of that. Furthermore, batching may help throughput but may make latency WORSE, as you need to wait for multiple inputs before firing off a batch of computation. Then you fail to specify whether your model will run on a GPU or CPU, and what type / core count thereof. So, a lot of this just doesn't make much sense from a computer science perspective. Add in the free pricing with no limits and you've got an eyebrow-raising product!
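To make the batching trade-off concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: a single worker, requests arriving at a fixed interval, and a made-up cost model where each batch pays a fixed overhead plus a per-item cost. The function name and all the numbers are hypothetical and have nothing to do with panini's actual implementation or benchmark.

```python
def simulate_batching(batch_size, n_requests=1000, arrival_interval=0.030,
                      batch_overhead=0.020, per_item_cost=0.002):
    """Toy single-worker model. Returns (capacity_rps, mean_latency_s).

    All parameters are hypothetical: requests arrive every
    `arrival_interval` seconds, and one batch takes
    `batch_overhead + per_item_cost * batch_size` seconds to compute.
    """
    if n_requests % batch_size != 0:
        raise ValueError("n_requests must be a multiple of batch_size")

    service_time = batch_overhead + per_item_cost * batch_size
    # Upper bound on throughput if the worker were never idle.
    capacity_rps = batch_size / service_time

    worker_free_at = 0.0
    total_latency = 0.0
    for start in range(0, n_requests, batch_size):
        arrivals = [(start + i) * arrival_interval for i in range(batch_size)]
        # A batch cannot start before its last request has arrived,
        # nor before the worker has finished the previous batch.
        begin = max(arrivals[-1], worker_free_at)
        finish = begin + service_time
        worker_free_at = finish
        # Each request's latency includes the time spent waiting for the
        # rest of its batch to show up.
        total_latency += sum(finish - a for a in arrivals)

    return capacity_rps, total_latency / n_requests


if __name__ == "__main__":
    for size in (1, 8):
        capacity, latency = simulate_batching(size)
        print(f"batch={size}: capacity={capacity:.1f} req/s, "
              f"mean latency={latency * 1000:.0f} ms")
```

Under these assumed numbers, going from batch size 1 to 8 raises capacity from about 45 to about 222 requests/s, while mean latency goes from 22 ms to about 141 ms, because early requests sit waiting for the batch to fill. Under heavier load the picture can change, since an unbatched server that cannot keep up builds a queue of its own.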
2. It's up to you whether you run it on GPU or CPU. The benchmark was run on a CPU, but you're free to download panini via Helm and use a GPU in your private Kubernetes cluster.
3. For now, during beta testing, we're offering free inference, and the one limit is that the model size cannot exceed 2 GB.
Hope this was helpful.