by gillh
888 days ago
Anyone looking to build a practical solution that uses weighted-fair queueing (WFQ) for request prioritization and load shedding should check out Aperture: https://github.com/fluxninja/aperture The overload problem is quite common in generative AI apps and calls for a more sophisticated approach than simple retries. Even when using external models (e.g. from OpenAI), developers have to deal with overload in the form of rate limits imposed by those providers. Here is a blog post on how Aperture helps manage OpenAI gpt-4 overload with WFQ scheduling: https://blog.fluxninja.com/blog/coderabbit-openai-rate-limit...
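For anyone unfamiliar with the idea, here is a rough toy sketch of WFQ with load shedding in Python. To be clear, this is not Aperture's API or implementation; the class name, the `capacity` limit, the `cost` parameter, and the "reject when full" shedding rule are all my own assumptions for illustration. Each request gets a virtual finish tag of `start + cost / weight`, so flows with a higher weight advance more slowly and get dequeued more often.

```python
import heapq
import itertools


class WeightedFairQueue:
    """Toy weighted-fair queue (hypothetical, for illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity      # max queued requests before shedding
        self.virtual_time = 0.0       # finish tag of the last dequeued request
        self.last_finish = {}         # flow -> finish tag of its latest request
        self.heap = []                # entries: (finish_tag, seq, flow, request)
        self.seq = itertools.count()  # tie-breaker, keeps arrival order on ties

    def enqueue(self, flow, weight, request, cost=1.0):
        """Queue a request; return False if it was shed because the queue is full."""
        if len(self.heap) >= self.capacity:
            return False  # simplest possible load shedding: reject new arrivals
        # A flow's next request starts where its previous one finished,
        # but never earlier than the scheduler's current virtual time.
        start = max(self.virtual_time, self.last_finish.get(flow, 0.0))
        finish = start + cost / weight
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, next(self.seq), flow, request))
        return True

    def dequeue(self):
        """Return (flow, request) with the smallest finish tag, or None if empty."""
        if not self.heap:
            return None
        finish, _, flow, request = heapq.heappop(self.heap)
        self.virtual_time = finish
        return flow, request
```

You would call `enqueue("paid", 4, req)` and `enqueue("free", 1, req)` as requests arrive, and have a worker call `dequeue()` whenever the provider's rate limit allows another call. A real system would more likely shed by priority or queueing deadline rather than plain rejection when full.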