by gillh 888 days ago
Anyone looking to build a practical solution that uses weighted-fair queueing (WFQ) for request prioritization and load shedding should check out Aperture: https://github.com/fluxninja/aperture

Overload is quite common in generative AI apps, and handling it well takes more than a naive approach. Even when using external models (e.g. OpenAI's), developers have to deal with overload in the form of service rate limits imposed by those providers. Here is a blog post that shares how Aperture helps manage OpenAI GPT-4 overload with WFQ scheduling: https://blog.fluxninja.com/blog/coderabbit-openai-rate-limit...
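
For anyone unfamiliar with the idea, here is a rough sketch of weighted-fair queueing with load shedding. It is not Aperture's API or implementation: the `WFQScheduler` class, its method names, and the `max_queued` limit are all made up for illustration, and it uses a simple self-clocked approximation of WFQ where each request gets a virtual finish tag of `start + cost / weight` and the smallest tag is served first.

```python
import heapq
import itertools


class WFQScheduler:
    """Toy weighted-fair queue with a bounded backlog (hypothetical sketch)."""

    def __init__(self, max_queued):
        self.max_queued = max_queued   # assumed shedding threshold
        self.virtual_time = 0.0        # finish tag of the last request served
        self.last_finish = {}          # flow -> finish tag of its newest request
        self.heap = []                 # (finish_tag, seq, flow, request)
        self.seq = itertools.count()   # tie-breaker: arrival order

    def enqueue(self, flow, weight, request, cost=1.0):
        """Queue a request. Returns False if it is shed because the backlog is full."""
        if len(self.heap) >= self.max_queued:
            return False
        # A flow's next request starts where its previous one finished,
        # or at the current virtual time if the flow was idle.
        start = max(self.virtual_time, self.last_finish.get(flow, 0.0))
        finish = start + cost / weight   # higher weight -> earlier finish tag
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, next(self.seq), flow, request))
        return True

    def dequeue(self):
        """Return (flow, request) with the smallest finish tag, or None if empty."""
        if not self.heap:
            return None
        finish, _, flow, request = heapq.heappop(self.heap)
        self.virtual_time = finish
        return flow, request
```

With a "paid" flow at weight 2 and a "free" flow at weight 1, both backlogged, the paid flow gets served about twice as often, and once `max_queued` requests are waiting, new arrivals are rejected instead of piling up. In the rate-limit case the caller would only call `dequeue()` when the provider's quota allows another request.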