I think you bring up the main problem with using TCP Vegas. It's not clear to me this will work with heterogeneous requests. If the typical request time distribution is long-tailed, it might never increase the window size.
Even with a heterogeneous workload there is normally a uniform distribution of request types. Instead of generating complex statistics for average or tail latencies, which is especially hard for multimodal distributions, we just look at the minimum latency as a proxy for identifying queuing. When there is any queuing, for whatever reason (increased RPS or latency in a dependent service), all latency measurements will show an increase, including the minimum.
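To make the minimum-latency idea concrete, here's a rough sketch of a Vegas-style update rule. This is not the library's actual code: the function name `vegas_update`, the default `alpha`/`beta` values, and the limit bounds are all made up for illustration. The only thing it assumes is the usual Vegas estimate, where the queue size is the current limit scaled by how far the sampled latency sits above the no-load minimum.

```python
def vegas_update(limit, rtt_min, rtt_sample,
                 alpha=3, beta=6, min_limit=1, max_limit=1000):
    """Return the next concurrency limit from one latency sample.

    alpha, beta and the bounds are illustrative defaults, not tuned values.
    rtt_min is the smallest latency seen so far (the no-load estimate).
    """
    # Estimated number of requests waiting in a queue. It is zero when the
    # sample equals the minimum and grows as latency rises above it.
    queue = limit * (1.0 - rtt_min / rtt_sample)

    if queue < alpha:
        new_limit = limit + 1   # little queuing: probe for more concurrency
    elif queue > beta:
        new_limit = limit - 1   # queue building up: back off
    else:
        new_limit = limit       # inside the [alpha, beta] band: hold steady

    # Keep the limit inside the allowed range.
    return max(min_limit, min(max_limit, new_limit))
```

With a limit of 10 and a minimum latency of 10 ms, a 10 ms sample gives a queue estimate of 0 and the limit grows, a 20 ms sample gives 5 and the limit holds, and a 50 ms sample gives 8 and the limit shrinks.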
"uniform distribution of request types" - okay, it makes sense in that context. Although if that assumption breaks down, your thread limits may end up under- or over-provisioned.
I'm wondering though - how do you pick the right alpha and beta values? It seems like you need to do testing/validation to ensure you use the right values, right?
Sorry if I'm sounding critical by the way. I think this is a really cool project - thanks for open sourcing it!