by majke
207 days ago
> Coordinator sees Node A has significantly fewer rows (logical count) than the cluster average. It flags Node A as "underutilized."

Ok, so you are dealing with a classic: you measure A, but what matters is B. For "load" balancing, a decent metric is, well, response time (and jitter). For data partitioning, I guess the number of rows is not the right metric? Change it to number*avg_size or something? If you can't measure the thing directly, then take a look at stuff like a "PID controller". This can be approached as a typical control loop problem, although in 99% of cases doing PID for software systems is overkill.
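To make the number*avg_size idea concrete, here is a rough Python sketch. Everything in it is made up for illustration (the `NodeStats` class, the `underutilized` helper, the 0.5 threshold, the sample numbers); it only shows how the same "below the cluster average" check can give a different answer once rows are weighted by size.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NodeStats:
    # Hypothetical per-node stats; field names are assumptions, not a real API.
    name: str
    row_count: int
    avg_row_bytes: float

    @property
    def estimated_bytes(self) -> float:
        # The "number * avg_size" proxy suggested above.
        return self.row_count * self.avg_row_bytes


def underutilized(
    nodes: List[NodeStats],
    metric: Callable[[NodeStats], float],
    threshold: float = 0.5,  # assumed cutoff: below half the cluster mean
) -> List[str]:
    """Return names of nodes whose metric is below threshold * cluster mean."""
    values = [metric(n) for n in nodes]
    mean = sum(values) / len(values)
    return [n.name for n, v in zip(nodes, values) if v < threshold * mean]


nodes = [
    NodeStats("A", 1_000, 100_000.0),  # few rows, but each row is large
    NodeStats("B", 100_000, 1_000.0),
    NodeStats("C", 100_000, 1_000.0),
]

by_rows = underutilized(nodes, lambda n: n.row_count)
by_bytes = underutilized(nodes, lambda n: n.estimated_bytes)
```

With these sample numbers, the row-count metric flags A, while the size-weighted metric flags nobody, since all three nodes hold roughly the same estimated number of bytes.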
|
You are right that we need better backpressure. Instead of a smarter coordinator, we probably need 'dumber' nodes that aggressively shed load (return 429s) the moment local pressure spikes, rather than waiting for a re-balance.
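
A minimal sketch of what I mean by a 'dumber' node, in Python. The `SheddingNode` class and the in-flight request limit are assumptions for illustration; a real node would probably key off queue depth, memory, or latency instead. The point is only that the decision is purely local and immediate.

```python
from dataclasses import dataclass


@dataclass
class SheddingNode:
    # Hypothetical local pressure limit; in-flight requests is just one
    # possible signal.
    max_in_flight: int
    in_flight: int = 0

    def try_accept(self) -> int:
        """Return 200 if the request is admitted, 429 if it is shed."""
        if self.in_flight >= self.max_in_flight:
            # Shed right away; no coordinator or re-balance involved.
            return 429
        self.in_flight += 1
        return 200

    def finish(self) -> None:
        """Mark one admitted request as completed."""
        self.in_flight = max(0, self.in_flight - 1)


node = SheddingNode(max_in_flight=2)
statuses = [node.try_accept() for _ in range(3)]  # third request is shed
node.finish()                                     # one request completes
after_release = node.try_accept()                 # capacity is available again
```

Clients that get a 429 would back off or retry elsewhere, which is where the backpressure actually comes from.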