by jakupovic
1061 days ago
The part about distributing load takes me back to my S3 KeyMap days, when I was trying to migrate to it from the initial implementation. What I learned is that even after you identify the hottest objects/partitions/buckets, you cannot simply move them and be done. Everything had to be sorted first. The actual solution was to sort each host's partitions by load, divide them into quartiles, and move the second-quartile partitions onto the least loaded hosts. If you tried to move the hottest buckets (the first quartile), it would put even more load on the remaining members, which would then fail, over and over again. A side effect was that the error rate went from a steady ~1% to days without any errors, so we made the alerts much stricter. This was around 2009 or so. I also came from an academic background (UM), but instead of getting my PhD I joined S3. It even rhymes :).
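
For anyone who wants the quartile idea in code, here is a rough Python sketch. It is not the actual S3 tooling: the function name `plan_moves`, the dict-shaped inputs, and the greedy "always pick the currently least loaded host" placement are all assumptions made for illustration.

```python
import heapq


def plan_moves(partition_load, host_load):
    """Plan moves of second-quartile partitions off one overloaded host.

    partition_load: {partition: load} for the overloaded host (hypothetical shape).
    host_load: {host: total load} for candidate target hosts (hypothetical shape).
    Returns a list of (partition, target_host) pairs.
    """
    # Sort partitions hottest first, then cut into quartiles by count.
    ranked = sorted(partition_load, key=partition_load.get, reverse=True)
    q = len(ranked) // 4
    # Skip the hottest quartile; only the second quartile is moved.
    second_quartile = ranked[q:2 * q]

    # Min-heap keyed on load, so the least loaded host is always on top.
    heap = [(load, host) for host, load in host_load.items()]
    heapq.heapify(heap)
    if not heap:
        return []

    moves = []
    for partition in second_quartile:
        load, host = heapq.heappop(heap)
        moves.append((partition, host))
        # Account for the moved partition before choosing the next target.
        heapq.heappush(heap, (load + partition_load[partition], host))
    return moves
```

With eight partitions the second quartile is the 3rd and 4th hottest, so `plan_moves({"p1": 80, "p2": 70, "p3": 40, "p4": 30, "p5": 20, "p6": 10, "p7": 5, "p8": 1}, {"a": 10, "b": 25, "c": 100})` leaves `p1` and `p2` where they are and sends `p3` to `a` and `p4` to `b`.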