|
|
|
|
|
by Hackbraten
1223 days ago
|
|
I used to work in a project where every other rolling upgrade (Amazon’s managed Kafka offering) would crash all the Streams threads in our Tomcat-based production app, causing downtime due to us having to restart. The crash happens 10–15 minutes into the downtime window of the first broker. Absolutely no one has been able to figure out why, or even to reproduce the issue. Running out of things to try, we resorted to randomly changing all sorts of different combinations of consumer group timeouts, which are imho poorly documented so no one really understands which means which anyway. Of course all that tweaking didn’t help either (gunshot debugging never does). This has been going on for the last two years. As far as I know, the issue still persists. Everyone in that project is dreading Amazon’s monthly patch event. |
|