I would love to see what's really causing this bug. Over the weekend we read many times that we should either reboot or just run that date command - but nobody is telling us what's causing the problem.
Also, seeing that other threaded applications had similar problems, I doubt this is a Java issue - more likely a pthread, glibc or even kernel issue.
Apparently the issues might be "due to the leapsecond being added without calling clock_was_set() to notify the hrtimer subsystem of the change"; a possible fix is to patch kernel/time/timekeeping.c to be leap-second aware.
That's predominantly about the kernel crash, not the high-CPU futex issue. One of the most maddening things about this is that there have been several different issues related to leap seconds on Linux, making it all the harder to get information.
Hard to call this a Java bug when many other, non-Java things are affected. It's a critical Linux bug that causes futex waits to time out, so anything that uses futexes behaves incorrectly.
Probably Java makes heavy use of multithreading, so the kernel bug shows up as a Java bug. It just means Java really exercises the system's concurrency support.
ecopoesis, you are not the only one saying that it's a Linux bug rather than a Java bug, even though the link title says "Critical Linux bug that leads 100% CPU (leap second)".
Did the link title change from a Java title, like the article, to a Linux title to match the actual root cause?
Thank you for that link! I had been scratching my head about that server even though it wasn't mine to take care of (the other service I'm involved with here, that I helped plan, uses Postgres, which does not seem to have problems).
I saw what is likely a related issue on one of our AWS EC2 instances, where exactly at midnight UTC there was a high percentage of 'steal' CPU time in our server monitoring charts.
I wonder if this was caused by another VM on the same physical box being hit by the bug and, as a result, stealing CPU time from our VM.
I resolved the issue by moving to a different VM (rebooting didn't help), to get away from my greedy neighbor.
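For anyone who wants to check 'steal' on their own instance without a monitoring chart, here is a rough Python sketch of how that percentage can be derived from two samples of the aggregate "cpu" line in /proc/stat. The function names are made up for illustration, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, then guest fields); monitoring tools may compute it slightly differently.

```python
def parse_cpu_line(line):
    """Return (steal, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Assumed field order: user nice system idle iowait irq softirq steal [guest guest_nice]
    steal = values[7] if len(values) > 7 else 0
    # Guest time is already counted inside user/nice, so only sum the first eight fields.
    total = sum(values[:8])
    return steal, total


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    steal_before, total_before = parse_cpu_line(before)
    steal_after, total_after = parse_cpu_line(after)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal_after - steal_before) / elapsed
```

To use it, read the first line of /proc/stat, sleep a few seconds, read it again, and pass both strings to steal_percent(). A value that stays high while your own processes are idle would be consistent with a noisy neighbor, though it does not prove what the neighbor is doing.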
My rig crashed all weekend because of this POS bug. I had to boot back to Windows to get anything done (oh cmd, I really didn't miss you at all you insufferable bitch...)
So if the leap second were handled in userspace instead of the kernel, just like a normal NTP time update, all would have been fine. Why not just do that?
On Sunday I noticed that Gerrit (code review, written in Java) was chewing through CPU on one of our servers. Just applied this and it appears to have settled down.