Critical Linux bug that leads 100% CPU (leap second)


	Critical Linux bug that leads 100% CPU (leap second) (blog.wpkg.org)
	139 points by yekmer 5143 days ago

15 comments

pilif 5143 days ago

I would love to see what's really causing this bug. We read so many times over the weekend to either reboot or just run that date command - but nobody is telling us what's causing the problem.

Also, seeing that other threaded applications had similar problems, I doubt this is a Java issue. More likely it's a pthread, glibc or even kernel issue.


altxwally 5143 days ago

The patch that was shared on the lkml shows some insight on what is causing the issue. https://lkml.org/lkml/2012/7/1/27

Apparently the issues might be due "to the leapsecond being added without calling clock_was_set() to notify the hrtimer subsystem of the change", a possible fix being to patch kernel/time/timekeeping.c to be leapsecond aware.
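To make the quoted explanation concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the class and method names are made up for illustration (only `clock_was_set` is borrowed from the quote), and it models a timer subsystem that keeps its own cached copy of the wall clock. It is not kernel code.

```python
class ToyTimekeeper:
    """Toy model only: not kernel code, all names here are hypothetical."""

    def __init__(self, wall_clock=100.0):
        self.wall_clock = wall_clock   # what callers read as "now"
        self.timer_view = wall_clock   # timer subsystem's cached idea of "now"

    def clock_was_set(self):
        # Stand-in for the notification the quoted patch says was missing.
        self.timer_view = self.wall_clock

    def insert_leap_second(self, notify):
        self.wall_clock -= 1.0         # the wall clock repeats one second
        if notify:
            self.clock_was_set()

    def sleep_time(self, timeout):
        """Seconds a timed wait actually sleeps before its timer fires."""
        deadline = self.wall_clock + timeout
        return max(deadline - self.timer_view, 0.0)


buggy = ToyTimekeeper()
buggy.insert_leap_second(notify=False)   # leap second without notification
fixed = ToyTimekeeper()
fixed.insert_leap_second(notify=True)    # leap second with notification

print(buggy.sleep_time(0.5), fixed.sleep_time(0.5))  # prints "0.0 0.5"
```

In this model, any sub-second timed wait on the un-notified instance returns immediately, so a program that waits in a loop would spin instead of sleeping, which would look like the reported load spike.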


gaius 5143 days ago

There is a good explanation here: http://serverfault.com/q/403732/58037


agwa 5143 days ago

That's predominantly about the kernel crash, not the high-CPU futex issue. One of the most maddening things about this is that there have been several different issues related to leap seconds on Linux, making it all the harder to get information.


ajays 5143 days ago

This seems like the best explanation I've found so far: https://lkml.org/lkml/2012/7/1/203


pilif 5143 days ago

Agreed. Also it clearly accounts for the futex related load issues and it even gives nice and readable C code to see the problem happening.

This explains it for me. Thanks a lot for the pointer.


xxpor 5142 days ago

A good explanation from Reddit: http://www.reddit.com/r/programming/comments/vxmf7/time_arit...


ecopoesis 5143 days ago

Hard to call this a Java bug when many other, non-Java things are affected. It's a critical Linux bug that causes futex to time out, and anything that uses it to behave incorrectly.

https://lkml.org/lkml/2012/7/1/11


ww520 5143 days ago

It's probably that Java heavily utilizes the multi-thread support and the kernel bug is showing up as a Java bug. It just means Java really exercises the system's concurrent support.


tommi 5143 days ago

ecopoesis, you are not the only one saying that it's a Linux bug instead of a Java bug, even though the link title says "Critical Linux bug that leads 100% CPU (leap second)".

Did the link title change from a Java title, like the article, to a Linux title to match the actual root cause?


davidw 5143 days ago

> Did the link title change from a Java title, like the article, to a Linux title to match the actual root cause?

Yes, it did.


yekmer 5143 days ago

Our company uses HBase, Elastic Search, GitBlit, SmartFox Server and Jetty, which have all been hit by this bug. MySQL is said to be affected too: http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-sec...


davidw 5143 days ago

Thank you for that link! I had been scratching my head about that server even though it wasn't mine to take care of (the other service I'm involved with here, that I helped plan, uses Postgres, which does not seem to have problems).


pjmlp 5143 days ago

This is a Linux kernel bug, not a JVM bug.


mcescalante 5143 days ago

Yeah, NTP is Linux kernel, but the JVM is what's eating the CPU after the clock leap.


jbellis 5143 days ago

no, it's the kernel livelocking in response to a call made by the jvm


jhund 5143 days ago

I saw what is likely a related issue on one of our AWS EC2 instances, where exactly at midnight UTC there was a high percentage of 'steal' CPU time in our server monitoring charts.

I wonder if this was caused by another VM on the same physical box being hit by the bug and, as a result, stealing CPU time from our VM.

I resolved the issue by moving to a different VM (Rebooting didn't help), to get away from my greedy neighbor.

More info here: http://blog.thinrhino.net.in/cpu-steal-time
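For anyone who wants to check steal time without a monitoring chart, here is a minimal Python sketch. It assumes the standard Linux `/proc/stat` layout, where the aggregate `cpu` line lists user, nice, system, idle, iowait, irq, softirq and steal, followed by the guest fields. The function name `steal_fraction` is made up for this example.

```python
def steal_fraction(before, after):
    """Share of CPU time reported as 'steal' between two aggregate
    'cpu' lines sampled from /proc/stat (hypothetical helper)."""

    def fields(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(x) for x in parts[1:]]

    b, a = fields(before), fields(after)
    delta = [y - x for x, y in zip(b, a)]
    # Sum user..steal only; guest time is already counted in user/nice.
    total = sum(delta[:8])
    return delta[7] / total if total else 0.0
```

To use it, read the first line of `/proc/stat` twice, a few seconds apart, and pass both samples in. A value that stays high would be consistent with a busy neighbor on the same host, though it does not prove the cause.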


j_col 5143 days ago

So that explains why the 12 cores on my Fedora workstation were maxed-out when I came to work this morning!


freestyler 5143 days ago

There is a list of applications affected by this kernel bug: http://blog.windfluechter.net/content/blog/2012/07/01/1481-1...


JVIDEL 5143 days ago

Oh man so that was causing it!

My rig crashed all weekend because of this POS bug, I had to boot back to Windows to get anything done (oh cmd, I really didn't miss you at all you insufferable bitch...)

Any fixes?


gcr 5143 days ago

There's a fix in the article.


regularfry 5143 days ago

The easier-to-type 'sudo date -s "`date`"' seemed to work for me.
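As far as I understand the LKML patch description linked earlier in the thread, the date trick works because setting the clock, even to the value it already has, makes the kernel notify the timer subsystem. Below is a rough Python equivalent, assuming Linux and Python 3.3+. The function name and its injectable `get`/`set_` parameters are made up for illustration, and really setting the clock needs root.

```python
import time


def reset_clock_to_itself(get=time.clock_gettime, set_=time.clock_settime):
    """Read CLOCK_REALTIME and write the same value straight back.

    Hypothetical helper: the get/set_ parameters exist only so the
    function can be tried out without touching the real clock.
    """
    now = get(time.CLOCK_REALTIME)
    set_(time.CLOCK_REALTIME, now)   # raises PermissionError without root
    return now
```

Calling `reset_clock_to_itself()` with no arguments as root should have roughly the same effect as the shell one-liner. The shell version rounds to whole seconds, while this one writes back the exact value it read.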


kzrdude 5143 days ago

So if the leap second was handled in userspace instead of the kernel, just like a normal ntp time update, all would have been fine. Why not just do that?


e40 5143 days ago

On Sunday I noticed that Gerrit (code review, written in Java) was chewing through CPU on one of our servers. Just applied this and it appears to have settled down.


coldskull 5143 days ago

well, our hadoop cluster went bonkers because of this bug....luckily it was on stage...not production!


danielhlockard 5143 days ago

Yeah, I ended up rebooting our production hadoop cluster, it all came back up fine, and we don't have too many people using it yet.


geetee 5143 days ago

Hey, remember that time I spent a couple hours frantically checking logs and restarting services?


abc_lisper 5143 days ago

Does this happen on Android too?


agentgt 5143 days ago

What a PITA
