That is really interesting. If it's not too much trouble to write out, could you explain what causes the latency difference between a wake-up by the kernel and a wake-up by another thread?
Paul Turner explained this really well at this year's Linux Plumbers Conference. The whole talk is fantastic, but the explanation of what pron is describing in particular (and how it could be improved) starts around 8:39: https://www.youtube.com/watch?v=KXuZi9aeGTw#t=519
I honestly don't know :) I was simply reporting my results from experimenting with this (I'll try to write a blog post about it sometime in the near future), so I'll defer to those with a deeper knowledge of the Linux kernel.
I have read that the Linux scheduler applies some heuristics when it can guess how soon a blocked thread will need to be woken up, so this might have something to do with that.