by mrsilencedogood
887 days ago
My experience is that Linux is rock solid as long as you're not running it on super duper expensive hardware and doing crazy-big things on it. Over my career so far, the notable kernel panic causes were:

- When a Spark job finished and deallocated close to a TB of memory, kernel panic. Jobs using less than 750GB typically didn't see this, so it was something in that range. It just kind of stopped happening after we updated the kernel and Spark in a semi-unrelated push, so we never really got a root cause.
- Bad hardware.
- A Spark job that was doing simply insane amounts of shuffle output (which goes to disk) was hitting kernel panics, which ended up being related to a kernel bug that only impacted applications with ridiculously high disk I/O, with some additional spin that made me think "ah, so this is basically only affecting Spark jobs".
- Bad hardware.

Did I mention bad hardware? I've spent way too much time hunting down "bugs" that ended up just being a bad mobo, and Linux was kind enough to inform us of it. You hear "but this is the only program that causes the kernel panics!", and yet when we move it to a temp server for a few days the program mysteriously stops crashing.

Another reason I do like "the cloud": I can just cycle out an EC2 box I suspect is bad instead of fighting with the IT guy about whether the 2-year-old expensive server is already busted or not.