by blueplanet200 1542 days ago
I hope they figure out what’s going on every morning. Heard from inside they don’t know why the db dies every day, but restarting it fixes it.
5 comments

What's "the db"? It sounds like something of small to medium scale if you can just restart it like that.

In any case, why not just relocate some vendor engineers on site for a bit? Or, better, why does the vendor not have a small presence in the corner?

Sounds like whatever "the db" is, it's probably some (objectively) small but very scary thing that's currently on fire, and people are trying to figure out how to put it out without crashing the plane or making too many waves internally, which is probably even harder. So asking about bringing in the vendor is (useful as that may be) probably going down the wrong path, in much the same way that this is probably not related to the outages (it may well be, but from the outside it's all coincidence anyway).

Cock crows. DB crashed.

Systemctl restart

mysqld

(Or mariadb, if you pronounce "SQL" as "sequel")

IIS Server had/has a memory leak in worker threads that many years ago always forced us to restart the server every few days. Starting in 6.0, they added worker thread recycling and made it mandatory to choose a time period for every thread to be recycled. Why fix the error when you can just restart the service?
Apache prefork has had that since forever. Seems like just a garbage-collection type of pattern.
For old-school mod_perl apps setting MaxRequestsPerChild was often a much better ROI than actually finding and fixing the leaks.

Speaking as somebody who's done over a decade of large-scale OO Perl applications and is actually really good at finding and fixing the leaks, this has often been intellectually aggravating. But every time I've set that option instead, I rewarded myself with a glass of bourbon for picking the pragmatic choice and then went back to adding (non-leaky) features that were far more useful to the company in question than cleaning up the older code would've been.
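The recycle-instead-of-fix pattern described above can be sketched in a few lines. This is a toy, single-process simulation, not how Apache or IIS actually implement it: `LeakyWorker`, `RecyclingPool`, and `max_requests` are hypothetical names, and the "leak" is just a list that grows on every request.

```python
class LeakyWorker:
    """Stand-in for a worker that leaks a little memory on every request."""

    def __init__(self):
        self.leaked = []   # never cleared; simulates the unfixed leak
        self.handled = 0

    def handle(self, request):
        self.leaked.append(bytearray(1024))  # "leak" 1 KiB per request
        self.handled += 1
        return f"ok:{request}"


class RecyclingPool:
    """Replaces the worker after max_requests, in the spirit of MaxRequestsPerChild."""

    def __init__(self, max_requests):
        if max_requests < 1:
            raise ValueError("max_requests must be at least 1")
        self.max_requests = max_requests
        self.worker = LeakyWorker()
        self.recycles = 0

    def handle(self, request):
        response = self.worker.handle(request)
        if self.worker.handled >= self.max_requests:
            # Throw away the worker (and everything it leaked); start fresh.
            self.worker = LeakyWorker()
            self.recycles += 1
        return response


pool = RecyclingPool(max_requests=100)
for i in range(1000):
    pool.handle(i)
print(pool.recycles, len(pool.worker.leaked))
```

The leak is still there, but it can never grow past `max_requests` requests' worth per worker, which is the whole trade. Python's own `multiprocessing.Pool` exposes the same idea as `maxtasksperchild`.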

It's not a bug, it's a pattern.

Seriously though, IIS 5.0 had no worker recycling. There was no method to fix the issue. Threads would eat up GBs of memory until you killed them.

I doubt they use IIS
MSer here, yes we do… for some things
For GitHub? It seems unbelievable that they would use IIS pre-purchase, and why in the world would you mix in a second web server for post-purchase enhancements?
Why trade an open source solution for third-rate garbage called IIS, which runs on a sub-par desktop OS called Windows? I thought that GitHub was supposed to be independent.
If GH is around the same level of integration with Microsoft as my employer, which is another Microsoft acquisition, I don't really believe you have a ton of insight into GH processes.
I dated a girl at GitHub for a while last year who said they weren’t even completely off of AWS yet, and she liked how it didn’t feel like working for Microsoft. Maybe this has changed though.
Break out the early morning restart cron job.
Here you go, GitHub:

0 4 * * * /etc/init.d/postgresql restart

I'll take an architect position as compensation, but only if there is equity.

GitHub uses MySQL primarily though.
MySQL also has a restart command! I'll take my rsus now ty.
Early morning in which timezone?
GaryOldman.gif
When the fewest users are online?
How long does restarting it take?
Yuck. Honestly, restarting a database to fix a major outage sounds like "we have no idea what we're doing"
It sounds like "they don't know why it's going down." I've worked with plenty of super competent people who have taken time to root-cause incidents.

Guide to incidents:
Step 1: Stop the bleeding
Step 2: Prevent it in the future

Doing Step 1 doesn't make you incompetent.

I'm not a DBA, and maybe you're not a DBA either, so this question goes to DBAs who may be reading: aren't you always better off killing the bad queries instead of rebooting the whole box, if that's an option? (i.e., aside from times when the entire host is screwed, load per core is >50, metrics aren't getting out, you can't ssh in, etc.)
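For what "killing the bad queries" might look like in practice, here is a minimal sketch. It assumes MySQL, assumes the rows have already been fetched from `SHOW PROCESSLIST` into dicts keyed by its column names (`Id`, `Command`, `Time`, `Info`), and uses a made-up 60-second threshold. The function name `kill_statements` and the sample rows are hypothetical. It only builds the statements and does not connect to anything.

```python
def kill_statements(processlist, max_seconds=60):
    """Return KILL QUERY statements for long-running queries, longest first."""
    offenders = [
        row for row in processlist
        if row["Command"] == "Query" and row["Time"] > max_seconds
    ]
    offenders.sort(key=lambda row: row["Time"], reverse=True)
    # KILL QUERY ends the running statement but keeps the connection open.
    return [f"KILL QUERY {int(row['Id'])};" for row in offenders]


# Hypothetical sample rows shaped like SHOW PROCESSLIST output.
sample = [
    {"Id": 11, "Command": "Sleep", "Time": 900, "Info": None},
    {"Id": 12, "Command": "Query", "Time": 5, "Info": "SELECT 1"},
    {"Id": 13, "Command": "Query", "Time": 420, "Info": "SELECT ... big join"},
    {"Id": 14, "Command": "Query", "Time": 1300, "Info": "UPDATE ... no index"},
]
for statement in kill_statements(sample):
    print(statement)
```

Idle connections (`Sleep`) and short queries are left alone. Whether this is enough to avoid the restart is exactly the open question here, since it does nothing when the host itself is unreachable.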
Sporadic database performance issues can certainly make you feel that way. They are definitely not trivially debugged at scale.
Would you rather it stay down while they spend a day debugging it?
If that means it won't be down every morning in my time zone then yes.
As long as it's announced in advance so that users/customers can plan ahead, I don't see why not.
They could use multiple writer hosts and roll the restarts across them. MySQL has had GTIDs since 5.6 and replication groups rather than writer-replicas since some 5.7.x version.
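A rolling restart along those lines could be orchestrated roughly like this. It is a sketch under assumptions, not anything GitHub is known to run: the host names are invented, and `drain`, `restart`, and `is_healthy` are hypothetical hooks the caller would have to supply for their own setup.

```python
def rolling_restart(hosts, drain, restart, is_healthy):
    """Restart hosts one at a time; stop at the first failed health check.

    drain, restart and is_healthy are caller-supplied callables (hypothetical
    hooks), each taking a host name. Returns the hosts restarted successfully.
    """
    done = []
    for host in hosts:
        drain(host)      # move traffic off this host first
        restart(host)    # e.g. trigger the service restart on that host
        if not is_healthy(host):
            # Leave the remaining hosts untouched so capacity is not lost.
            raise RuntimeError(
                f"{host} unhealthy after restart; restarted so far: {done}"
            )
        done.append(host)
    return done


# Dry run with fake hooks that only record what would happen.
log = []
restarted = rolling_restart(
    ["db1", "db2", "db3"],
    drain=lambda h: log.append(("drain", h)),
    restart=lambda h: log.append(("restart", h)),
    is_healthy=lambda h: True,
)
print(restarted)
```

Only one host is ever out of rotation, and a failed health check halts the rollout before the next host is touched.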