I'm no server admin, but it seems to be a recurring theme where big issues are narrowed down to disk space running out. Is there not something that can automatically check this and send out alerts?
There are a lot of solutions that range from "solve this immediate problem now with minimal work by me" to "solve this problem and a host of problems that I don't have now but will have in the future". The trick is figuring out where to be on the spectrum.
For example, perhaps the simplest solution would be to cron a script that checks 'df' output and sends an email as soon as you hit some reasonable threshold.
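As a rough sketch of that cron-able check, here is one way it could look in Python. The function names and the 80% threshold are made up for illustration, and it leans on cron mailing any output to MAILTO rather than sending mail itself. Note that `shutil.disk_usage` computes used/total, which can differ by a few points from the Use% column that 'df' prints.

```python
import shutil


def usage_percent(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # pseudo-filesystems can report a zero size
        return 0.0
    return 100.0 * usage.used / usage.total


def check_disk(path="/", threshold=80.0):
    """Return a warning string if usage is at or above `threshold`, else None."""
    pct = usage_percent(path)
    if pct >= threshold:
        return f"WARNING: {path} is {pct:.1f}% full (threshold {threshold:.0f}%)"
    return None


if __name__ == "__main__":
    message = check_disk()
    if message:
        # cron mails whatever a job prints, provided mail is set up on the box
        print(message)
```

Staying silent when everything is fine means you only get mail when there is something to act on.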
More complex but significantly more powerful is running something along the lines of Nagios to monitor not only disk usage, but a plethora of other system-level checks.
Once that road is walked it's not a big leap to start monitoring the application itself.
Why stop there? If you've got your metrics system (like Graphite) up and running, you can pull in raw metrics and trend your disk usage over time. Write a script that pulls in the raw data (add rawData=true to your parameters in Graphite) and then set thresholds on that. Have Graphite take the standard deviation of your disk metric and now you're alerting not only on an absolute threshold, but monitoring for sudden spikes in activity.
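To make the Graphite idea a bit more concrete, here is a hedged sketch of the script side. It assumes the raw format looks like `target,start,end,step|v1,v2,...` with `None` for missing points, and it computes the standard deviation in the script rather than having Graphite do it, which is a simplification of what's described above. The metric name and the thresholds are invented.

```python
import statistics


def parse_raw_line(line):
    """Split one raw Graphite line into (target, start, end, step, points)."""
    header, _, values = line.strip().rpartition("|")
    # Targets can contain commas (function calls), so split from the right.
    target, start, end, step = header.rsplit(",", 3)
    points = [float(v) for v in values.split(",") if v and v != "None"]
    return target, int(start), int(end), int(step), points


def evaluate(points, max_percent=90.0, spike_sigmas=3.0):
    """Return the list of alert kinds triggered by the latest data point."""
    alerts = []
    if not points:
        return alerts
    latest, history = points[-1], points[:-1]
    if latest >= max_percent:
        alerts.append("threshold")  # plain absolute limit
    if len(history) >= 2:
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        # A jump far outside the recent spread counts as a sudden spike.
        if stdev > 0 and abs(latest - mean) > spike_sigmas * stdev:
            alerts.append("spike")
    return alerts


if __name__ == "__main__":
    # Hypothetical metric name and values, for illustration only.
    sample = "servers.web01.disk.used_percent,1400000000,1400000360,60|40.0,40.5,41.0,40.8,None,72.0"
    target, _, _, _, points = parse_raw_line(sample)
    print(target, evaluate(points))
```

In the sample the disk is nowhere near 90%, but the jump from about 41 to 72 is well outside the recent spread, so only the spike check fires. That's the kind of thing an absolute threshold alone would miss until much later.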
You may also very well be able to get "more complex" without your own infrastructure ... with the tradeoff being money and relying on 3rd party SaaS. There are pros and cons involved here.
Circle back, for a second, though. Putting in a complex solution that gives you the kitchen sink requires time and money. Nagios and Graphite are adding a layer of complexity that may be totally overblown for your needs at the moment. SaaS might not fit the bill. Right now may NOT be the time to go all crazy. So start simple. Get that cron job in place today, gain a little peace of mind, and then figure out what your next steps should be.
1. Set up an alert at a conservative usage threshold to make sure nothing like this can happen
2. See alert and know you have plenty of time to fix the issue
3. Get distracted
4. Disk space disaster
We use AppFirst for our monitoring alerts. One thing they don't support is sending recurring alerts while something is over a threshold. They only send when thresholds are crossed.
Right now we're experimenting with having PagerDuty read the AppFirst alerts and treat each one as an open issue.
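If you were rolling this yourself, the "keep nagging while it's still over the threshold" behavior is a small amount of state. This is only a sketch with invented names; it is not how AppFirst or PagerDuty work internally, and it leaves out actually sending anything.

```python
class RepeatingAlert:
    """Notify on crossing a threshold, then again every `repeat_every` seconds."""

    def __init__(self, threshold, repeat_every):
        self.threshold = threshold
        self.repeat_every = repeat_every
        self.last_sent = None  # None means there is no open alert

    def should_notify(self, value, now):
        """Return True if a notification is due for `value` at time `now`."""
        if value < self.threshold:
            self.last_sent = None  # recovered: close the alert
            return False
        if self.last_sent is None or now - self.last_sent >= self.repeat_every:
            self.last_sent = now  # first crossing, or the reminder is due
            return True
        return False


if __name__ == "__main__":
    alert = RepeatingAlert(threshold=80, repeat_every=3600)
    # (disk percent, seconds since start) readings, made up for illustration
    for value, now in [(70, 0), (85, 60), (86, 120), (87, 3660), (50, 3700)]:
        print(now, value, alert.should_notify(value, now))
```

A crossing-only alert would fire once at the 85 reading and then go quiet, which is exactly how step 3 in the list above happens.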
Yes, and everyone should be doing so. For EC2, Amazon provides sample CloudWatch scripts[1] that report additional metrics, including storage space. All server monitoring tools and services can (and should) watch your disk space.
If you're not monitoring basic problems like disk utilization and RAM you're just asking for unnecessary downtime.
Parsing the output of df in a cronjob and echoing an error message is a trivial thing to do. Run that cronjob every 30 minutes and configure mail correctly on your box.
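Something like the following would do it, as a minimal sketch. It assumes the POSIX output format from `df -P` (six columns, capacity in the fifth) and a made-up 90% threshold; filesystem names containing spaces would confuse the split.

```python
import subprocess


def parse_df(output):
    """Map mount point -> percent used, given the text printed by `df -P`."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # first line is the header
        fields = line.split(None, 5)  # mount point may itself contain spaces
        if len(fields) < 6:
            continue
        capacity = fields[4].rstrip("%")
        if not capacity.isdigit():  # some pseudo-filesystems print "-"
            continue
        usage[fields[5]] = int(capacity)
    return usage


def over_threshold(usage, threshold=90):
    """Return only the mount points at or above `threshold` percent."""
    return {mount: pct for mount, pct in usage.items() if pct >= threshold}


if __name__ == "__main__":
    # df can exit non-zero when one mount is unreadable, so don't treat that as fatal.
    result = subprocess.run(["df", "-P"], capture_output=True, text=True)
    for mount, pct in sorted(over_threshold(parse_df(result.stdout)).items()):
        print(f"ERROR: {mount} is {pct}% full")
```

A crontab entry along the lines of `*/30 * * * * /usr/local/bin/check_df.py` (the path is hypothetical) runs it every 30 minutes, and cron mails the output whenever anything is printed.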
This was the only alert I wrote myself for my startup (the rest are powered by @newrelic). Saved me many times. Usually only happens when some log goes out of control unexpectedly.