Something we've found to be fairly lightweight (compared to e.g. Chronos) but incredibly featureful is using Jenkins (the CI server) as a cron runner. We use http://docs.openstack.org/infra/jenkins-job-builder/ to configure it at deploy time, so it lives as part of the deploy rather than system config.

Here's a small list of things we're getting out of it:

- concurrent run protection (& queue management via https://wiki.jenkins-ci.org/display/JENKINS/Concurrent+Run+B... )
- load balancing (e.g. max concurrent tasks) and remote execution with Jenkins slaves [sounds complicated, but really Jenkins just knows how to SSH]
- job timeouts. No more hanging jobs.
- failure notifications via slack/hipchat/email/whatever. [email only on status change via https://wiki.jenkins-ci.org/display/JENKINS/Email-ext+plugin ]
- log/history management: rotation & compression.
- fancy scheduling: e.g. run this job once every 24h, but if it fails keep retrying in 5 minute increments (https://wiki.jenkins-ci.org/display/JENKINS/Naginator+Plugin ). You could also use project dependencies for pipelines, but we've been staying away from that.
- monitoring: we use the datadog reporter & alert on time since last success.

Given how mature Jenkins is, this likely translates to whatever system you're using just as well. It's worked incredibly well for us. We migrated to Jenkins from crontabs with cronwrap (https://github.com/zomo/cronwrap). We're never going back.
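To make the "fancy scheduling" and "time since last success" items above a bit more concrete, here is a rough Python sketch of the two policies. It is not Jenkins, Naginator, or Datadog code, just the logic written out. The function names, the fixed five-minute retry delay, and the 26-hour staleness threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumed policy values, taken from the "once every 24h, retry in
# 5 minute increments" description; the retry delay is treated as fixed.
NORMAL_INTERVAL = timedelta(hours=24)
RETRY_INTERVAL = timedelta(minutes=5)


def next_run(finished_at: datetime, succeeded: bool) -> datetime:
    """Return when the job should run next, given how the last run ended."""
    return finished_at + (NORMAL_INTERVAL if succeeded else RETRY_INTERVAL)


def is_stale(last_success: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=26)) -> bool:
    """True when the job has gone too long without a successful run.

    The 26h default is a hypothetical threshold: the 24h schedule plus
    some slack for a few retries before anyone gets alerted.
    """
    return now - last_success > max_age
```

Alerting on staleness rather than on individual failures means a job that never starts at all still trips the alarm, as long as the thing doing the checking is not the scheduler itself.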
Once I had a job that went stray and filled the disk with logs. Since Jenkins couldn't write to the disk anymore, it stopped working completely, which meant no jobs and, more importantly, no notifications. Funny thing: there was a job to monitor free disk space, but the stray app wrote ~100GB in less than 15 minutes (damn SSDs :p).
Another time (several times, actually), the OOM killer killed a Jenkins-related process. Being a JVM-based app that starts at about 1GB of RAM use doesn't help, I guess. This led Jenkins to hang on a job; the timeout didn't work, and I couldn't even stop the job manually. Other jobs wouldn't start, and once again no notifications were sent.