HN Mirror


by patio11 4061 days ago

In case any other customer is wondering "Wait, I didn't hear anything from my monitoring about that and I'm retroactively worried. How worried should I be?" like I was: I just pulled our logs and reconstructed them. Over the last ~30 days, the worst-case performance of our daily backup (~150 MB per day delta, ~45 GB total post-deduplication) was about 40% longer than our typical case. This didn't trip our monitoring at the time because every backup completed successfully.

n.b. Our backups run outside of the hotspot times for Tarsnap, so we may have had less performance impact than many customers. I have an old habit of "Schedule all cron jobs to start predictably but at a random offset from the hour to avoid stampeding any previously undiscovered SPOFs." That's one of the Old Wizened Graybeard habits that I picked up from one of the senior engineers at my last real job, which I impart onto y'all for the same reason he imparted it onto me: it costs you nothing and will save you grief some day far in the future.

6 comments

vidarh 4061 days ago

Explicit support for randomizing timers across multiple hosts is a really nice feature of the timers provided by systemd:

"AccuracySec=" in *.timer files lets you specify the amount of slack systemd has in firing timers. To quote the documentation "Within this time window, the expiry time will be placed at a host-specific, randomized but stable position that is synchronized between all local timer units."

You may still want to randomize timers locally on a host too, but the above makes automated deployment of timers that affect network services very convenient.

cperciva 4061 days ago

the worst-case performance of our daily backup (~150 MB per day delta, ~45 GB total post-deduplication) was about 40% longer than our typical case

Yes, that sounds about right. I had maybe half a dozen people write to me who had noticed performance problems, and after the initial "backups failed because the server hit its connection limit" issue, it was people whose backups were already very long-running -- if your daily backups normally take 20 hours to complete, a 40% slowdown is painful.

NeutronBoy 4061 days ago

I run my backups overnight and get a status email each morning, and I didn't even realise there were performance issues until now. As you said, unless you run your backups multiple times per day, or have long-running backups, it may not have had a lot of impact.

FWIW, I live in Australia (so an 'off-peak' timezone), and schedule my cronjob on an odd minute offset, so it may not have been an issue for me anyway!

mtsmith85 4061 days ago

Hear hear on said Old Wizened Graybeard habit. The amount of pain inflicted from twenty jobs all starting up at :00 (or even :30, :45, etc.) when they could easily run at :04 or :17 can be huge. Anecdotally I once "lost" a sandbox server to a ton of developer sandbox jobs starting at :00 and not completing before the next batch started.
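For the "not completing before the next batch started" failure mode, one common guard is a lock that makes a new run skip itself while the previous one is still going. A minimal sketch, assuming a POSIX shell; `run_exclusive` and the lock path are hypothetical names, not anything from this thread:

```shell
# Run a command only if no earlier run still holds the lock directory.
# mkdir is atomic, so two jobs starting at the same instant cannot both win.
# Caveat: if the job is killed hard, the lock directory stays behind and
# has to be removed by hand.
run_exclusive() {
    lock_dir=$1
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still holds $lock_dir; skipping" >&2
        return 1
    fi
    "$@"
    rc=$?
    rmdir "$lock_dir"
    return "$rc"
}

# Hypothetical crontab usage (path and command are placeholders):
#   4 * * * * . /path/to/this/file && run_exclusive /tmp/sandbox-job.lock your-job-command
```

This does not spread load the way an offset does; it only stops runs from piling up on top of each other.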

protomyth 4061 days ago

Funny part to that: I was on a project with multiple teams and multiple crontabs. Each team took that advice to heart for some jobs. Sadly, we had too many Hitchhiker fans and :42 became a bit too common.

kijin 4061 days ago

Use the following shell command to decide when to run cron jobs.

    echo $((RANDOM % 60))

It's not a CSPRNG, but good enough for this kind of load balancing!

cperciva 4061 days ago

Or schedule your cron job for :00, but add "sleep `jot -r 1 0 3600` &&" to the start of the command. (jot is a BSDism, but I assume you can do the same with GNU seq.)

rlpb 4061 days ago

This is a pain when deciphering a series of events later, though, because you don't know when a particular job was supposed to start. I'd prefer the delay to be stable on a per-host basis.
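A delay that is stable per host can be sketched by hashing the hostname instead of drawing a fresh random number on each run. This is only an illustration, assuming a POSIX shell with `cksum` and `cut` available and 64-bit shell arithmetic; `stable_offset` is a made-up name:

```shell
# Print a delay in seconds, in the range [0, window), derived from a host name.
# The same host name always yields the same delay, so the intended start
# time of a job can be recomputed later when reading logs.
stable_offset() {
    host=$1
    window=$2
    # cksum prints "<crc> <byte count>" for stdin; keep only the CRC.
    sum=$(printf '%s' "$host" | cksum | cut -d ' ' -f 1)
    echo $(( sum % window ))
}

# Hypothetical crontab usage for a daily job (command name is a placeholder):
#   0 3 * * * . /path/to/this/file && sleep "$(stable_offset "$(hostname)" 3600)" && your-backup-command
```

Different hosts can still land on the same offset, since many names map into one window; the point is spreading, not uniqueness.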

JoachimSchipper 4061 days ago

Don't use that for hourly jobs, though - things are liable to break when you randomly run a command at, say, 12:59 and 13:00.

cperciva 4060 days ago

Right, I usually do that for my daily jobs.

junkblocker 4061 days ago

sleep $[RANDOM/3600] works everywhere without requiring jot/seq etc. on BSD/Mac/Linux.

hjnilsson 4061 days ago

That will be a number between 0 and 9 ($RANDOM only goes to 32767, and the division is integer division), sleep $[RANDOM/10] would be better. :)

This might be platform dependent though, I can't find any standard RAND_MAX in bash so it's difficult to make this work everywhere.

cperciva 4061 days ago

s/\//%/ I assume?
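The difference between the two operators is easy to check with fixed inputs. A small sketch, assuming bash's documented $RANDOM range of 0 to 32767; the function names are made up for illustration and take the raw value as an argument so the arithmetic is repeatable:

```shell
# Division scales the raw value down: 0..32767 becomes 0..9 for a 3600 divisor.
delay_by_division() {
    echo $(( $1 / $2 ))
}

# Modulo wraps the raw value into 0..(window - 1), which is what a
# "sleep up to an hour" delay needs.
delay_by_modulo() {
    echo $(( $1 % $2 ))
}

# Largest possible raw value:
#   delay_by_division 32767 3600   prints 9
#   delay_by_modulo   32767 3600   prints 367
# Usage in bash: sleep "$(delay_by_modulo "$RANDOM" 3600)"
```

Because 32768 is not a multiple of 3600, the modulo version slightly favours delays from 0 to 367, which should not matter for this kind of load spreading.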

protomyth 4061 days ago

We just went with a single group text file with all the jobs and which ones could be spread out. Saves the programming and gives the sys admins / DBAs an idea what goes when.

NDizzle 4061 days ago

Don't run on :17 and :39. Those are mine. Thanks!

pquerna 4061 days ago

One way to think about your fear is, shouldn't that just be a tarsnap feature?

Add some metadata telling tarsnap to expect a daily/weekly/monthly backup from a given machine, and have it send you an email if it doesn't get one?

patio11 4061 days ago

whistles

Until the day when Colin considers it in-scope for Tarsnap, I recommend Dead Man's Snitch for this purpose. I literally spend more on DMS to monitor Tarsnap than I spend on Tarsnap. No, I don't think that is just, either.
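The general shape of this kind of "dead man's switch" monitoring is: run the backup, and only on success ping a check-in URL; the monitoring service alerts when the ping stops arriving. A rough sketch, assuming `curl` is installed; the URL, `checkin`, and `run_and_checkin` are placeholders for illustration, not any particular service's API:

```shell
# Ping a check-in URL. -f makes HTTP errors count as failures, -sS keeps
# output quiet but still shows errors, and the timeout/retry values are guesses.
checkin() {
    curl -fsS --max-time 10 --retry 3 "$1" > /dev/null
}

# Run a command and check in only if it succeeded. A failed, hung, or
# never-started backup therefore shows up as a missing ping.
run_and_checkin() {
    url=$1
    shift
    "$@" || return $?
    checkin "$url"
}

# Hypothetical usage (URL and command are placeholders):
#   run_and_checkin "https://checkin.example.invalid/your-token" your-backup-command
```

The check-in has to go to something outside the machine being backed up, otherwise losing the machine also loses the alarm.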

RexM 4061 days ago

For those interested in patio11's thoughts on how he would run Tarsnap: http://www.kalzumeus.com/2014/04/03/fantasy-tarsnap/

And the discussion on HN: https://news.ycombinator.com/item?id=7523953

dlgeek 4061 days ago

Did Colin ever reply to that? I've always wondered what his response was.

snowwrestler 4061 days ago

In this thread: https://news.ycombinator.com/item?id=9496561

ma2rten 4061 days ago

He did just now:

https://news.ycombinator.com/user?id=cperciva

ploxiln 4061 days ago

Don't you have some other servers running other services? So you must already have some monitoring and alerting system like Nagios, to which you can add one more little "passive check" that does the same thing, for no incremental cost?

patio11 4061 days ago

I have roughly fourish separate monitoring systems for Appointment Reminder. DMS is the one which is least tied to me, so I use it for Tarsnap (the most critical thing about AR that can fail "quietly") and as the fourthish line of defense for the core AR functionality.

(This may be slightly overbuilt, but I felt it justified to get peace of mind, given AR's fair degree of importance to customers/myself and the enterprise-y customer base. In particular, I would not have been happy with any monitoring solution which would fail if I lost network connectivity at the data center.)

$15 a month is far below my care floor for making sure that my backups are working and that I do not get sued into bits.

ploxiln 4061 days ago

Touché :)

StavrosK 4061 days ago

I'll second it (https://deadmanssnitch.com/). It's such a useful tool, it's saved my bacon more than once.

jldugger 4061 days ago

We actually have our Chef rdiff backup cookbook randomly distribute jobs across buckets of time using a hash function of the hostname.

sillysaurus3 4061 days ago

I have to know: Why a hash function of the hostname?

pliu 4061 days ago

The chef-client cookbook does a similar thing in its cron recipe:

  # Generate a uniformly distributed unique number to sleep.
  if node['chef_client']['splay'].to_i > 0
    checksum   = Digest::MD5.hexdigest(node['fqdn'] || 'unknown-hostname')
    sleep_time = checksum.to_s.hex % node['chef_client']['splay'].to_i
  else
    sleep_time = nil
  end

https://github.com/opscode-cookbooks/chef-client/blob/master...

This is random enough so you won't kill the server, and deterministic so the resource isn't always changing every Chef run.

toomuchtodo 4061 days ago

No hash collisions, hostnames (in almost all practical environments) are never identical.

jldugger 4060 days ago

It's more or less random, but stable.

vacri 4061 days ago

I ran into this recently, backing up munin data to s3. I ran it at a time point offset from an hour to avoid those 'on-the-hour' rushes, but I was getting problems with the copy. Took me a moment to realise I was doing it on a 5-minute boundary, and munin fires on a 5-minute boundary - the data was being updated as I was copying it...

mental note: think harder, next time.
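One way to sidestep both the on-the-hour rush and 5-minute boundaries like munin's is to map any number (a hash, a random draw) onto the 48 minutes of the hour that are not multiples of 5. A small sketch; `minute_off_boundary` is a made-up helper, and it assumes a non-negative integer input:

```shell
# Map a non-negative integer onto a minute in 1..59 that is never a
# multiple of 5. There are 48 such minutes; every block of 4 usable
# minutes is followed by one skipped boundary minute.
minute_off_boundary() {
    idx=$(( $1 % 48 ))
    echo $(( idx + idx / 4 + 1 ))
}

# Examples:
#   minute_off_boundary 0    prints 1
#   minute_off_boundary 4    prints 6   (minute 5 is skipped)
#   minute_off_boundary 47   prints 59
```

This only dodges the boundary itself; a copy that takes several minutes can still overlap the next 5-minute update.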