by InvaderFizz 1576 days ago
Agreed. It's sad how many SRE candidates position themselves as senior and turn out to have little more than surface-level knowledge of AWS managed services.

My approach is relatively similar to the article's:

1. I have them walk me through a production Dockerfile and explain what is going on (it is not a very complex Dockerfile).

2. I have them troubleshoot a broken web server with the most basic scenario (public web server that we run, ec2 with apache2, public ip, no load balancers, no cdn, just a VM serving a static html page on xyz.actualdomain.com).

Things that are broken:

1. NXDOMAIN (I make sure to unset the DNS before the interview)

2. Security group isn't allowing 80/443

3. NACL isn't allowing egress

4. iptables isn't allowing 80/443

5. Apache2 is stopped

6. Apache2 is bound to 127.0.0.1:80/443

7. Self signed TLS

I don't require specific incantations; I know I can't write the iptables insert without looking up an example. But I do expect candidates to know where and what to look for to troubleshoot the entire chain. They are allowed to run any command they want (I'm running it on a screen share). If they get stuck on a portion, they get some hints, and finally they are given the answer to get past it. My goal is not a gotcha. My goal is to see how they attack the problem and whether they are at least familiar with how things can break and have guesses at fixes.
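To illustrate how the symptoms at each layer differ, here is a minimal Python sketch (not something used in the interview). It assumes only the standard library; the function names classify and probe are hypothetical, and the mapping from error to cause is a heuristic, not a diagnosis.

```python
import socket
import ssl


def classify(exc: BaseException) -> str:
    """Guess which layer of the chain most likely produced this failure."""
    if isinstance(exc, socket.gaierror):
        # Name resolution failed before any packet reached the server.
        return "dns: name did not resolve (e.g. NXDOMAIN)"
    if isinstance(exc, ssl.SSLCertVerificationError):
        # TCP worked, but the certificate chain was not trusted.
        return "tls: certificate not trusted (e.g. self-signed)"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        # No answer at all usually means something silently drops packets.
        return "filter: packets dropped (security group, NACL, or iptables DROP)"
    if isinstance(exc, ConnectionRefusedError):
        # The host answered, but nothing accepted the connection.
        return "service: refused (stopped, bound to loopback only, or a REJECT rule)"
    return f"other: {exc!r}"


def probe(host: str, port: int, use_tls: bool = False, timeout: float = 5.0) -> str:
    """Resolve, connect, and optionally do a verified TLS handshake."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if use_tls:
                ctx = ssl.create_default_context()  # verifies chain and hostname
                with ctx.wrap_socket(sock, server_hostname=host):
                    pass
        return "ok"
    except OSError as exc:
        return classify(exc)


# Example (performs real network I/O, so it is left commented out):
# print(probe("xyz.actualdomain.com", 80))
# print(probe("xyz.actualdomain.com", 443, use_tls=True))
```

A timeout and a refusal point in different directions, which is why it helps to look at the exact failure before guessing at a fix.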

Fully a third of interviewees get stuck on NXDOMAIN, which is just shocking to me when I am interviewing people who, on paper, have over a decade of deep Linux and cloud experience.

To me, the scenario I present is basic troubleshooting and something that should be a breeze for most candidates.

1 comments

While I too have seen strange knowledge gaps in interviews (interviews for JS developers who can't write a function that adds two numbers), the NXDOMAIN issue doesn't surprise me too much.

Maybe it's different for SREs, but as a fullstack web developer, unless you've got a greenfield project, usually someone else sorted out DNS a long time ago. And unless you're working on something that changes DNS a ton, nobody has touched DNS in a long time.

Additionally, I've seen NXDOMAIN way more often as a local machine configuration problem, rather than as a production environment DNS problem.

So if I'm going in to debug a server, but then I see an NXDOMAIN, I could see myself getting stuck wondering just what else in the world is broken. If I was doing the test on my own hardware, I might panic that my machine is in a bad state. If I was doing the test on hardware the interviewer provided, I might start wondering if this is some kind of trick, and I have to debug a broken client and a broken server.

Then again, maybe if I went back to those people who couldn't add two numbers in JS, they'd have a great explanation too :)

For SRE, DNS is a relatively common issue to troubleshoot. Something silently fails in a deploy, someone went mucking around with DNS by hand when they shouldn't have, your local resolver might just be borked.

Specifically for us, since we are a SaaS provider with new customer environments in their own VPCs every week, DNS is something we touch regularly. We touch it waaay more by hand than we should, but that is one of many processes I am fixing and automating to remove the cognitive load and human errors.

You are right that it could be a client issue, but when candidates start down that path and won't let it go, I tell them to pretend they're on a residential internet connection, no corporate shenanigans of any type.

I also expect them to know how to rule out their local machine quickly (they can always run nslookup xyz.actualdomain.com 8.8.8.8). I intentionally use real problems I have encountered so that the scenario is something they can expect to fix (not usually all at once).
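For anyone curious what that nslookup check does under the hood, here is a rough Python sketch that sends a single A query straight to a chosen resolver, bypassing the local one, and reports the response code. It uses only the standard library; build_query, rcode_of, and ask are hypothetical names, and the sketch ignores retries, TCP fallback, and answer parsing.

```python
import socket
import struct

# A few common DNS response codes (RCODE is the low 4 bits of the flags field).
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, ended by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # QTYPE=A (1), QCLASS=IN (1)
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def rcode_of(response: bytes) -> str:
    """Extract the response code name from a raw DNS response."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def ask(name: str, resolver: str = "8.8.8.8", timeout: float = 3.0) -> str:
    """Send the query over UDP to the given resolver and return the RCODE name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver, 53))
        response, _ = sock.recvfrom(512)
    return rcode_of(response)


# Example (performs real network I/O, so it is left commented out):
# print(ask("xyz.actualdomain.com"))
```

If a public resolver also answers NXDOMAIN, the local machine is very unlikely to be the problem and the record itself is worth checking.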

I'll probably revise the process next week with a Terraform deployment that builds the scenario automatically for each interview. I'll be asking candidates to send me an SSH pubkey before the interview. I also need to get another AWS sub-account so I can issue credentials for the candidate and literally give them the keys and let them drive instead of me.