by oxff 1535 days ago
> Update - We are investigating multiple possible causes, including database changes and code changes.

Sounds like they haven't got the first clue about what is causing it.

I'm not all that surprised. A friend saw a phishing email that was imitating them because they lacked a DMARC record. Sent them explicit instructions on how to fix it by adding a DMARC policy and all they did was create a p=none record that doesn't prevent direct imitation. That's definitely the first step, but eventually you need to turn it up to p=quarantine for it to do you any good and it's been a while (several weeks). Shouldn't have needed a random user to point it out in the first place.
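
To make the p=none versus p=quarantine distinction concrete, here is a minimal sketch that parses a DMARC TXT record and reports whether it actually asks receivers to act on failing mail. The records and the example.com addresses are hypothetical, not CircleCI's real DNS, and the check ignores finer points such as the pct and sp tags.

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Split a DMARC TXT record into its tag=value pairs."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue
        key, value = part.split("=", 1)
        tags[key.strip().lower()] = value.strip()
    return tags


def is_enforced(record: str) -> bool:
    """True only when the policy asks receivers to quarantine or reject."""
    tags = parse_dmarc(record)
    if tags.get("v", "").upper() != "DMARC1":
        return False
    return tags.get("p", "none").lower() in {"quarantine", "reject"}


# Hypothetical records for example.com (assumption, for illustration only).
monitor_only = "v=DMARC1; p=none; rua=mailto:dmarc-reports@example.com"
enforced = "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"

print(is_enforced(monitor_only))  # False: receivers only send reports
print(is_enforced(enforced))      # True: failing mail goes to spam
```

A p=none record still produces aggregate reports, which is why it is usually treated as the monitoring phase before enforcement.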

I just don't have a tremendous amount of confidence that they take their infrastructure seriously at this point.

To be fair, DMARC quarantining is actually a pain in the ass and will likely break things for people outside of engineering or IT. In a growing or big company, there are always more and more legitimate emails from third-party senders added all the time.

I agree that reviewing is the first step, but not everyone needs to take further steps. And I highly doubt CircleCI is unique here. I think it's a massive leap to conclude "lack of confidence in taking their infrastructure seriously" from not knowing the reason why they haven't flipped the switch from none to reject or quarantine.

Technically sophisticated users know that email spoofing is already rampant and to watch for signs of it in their email client. I'm not saying it's not a good idea, but that flipping the switch is not that simple and comes with significant downsides in a company with many services and users.

IMO, going to the next level with DMARC is usually more of a prioritization or cost-benefit decision than a competence one.

Everyone absolutely needs to take the next step. Without it, you're inviting direct phishing against your user base.

For a core devops tool, that's not okay.

I don't disagree with you about the value of its security benefits from a technical perspective. But if you tested this against the top 100 websites to see how many have actually implemented it... well, I'd be curious to see the results.
This may satisfy your curiosity: https://dmarc.org/stats/alexa-top-sites/dmarc/
So they did the thing they were recommended but didn't take some further steps, on this one issue. Clearly that means they are totally incompetent? Even though the people dealing with DMARC issues are probably IT & Marketing, not the DevOps & Engineering people who are running the product.
A p=none record is barely different from not having a record at all...and yes at this point a tech company without an enforced record is a major red flag. It's been a decade since the standard went public, it's required at the federal level already and in many EU countries it's being mandated for businesses in general.

Most 3rd party senders today already insist that you set up DKIM as part of your setup process and if that happens, you're going to pass a DMARC check. It's hard to set up for older companies with thousands of servers in their own data centers that are each individually sending email. Cloud native companies sending their email through a few 3rd parties like Sendgrid/Postmark or a newsletter tool are EASY to set up.
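
As a rough sketch of why DKIM set up through a third party gets you a DMARC pass: DMARC passes when either DKIM or SPF passes and the authenticated domain aligns with the From domain. The code below assumes relaxed alignment and uses a naive "last two labels" rule for the organizational domain (real implementations consult the Public Suffix List). All domains and the function names are hypothetical.

```python
def org_domain(domain: str) -> str:
    """Naive organizational domain: the last two labels (assumption)."""
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def dmarc_passes(from_domain: str,
                 dkim_pass: bool, dkim_domain: str,
                 spf_pass: bool, spf_domain: str) -> bool:
    """DMARC passes if DKIM or SPF both passes and aligns with From."""
    from_org = org_domain(from_domain)
    dkim_aligned = dkim_pass and org_domain(dkim_domain) == from_org
    spf_aligned = spf_pass and org_domain(spf_domain) == from_org
    return dkim_aligned or spf_aligned


# Third-party sender signing with the customer's own domain (d=example.com),
# while SPF authenticates the sender's bounce domain instead.
print(dmarc_passes("example.com", True, "mail.example.com",
                   True, "bounces.example.net"))  # True

# Same sender without custom DKIM: it signs with its own domain,
# so nothing aligns with the From domain.
print(dmarc_passes("example.com", True, "example.net",
                   True, "bounces.example.net"))  # False
```

The second case is the one that tends to break once a policy moves past p=none, which is why the per-sender DKIM setup matters.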

I'm mentioning this on a post about their infrastructure being down for 6 hours because yes, it's related. Email delivery for the primary domain is absolutely an IT, Engineering, Operations and Security problem, not a marketing problem. It goes directly to the application especially when one of the main facets of the application is to send emails about your repos and login credentials.

Blame shifting it to the marketing department does not hold up.

When multiple people are commenting on this post about just how frequently their outages are happening it shows a problem in the overall infrastructure mindset for it to continue. Maybe they know exactly what the problem is and somebody higher up is keeping them from fixing it in order to prioritize other things.

Either way, for a company that's supposed to be providing a core devops function to have outages that frequently as well as making it dead simple to spoof email that looks like it's coming straight from them...it's not a good look.

Just linking the short guide I wrote (mostly for myself) to help with email auth stuff: https://www.uxwizz.com/blog/stop-others-use-your-domain-emai...
It has been going on for almost 6 hours now. It does feel like they haven't got a clue.
I don't think you can say that. They might know what happened, but it could still be hard to recover from.

Catastrophes happen. If you deploy something that has a destructive migration you can't easily roll back without reverting to a backup, and there's a problem that you've not seen in QA, then you're in for a bad time. This is compounded if you also discover your backup process hasn't worked properly for a while. If that happens you're facing some serious downtime, and the dilemma of either trying to fix the problem or trying to roll back to the last working backup.

There's a good reason why grumpy old devs like me insist on writing docs, having playbooks, testing everything including non-code stuff, and we still fear major deploys. I have scars from exactly those sorts of disasters.

Hopefully the devs at Circle get past this with as little stress as possible, and they learn from what went wrong.

You are totally right, catastrophes happen and I also wish that they get past this with as little stress as possible. The whole reason for my assumption was the lack of description in their updates for an incident that has been going on for 6 hours. Maybe a little more detail would give me a hint that everything is under control, but I didn't feel that when I read their updates.
When facing such large scale issues, communicating properly is very hard: Several teams might be investigating several possible root causes in parallel, and you might change your mind over time as to what is the most probable root cause.

So you might end up communicating something ("we think it comes from X, we're fixing it that way"), just to find yourself changing your mind a few minutes later.

Changing your message is usually not well perceived, even though that's actually normal during an investigation.

I would not like to be in charge of the communication. Finding the balance between saying too much or too little is tricky.

It’s probably just because they had the choice of either focusing all their energy on fixing the problem asap or setting aside some of it to write a more detailed description that’s also fit for public consumption. Given the severity, they probably chose the former since whatever descriptive, reassuring description they put out there isn’t going to be actionable anyway.
Hi folks. As the CircleCI CTO, I appreciate your patience here and all the feedback. It's true that we are focused on getting customers moving again over sharing more detailed information, but will aim to do better in providing a bit more in our updates. status.circleci.com provides real time updates for both how we're tackling outages and more detailed incident reports. We will post more information there about this incident once we are on the other side and have comprehensive detail.