by eks 3082 days ago
It's scary. Why not just revert to the old engine when things started to fall apart? Having the exchange down for that long is bound to cause a big backlash when it comes back online.

They should have put a Kraken 2.0 trade engine alongside the first one, and moved people gradually there. It doesn't matter how confident they were with the upgrade before it happened, it's crypto, everything is new. A few lines of wrong code and you can lock millions of dollars in multi-sig wallets.
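To make the "move people gradually" idea concrete, here is a minimal sketch of percentage-based routing between two engines running side by side. Everything in it is an assumption for illustration: the names `engine_v1`, `engine_v2`, `rollout_bucket` and `choose_engine` are hypothetical, and nothing here reflects how Kraken actually routes traffic.

```python
import hashlib


def rollout_bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets).

    A hash is used so the same user always lands in the same bucket,
    no matter which server handles the request.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def choose_engine(user_id: str, rollout_percent: int) -> str:
    """Pick the (hypothetical) engine that should serve this user.

    rollout_percent is the share of users sent to the new engine:
    0 sends everyone to the old one, 100 sends everyone to the new one.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    if rollout_bucket(user_id) < rollout_percent:
        return "engine_v2"
    return "engine_v1"
```

Because the bucket is derived from the user ID, raising the percentage only ever moves more users onto the new engine, and setting it back to 0 acts as the rollback switch. A real exchange would also have to keep balances and order books consistent across both engines, which this sketch does not attempt.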

I have most of my funds there because their EUR SEPA transfers worked very well. I really hope they can get back in shape once it comes back online.

3 comments

Their downtime and unusable site were already so bad that the current outage is almost as inconvenient as the working site was before. Now at least they are fixing it. They held back the update for months due to testing. At some point they have to take the jump, and after x hours of downtime the damage is done, so fixing it once and for all instead of rolling back might actually be the better solution.
I noticed the "upgrade is coming" notice on their site last week, but I don't have a Kraken account; I was just interested in learning more about cryptocurrency trading. Now they've made me curious about their platform, from a purely technical perspective.

As you say, why didn't they just revert? Are they not able to? What are the steps in their system upgrade? Are they moving to new hardware? What's their setup like? What software are they running (custom written surely, but what language, which database technologies?)

Incidents like this make me curious, and I would love to read the post mortem on something like this.

Second the notion of a post mortem, but I assume due to their focus on security they value security by obscurity as an additional factor.
Explaining why an upgrade didn't work doesn't compromise security. I'd guess moving the data didn't work, or they corrupted a database and don't know how to repair it. Admitting that you don't have proper data backups in place can be embarrassing, but with a post mortem you can at least get some trust back (like GitLab's incident).

Not explaining why you're offline for 24h doesn't help people trust you.

If they corrupted a live database and are not able to recover it, they are in a world of hurt. While it is bad form, many people keep their coins on the exchanges, and even if the bulk of an individual's coins are offline, they still likely have at least a small amount there for trading.

If a table that connects user accounts to Kraken-owned wallets is corrupted and not recoverable, people will be out millions. For some, that would be the equivalent of your 401k provider issuing a post mortem for losing all of your retirement savings.

If this worst-case scenario happened, they are likely in severe damage control.

The most likely explanation, though, is that things are just taking longer than expected to upgrade what is by all measures likely a very technical and convoluted system.

It boggles my mind that it's in the state it is. Some kid in his basement knows you do staged rollouts and build in parallel, this is seriously the most basic IT knowledge in existence - and yet they still screwed it up, and are still down over 24 hours later.

Whoever their CTO is, is clearly the worst kind of incompetent and the team is the most amateur I have ever seen in my 20 years of Internet systems management.