by etm117 5747 days ago
The way I read it, the problem was corruption inside the database, and the warm backup was corrupted by the automatic mirroring before they noticed the problem. So by the time the issue was identified, both the PROD and failover instances were busted. To resolve it, it looks like they had to roll back to the last valid full DB backup from Sunday and then apply the log backups iteratively from Sunday onward to catch up the DB before bringing it back online.
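To make that restore sequence concrete, here is a rough sketch in Python of "full backup, then replay the logs in order". It is a toy model, not how any real database does it: the `LogBackup` type, the sequence numbers, and the dict standing in for the database are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LogBackup:
    """Hypothetical log backup: a sequence number plus the changes it carries."""
    seq: int
    changes: Dict[str, int]


def restore(full_backup: Dict[str, int], full_backup_seq: int,
            logs: List[LogBackup]) -> Dict[str, int]:
    """Start from the last valid full backup and replay later logs in order."""
    db = dict(full_backup)  # copy so the backup itself is never modified
    expected = full_backup_seq + 1
    for log in sorted(logs, key=lambda entry: entry.seq):
        if log.seq <= full_backup_seq:
            continue  # already contained in the full backup
        if log.seq != expected:
            # A missing log means everything after it cannot be applied safely.
            raise ValueError(f"gap in log chain: expected {expected}, got {log.seq}")
        db.update(log.changes)
        expected += 1
    return db
```

The point of the gap check is that replay is strictly sequential: one missing or unreadable log backup stops the catch-up at that point, which is part of why this kind of recovery takes so long.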

At my shop we had a similar issue (but at the SAN level, not the DB level) where the corruption was data that exposed a bug in the system. The data was automatically mirrored to the warm standby machine. When PROD crashed, the standby was brought up and immediately crashed too. We had to rebuild from tape backups, which was stupid-slow (trademarked term there ;-). All in all it was a horrible mess that was root-caused to a bug in vendor firmware. Eerily similar to the JPMorgan Chase issue in the OP.


I'm guessing they were using the storage to do the replication, rather than DataGuard to replicate and RMAN to make the initial copy. RMAN checksums the blocks on the way, so it'll tell you off the bat if you have any block-level corruption. There's no way for the storage to do this, because it can't tell a valid Oracle block from any other sort of block. Because DataGuard is Oracle-aware, you always have a valid standby: if the primary datafiles are corrupt, you can still ship the redo logs (which you will be multiplexing too).
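A toy illustration of the difference, in Python. This is not Oracle's block format or RMAN's actual algorithm; the 4-byte CRC32 header and the function names are assumptions made up for the sketch. The idea is only that a copy tool which understands the block layout can reject a bad block, while a byte-level mirror copies it faithfully.

```python
import zlib


def write_block(payload: bytes) -> bytes:
    """Build a block: an assumed 4-byte CRC32 header followed by the payload."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def is_valid(block: bytes) -> bool:
    """Recompute the checksum over the payload and compare it to the header."""
    stored = int.from_bytes(block[:4], "big")
    return zlib.crc32(block[4:]) == stored


def storage_mirror(block: bytes) -> bytes:
    """Format-unaware replication: copies the bytes, corrupt or not."""
    return bytes(block)


def checked_copy(block: bytes) -> bytes:
    """Format-aware copy: refuses to propagate a block that fails its checksum."""
    if not is_valid(block):
        raise ValueError("block failed checksum, not copying")
    return bytes(block)
```

Flip one byte in a block's payload and `storage_mirror` will happily hand the standby the same damaged block, whereas `checked_copy` raises before the corruption spreads.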

I'll also hazard that they did it this way because some "enterprise architects" designed the system - no Oracle DBA would have done it like that for precisely those reasons.

NoSQL absolutely would not help in this case. If you are trading on the web you need the clickstream for the regulators, just like a bank tapes every phone conversation.

If you're keeping ALL of your user profile data in ACID-compliant databases you're probably doing it wrong.

Large modern websites store tons of information about a user that isn't necessary to keep for anything other than data mining, preferences, click/hit tracking, etc. I can't see how such data is important in any way in regards to finances or trades, so I don't understand why it couldn't be stored on a much less resource-intensive database solution.

Moreover, the cascading effect of a database failure is made much worse by putting all your eggies in one basket and depending on this one cluster of databases to keep the whole ship afloat. In a good design scenario, much of the site should still keep operating even if the backend databases are timing out from load. For example, your cache layer (if not expired) should continue serving cached content/logins/etc. This may not be as useful for clients that sign in randomly or throughout the day, but for people who use the site frequently or stay logged in throughout the day their sessions should stay active in this scenario.
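Roughly what I mean by the cache layer carrying on, as a minimal Python sketch. Everything here is hypothetical (`SessionCache`, `BackendTimeout`, `load_session`, the TTL); a real site would use whatever cache it already runs, and the `backend` callable stands in for the database lookup.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class BackendTimeout(Exception):
    """Raised by the (hypothetical) backend when the database does not answer in time."""


class SessionCache:
    """Tiny in-memory cache with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired entries are dropped
            return None
        return value


def load_session(user_id: str, cache: SessionCache,
                 backend: Callable[[str], Any]) -> Any:
    """Serve from cache when possible; only a miss has to touch the backend."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    session = backend(user_id)  # may raise BackendTimeout during an outage
    cache.put(user_id, session)
    return session
```

With the backend down, a user whose session is still cached keeps working, and only users with a missing or expired entry see the failure.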

The content in the user profile which doesn't require ACID compliance could also use caching and NoSQL/MySQL/etc., which would keep the apps working even longer in the event of an outage of a particular piece of technology. Because this technology doesn't carry some of the more complicated requirements of Oracle RAC, it may also be easier to recover/restore old data, again assuming the data doesn't have a particular need for ACID.

"I can't see how such data is important in any way in regards to finances or trades"

Well, umm, it is. If you phone your broker and just chat about your cat, that will be taped too, and the tapes kept for 7 years.

For a shop such as Chase, it sounds kind of simplistic and cheap (though I don't think they bought it cheap :) - only an 8-machine cluster, only one standby, no flashback ...