by georgefclay 5749 days ago
I agree. Also his comment about "over engineering" making the system "more brittle" was odd.

For data that important, I would have mirrored the databases to "warm standby" servers. They could have been back up in minutes with no data loss. Sure, it would have doubled the cost, but how much money did they lose during the outage?

3 comments

You completely failed to read the article.

Otherwise you'd know that they had a fault that propagated to the hot spare. It's also utterly daft to think that a financial enterprise as large as JPM/Chase wouldn't already be running a HA setup. In this case it appears to be Oracle RAC.

I'm astounded how often I have to remind people that replication and backups are very different things, and that you need both.

I'm also depressed how many utterly thoughtless comments are made here on hackernews lately.

No, Oracle RAC shares the same storage between two or more nodes.

What they had here would appear to be database A running on storage A which is replicated at the storage level to storage B where database B waits in an idle state. Because the replication system is "blind" - it only sees its own filesystem containing bytes, not Oracle data structures - it can't tell a good Oracle block from a bad one and copies it.
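To make the "blind" point concrete, here is a rough sketch in Python. The block format (a 4-byte CRC32 header in front of the payload) and every function name are made up for illustration; this is not how Oracle lays out blocks or how any SAN replicates. It only shows why a copier that sees bytes cannot reject a bad block while a format-aware one can.

```python
import zlib


def make_block(payload: bytes) -> bytes:
    """Hypothetical block format: a 4-byte CRC32 header followed by the payload."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def block_is_valid(block: bytes) -> bool:
    """Recompute the checksum and compare it with the stored header."""
    return int.from_bytes(block[:4], "big") == zlib.crc32(block[4:])


def replicate_blind(primary):
    """Storage-level copy: bytes in, bytes out, with no idea what they mean."""
    return list(primary)


def replicate_aware(primary):
    """Format-aware copy: ship only blocks that verify, and report the rest."""
    shipped, corrupt = [], []
    for index, block in enumerate(primary):
        if block_is_valid(block):
            shipped.append(block)
        else:
            corrupt.append(index)
    return shipped, corrupt


good = make_block(b"row-1")
bad = make_block(b"row-2")
bad = bad[:-1] + bytes([bad[-1] ^ 0x01])  # flip one bit to simulate corruption

primary = [good, bad]
standby_blind = replicate_blind(primary)                  # corruption copied as-is
standby_aware, corrupt_blocks = replicate_aware(primary)  # corruption flagged
```

The blind copy ends up byte-for-byte identical to the primary, bad block included. The aware copy at least tells you which block failed verification before it reaches the standby.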

I do this sort of setup for a living and you would be amazed at how many "architects" there are around who have completely drunk the storage vendor kool-aid and don't really understand how anything works (not even storage...).

This is likely the case, since the post mentioned that the storage controller was initially blamed (but cleared).

I rarely work with Oracle, so I'm not very familiar with the product line. Thanks for the correction.

I did read the article that was referenced. I did not read the article that that article referenced. My point was about the comment on "over engineering". This problem was not caused by over engineering.

It says right in the article that was referenced:

Before long, JPMorgan Chase DBAs realized that the Oracle database was corrupted in about 4 files, and the corruption was mirrored on the hot backup.

The way I read it, the problem was corruption inside the database, and the warm backup was corrupted by the automatic mirroring before anyone noticed. So by the time the issue was identified, both the PROD and failover instances were busted. To resolve it, it looks like they had to roll back to the last valid full DB backup from Sunday and then apply the log backups in order from Sunday onward to catch the DB up before bringing it back online.
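As a toy illustration of that recovery path (restore the last good full backup, then replay the log backups oldest first), here is a minimal Python sketch. The dict-as-database, the `(op, key, value)` log format and the account names are all invented for the example; a real Oracle restore is done with its own tooling and looks nothing like this.

```python
def restore(full_backup, log_backups):
    """Start from the full backup, then replay each log backup in order."""
    db = dict(full_backup)  # work on a copy so the backup itself stays intact
    for log in log_backups:  # oldest log backup first
        for op, key, value in log:
            if op == "set":
                db[key] = value
            elif op == "delete":
                db.pop(key, None)
            else:
                raise ValueError(f"unknown log operation: {op!r}")
    return db


# Hypothetical data: Sunday's full backup plus two later log backups.
sunday_full = {"acct-1": 100}
log_backups = [
    [("set", "acct-1", 120)],
    [("set", "acct-2", 50), ("delete", "acct-1", None)],
]

restored = restore(sunday_full, log_backups)
```

The order matters: replaying the logs out of sequence, or skipping one, would leave the database in a state it was never actually in. That is part of why this kind of recovery takes hours rather than minutes.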

At my shop we had a similar issue (but at the SAN level, not the DB level) where the corruption came from data that exposed a bug in the system. The data was automatically mirrored to the warm standby machine. When PROD crashed, the standby was brought up and immediately crashed also. We had to rebuild from tape backups which was stupid-slow (trademarked term there ;-). All in all it was a horrible mess that was root-caused to a bug in vendor firmware. Eerily similar to the JPMorgan Chase issue in the OP.

I'm guessing they were using the storage to do the replication, rather than DataGuard to replicate and RMAN to make the initial copy. RMAN checksums the blocks on the way, so it'll tell you off the bat if you have any block-level corruption. There's no way for the storage to do this, because it can't tell a valid Oracle block from any other sort of block. Because DataGuard is Oracle-aware, you always have a valid standby: if the primary datafiles are corrupt, you can still ship the redo logs (which you will be multiplexing too).

I'll also hazard that they did it this way because some "enterprise architects" designed the system - no Oracle DBA would have done it like that for precisely those reasons.

NoSQL absolutely would not help in this case. If you are trading on the web you need the clickstream for the regulators, just like a bank tapes every phone conversation.

If you're keeping ALL of your user profile data in ACID-compliant databases you're probably doing it wrong.

Large modern websites store tons of information about a user that may not be necessary to keep for anything other than data mining, or perhaps preferences, click/hit tracking, etc. I can't see how such data is important in any way in regards to finances or trades, so I don't understand why it couldn't be kept on a much less resource-intensive database solution.

Moreover, the cascading effect of a database failure is made much worse by putting all your eggies in one basket and depending on this one cluster of databases to keep the whole ship afloat. In a good design scenario, much of the site should still keep operating even if the backend databases are timing out from load. For example, your cache layer (if not expired) should continue serving cached content/logins/etc. This may not be as useful for clients that sign in randomly or throughout the day, but for people who use the site frequently or stay logged in throughout the day their sessions should stay active in this scenario.
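A minimal sketch of that cache-layer idea in Python, under loud assumptions: `CachedReader`, `BackendTimeout` and the `fetch` callback are hypothetical names, and the cache is an in-process dict rather than a real cache tier. Entries that have not expired are served without touching the backend at all, so they keep working while the database is timing out.

```python
import time


class BackendTimeout(Exception):
    """Raised by the (hypothetical) backend call when the database times out."""


class CachedReader:
    """Serve unexpired entries from memory; only go to the backend on a miss."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # callable that reads from the backend
        self._ttl = ttl_seconds  # how long an entry stays fresh
        self._clock = clock      # injectable so the TTL can be tested
        self._cache = {}         # key -> (value, time stored)

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # still fresh: the backend is never contacted
        value = self._fetch(key)  # may raise BackendTimeout during an outage
        self._cache[key] = (value, now)
        return value
```

A user whose session was cached shortly before the outage keeps getting answers until the TTL runs out; someone signing in cold during the outage still hits the backend and sees the timeout, which matches the caveat above about clients that sign in only occasionally.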

The content in the user profile that doesn't require ACID compliance could also use caching and nosql/mysql/etc., which would keep the apps working even longer if a particular piece of technology goes down. Because this technology doesn't carry some of the more complicated requirements of Oracle RAC, it may also be easier to recover/restore old data, again assuming the data has no particular need for ACID.

I can't see how such data is important in any way in regards to finances or trades

Well, umm, it is. If you phone your broker and just chat about your cat, that will be taped too, and the tapes kept for 7 years.

For such a shop as Chase it sounds kind of simplistic and cheap (though I don't think they bought it cheap :) - only an 8-machine cluster, only one standby, no flashback ...

Perhaps you don't understand how RAC works. A RAC cluster is cache-coherent with a shared disk system, in this case an EMC SAN. It's designed to be both scalable and fault tolerant. The replication would have been handled by the SAN itself, at the block level. There would be two completely independent (edit: DISK) cabinets that would replicate synchronously. Some software assumes synchronous replication, and it's cheaper to just spend a ton of money on an expensive replicating SAN and Oracle RAC than it is to rebuild the software, so an async replication scenario is out of the question.

No, no, no. The standby is not open for queries in that scenario. How can it be? It's playing no role in this setup; all the work is being done on the storage, and the standby isn't even aware of it until you try to activate it and it takes ownership of the controlfile.