HN Mirror


by hga 3945 days ago

The Danger team did build a great system for its time, but once Microsoft took over, it went downhill fast.

Microsoft managed one of the biggest cloud computing screw-ups in history to date: https://en.wikipedia.org/wiki/2009_Sidekick_data_loss

> The incident caused a public loss of confidence in the concept of cloud computing, which had been plagued by a series of outages and data losses in 2009. It also was problematic for Microsoft, which at the time was trying to convince corporate clients to use its cloud computing services, such as Azure and My Phone.

I've heard good things about Oracle's RAC, but it's understandably intolerant of your screwing up its disks (SAN mis/re-configuring) when you aren't properly maintaining backups. I also heard the consultants you have to hire after you manage such a feat are expensive.

2 comments

rodgerd 3945 days ago

> I've heard good things about Oracle's RAC, but it's understandably intolerant of your screwing up its disks (SAN mis/re-configuring) when you aren't properly maintaining backups

There are a number of problems with RAC, some of which come from people using it wrong, and some of which are inherent to RAC. "Using it wrong" covers things like people not understanding that it runs on shared storage, so it provides compute node resilience, not storage resilience. They should probably spend on some Data Guard (or equivalent) unless they want to be the DBA equivalent of the server admin who thinks you don't need backups because you've got RAID.

The built-in problems come from the fact that Oracle ASM doesn't check[1] the signatures on disks/LUNs presented to it. So if the SAN admin, I don't know, manages to somehow reverse the mappings for one LUN of 30 between the stress RAC and the dev RAC, Oracle will not refuse to start and say "that ASM disk has the stress signature on it". Instead, Oracle will overwrite the stress LUN with dev data for a while, then go to read it, then discover it doesn't have the on-disk structure it expects, then crash with a SEGV or other entertaining but unhelpful error. But only after it's irretrievably corrupted the ASM group, of course.

[1] as of 10g, the last time I hit this problem.
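
To make the missing check concrete, here is a minimal sketch in Python of what a "verify the label before you write" guard could look like. It is not ASM's on-disk format or anything Oracle ships: the header layout and the names `MAGIC`, `stamp`, `read_owner` and `safe_write` are made up for illustration, and the "LUNs" are just in-memory byte arrays.

```python
import struct

MAGIC = b"DEMOSIG1"               # hypothetical marker, NOT ASM's real header
HEADER = struct.Struct("<8s32s")  # magic + owner name (NUL-padded to 32 bytes)


class SignatureMismatch(Exception):
    """Raised when a disk carries a different cluster's label (or none)."""


def stamp(disk: bytearray, owner: str) -> None:
    """Write an ownership label at the start of the simulated disk."""
    HEADER.pack_into(disk, 0, MAGIC, owner.encode("ascii"))


def read_owner(disk: bytearray):
    """Return the owner label, or None if the disk was never stamped."""
    magic, raw = HEADER.unpack_from(disk, 0)
    if magic != MAGIC:
        return None
    return raw.rstrip(b"\0").decode("ascii")


def safe_write(disk: bytearray, expected_owner: str, offset: int, data: bytes) -> None:
    """Refuse to write unless the label matches the cluster doing the write."""
    owner = read_owner(disk)
    if owner != expected_owner:
        raise SignatureMismatch(
            f"disk is labelled {owner!r}, expected {expected_owner!r}"
        )
    if offset < HEADER.size or offset + len(data) > len(disk):
        raise ValueError("write falls outside the data area")
    disk[offset:offset + len(data)] = data


if __name__ == "__main__":
    stress_lun, dev_lun = bytearray(4096), bytearray(4096)
    stamp(stress_lun, "stress")
    stamp(dev_lun, "dev")

    safe_write(dev_lun, "dev", 512, b"dev data")  # correct mapping: accepted
    try:
        # Swapped mapping: the dev cluster is handed the stress LUN.
        safe_write(stress_lun, "dev", 512, b"dev data")
    except SignatureMismatch as err:
        print("refused:", err)
```

The point is only that the failure happens before the first write, so a swapped mapping produces a clear error instead of a corrupted disk group.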

bro-stick 3945 days ago

Yup. I managed some Data Guard (not RAC) instances on AWS for Palm, pre-HP. Thankfully we had DR plans and snapshots to cover our asses.

Edit: Fixed HA techs.

mattzito 3945 days ago

How did you do RAC on AWS without shared storage?

bro-stick 3945 days ago

Thanks, fixed in edit. I was mistaken: it was Data Guard, plus better snapshots using archive log mode.

mattzito 3945 days ago

I wasn't totally skeptical - we did RAC on AWS in tests back in the day using a third node as an iSCSI target, but it was a) sketchy as hell, b) not at all redundant, c) not something I thought Palm would go for.

bro-stick 3945 days ago

It might work on something like OEL with ZFSonLinux using zfs send/recv. Larger implementations might want to investigate DRBD or something like OCFS2, GPFS, AFS or Lustre (none of which probably play well with cloud environments). Maybe Gluster, but with trepidation. (It was an AWS consulting shop with banking / military chops, who could sell ice to enterprise eskimos.)
