The current status update says that it's now resolved:
>Update June 14th, 2014, 10:20 AM PDT: Stability on the DreamObjects cluster has been restored. Requests appear to be resolving properly now that the system has had time to re-balance itself. Our test are reporting properly now. If you do have any questions or concerns, please contact support.
I'm quite interested to know what went on in particular, as I'm far more interested in Ceph than in commercial object stores that I can't extend. Librados is pretty damn awesome too, and I can foresee implementing some highly distributed storage through that directly.
With DreamObjects, it sounds like some API servers went down, and failure happened such that it couldn't serve some requests until the appropriate nodes came back.
It appears that with Ceph it will be easy to keep enough replicas such that data is not lost, but high availability is still being hashed out. Hopefully the lessons from this failure guarantee that this particular failure mode doesn't happen again.
Status update says they're running Ceph, an S3 work-alike. I don't want to say "clone" because nobody outside Amazon really knows how S3 works, and Ceph merely has an external HTTP API that supposedly works the same way. Is anybody running Ceph at scale and might be able to comment on what broke down?
I'm interested in Ceph-in-production stories as well. I subscribe to this feed http://www.sebastien-han.fr/blog/ which follows Ceph development and have been toying with the idea of testing it when time permits. Currently we use local storage for VM hosts provided by LVM but something distributed and designed with HA in mind would be helpful.
> I don't want to say "clone" because nobody outside Amazon really knows how S3 works
You can deduce quite a bit about S3's system architecture from the kind of time and consistency guarantees it provides, the optimal and pessimal cases for bucket key naming, etc. Riak CS hits pretty much all those same notes, so I would hazard that you can get a good high-level sense of S3's architecture by looking at Riak's.
For people who are actually on a budget for their projects, which is more than you might think, if the reliability is OK, this might make a lot of sense over paying much more to use the real S3 for backups.
The name of the service is DreamObjects. No affiliation to Dreamhost but doing something as basic as storing files as a service hardly deserves to be called a clone.
It's not just storing files as a service in some generic way. It's storing objects using exactly S3's API (or at least a subset) so e.g. programs and libraries written to use S3 can use DreamObjects instead by just changing the base URL. That's a clone of S3 as much as Linux was a clone of UNIX, and as much of a clone as there can be when the original is closed source.
Disclaimer: I'm on the GlusterFS team at Red Hat, which puts me pretty close to the Ceph team but not actually one of them. I've never had anything to do with Dreamhost.
I've only had very minimal issues with S3... one of the things that's holding me back from switching from Amazon to a cheaper provider is "if it's not broke, don't fix it."
Individual blob access is not very reliable on S3. In a previous company uploading big-ish (1-10000MB) files was part of the service. We would see failing uploads or slow writes all the time. Make sure to queue and wrap your upload script with retries.
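To make the retry advice concrete, here is a minimal sketch of wrapping an upload in retries with exponential backoff and jitter. The names `upload_with_retries`, `max_attempts`, and `base_delay` are hypothetical, and `upload` stands in for whatever function actually pushes the file to S3 or a compatible store; it is assumed to raise `OSError` (which covers `ConnectionError` and `TimeoutError` in Python 3) on transient failures.

```python
import random
import time


def upload_with_retries(upload, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call upload() until it succeeds or max_attempts is exhausted.

    `upload` is a hypothetical zero-argument callable that performs one
    upload attempt and raises OSError on a transient failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except OSError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the last error.
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 50% jitter,
            # so many clients do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

Queuing is left out here; in practice the callable would be pulled from a job queue so a failed upload does not block the ones behind it.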
I wasn't saying it was, merely reporting the facts! S3 targets four nines uptime; which is really good but not infallible. For most apps though, I think it's reasonable to accept that if S3 is offline, the app is offline also.
All of AWS has changed significantly in the past 6 years. I think it's disingenuous to say that an outage from that era, for any provider, influences your thinking nowadays.
I'm not sure S3 has changed nearly as much as the rest of AWS, which is probably why it is so stable. Regardless, the point is that outages can happen to S3 as well, and have; just because it hasn't happened for a few years doesn't mean that 99.99% = 100%.
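For a sense of what the gap between 99.99% and 100% means, here is a quick back-of-the-envelope calculation. The function name `allowed_downtime_minutes` is made up for illustration, and it assumes the availability figure is measured over wall-clock time.

```python
def allowed_downtime_minutes(availability, days=365):
    """Minutes of downtime permitted over `days` at the given availability."""
    return (1 - availability) * days * 24 * 60


# Four nines over a year and over a 30-day month.
print(f"{allowed_downtime_minutes(0.9999):.1f}")      # 52.6
print(f"{allowed_downtime_minutes(0.9999, 30):.2f}")  # 4.32
```

So a four-nines target still leaves room for nearly an hour of outage per year.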