I use Heroku for subscription software services, online retail stores, and a phone ordering system for our staff.
Right now all of our sites are failing with 503 errors. Our store is down and when one of our employees went to take a phone order they got a "Welcome to your new app" message.
I've been a big evangelist of Heroku since we migrated over last year, but I'm getting deeply concerned about the elevated error rate since every minute is costing us money.
M = 1000 in Roman numerals, but the confusing bit is that "MM" is not read like a Roman numeral (2000) but rather as one thousand thousands, or one million.
This came up in a thread a while back (http://news.ycombinator.com/item?id=1483667). Bottom line is that "MM" comes from the banking/finance world. In banking, $3M actually means $3,000 and $3MM means $3,000,000.
My question to you is, could you do better and how much would it cost? If you didn't use Heroku or another cloud provider you would pay a lot more up front to get your applications running. When things go wrong you would have to fix it, which means paying technical staff to be on call. Since you and your company are likely experts in your domain and not in infrastructure then any infrastructure that you built would likely have more downtime than Heroku. You have to debit the cost of Heroku's downtime from the cost of building your own infrastructure.
Disclaimer: I'm not affiliated with Heroku and I don't use their service.
You have to debit the cost of Heroku's downtime from the cost of building your own infrastructure.
That's silly, and also not how it works at all. You're paying PaaS/IaaS companies so that it's their headache, not yours. Once it becomes your headache, they are no longer doing their job, and you are no longer receiving the value you are paying for. You don't debit the cost of their downtime from the cost of building your own infrastructure; you debit the cost of their downtime from your business's revenue and reputation.
Whether or not you could do it better yourself does not excuse the downtime one bit.
Isn't it commonly recognized that it's cheaper to run your own hardware than to pay a cloud provider? It just requires more capital outlay and maintenance.
We're actively sending pay per click traffic to our online store and it's very easy to spend hundreds of dollars. When our traffic converts it's great but it pains me to think I could be sending traffic to a Heroku 503 error page and have zero control over it.
So, "could we do better"? I'm not sure. I'm trying to figure that out. It certainly would not be as easy to use or to deploy as Heroku. But at a minimum I need to get some other hosting option set up that we can switch over to.
I've had clients with sloppy dev teams who decided to change the URL structure of all landing pages without letting me know (I was managing their PPC campaigns). Google stops serving ads after getting 404 errors - unfortunately I don't think they count other errors (like a 503) and they don't stop until they've sent a few hundred (or thousand) clicks.
If you run your own site and things go wrong, you (hopefully) know what you did. When Heroku (or AWS, or anyone) makes a change, they don't consult every customer to find out if now is a good time to go down.
The E&O insurance I looked into getting when I moved into consulting would have covered it -- "lost sales" resulting from a "hardware or software malfunction." I assume if you start making recurring claims on that the insurance company will reevaluate whether they want to continue doing business with you, though.
Don't know specifically about "server failure" insurance, but I assume it exists. There is insurance for practically everything. For example, a friend's uncle builds home security systems, and he is insured in case a home he has secured is broken into anyway.
I've currently got a few small apps on Heroku and am considering moving some larger ones over. But the "Heroku | Welcome to your new app!" is very worrying.
Custom error pages for these kinds of errors would be very useful.
It took Google App Engine two years to add the option to specify a custom error page for server errors and over quota errors. Hopefully that'll come soon for Heroku as well…
It just occurred to me that you know you've made some pretty serious traction as a startup when HN posts about your company no longer have something like "(YC W08)" appended to the end.
I think it has to do with Heroku's target market being so similar (identical?) to HN's demographic. If that weren't the case, I doubt we would all recognize Heroku so easily.
Thanks for pointing that out, because I had completely forgotten that this was the case. (I actually can't remember at all, but I figure that I knew that from when they came out)
They did go a long way in a short period of time. Winter 2008 feels so close.
Yeah, I'm in the same boat as you. I see successful, "big" companies mentioned here with "YC-whatever" on the end and am blown away by which ones are YC alumni!
It all depends on what the SLA says, but hypothetically, if they are down for 24 hours a year, that's 99.7% uptime, which isn't terrible.
Heroku had a 1-2 hour outage the week after we switched an app there last year. My boss was freaking out, cursing about how they were unreliable, etc, neglecting the following:
1. The timing was unfortunate, but that was the first outage in months.
2. We had had multiple outages on our Rackspace box that were our own fault, due to bad server management.
In the long term you're likely better on Heroku, for small companies at least.
If shadowcat's public facing website is down for a day, a few people can't read blog posts and maybe we'll miss out on a potential customer - but our existing customers will be entirely unaffected.
If our ticket tracking system is down for a day, it'll annoy the hell out of the existing customers but we can still get the work done since they all have direct email and IM contact info for people.
On the other hand if our ircd is down for an hour, it's time to panic, because that massively interrupts our ability to co-ordinate our work.
External examples:
If LinkedIn is down for a day, I don't care - anything I do on that can wait until tomorrow.
If DuckDuckGo is down for a day, I am going to burst into tears because I use it all the time for information I want -now- and going via Google is substantially more annoying.
So "anything that matters" is really quite relative.
I just did the calculation. That's about a day of downtime. I'd say it's bad if:
- The downtime is scattered all over the year. 1 hour downtime here, 30 min downtime there.
But not if:
- This 1 day of downtime is scheduled, e.g. during the holidays. Scheduled and planned is the keyword. If the client is informed and aware of it, the client will also remain happy.
You'd be surprised how much downtime clients are willing to put up with, as long as they are informed well ahead of time.
Of course nothing will have 100.0 (repeating)% uptime. But 99.7% uptime means it can be down for over 2 hours every month. Anything less than 99.9% uptime (which means 3x less allowed downtime--a big difference) is probably unacceptable, and if downtime costs you serious money, you're going to want more decimal places.
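For anyone who wants to check the numbers being thrown around in this thread, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and it assumes a 365-day year and twelve equal months, so treat the results as approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted over the period at a given uptime percentage."""
    return period_hours * (100.0 - uptime_percent) / 100.0


def uptime_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Uptime percentage implied by a given number of hours of downtime."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


# Compare a few uptime levels, per year and per (average) month.
for pct in (99.7, 99.9, 99.99):
    per_year = allowed_downtime_hours(pct)
    print(f"{pct}% uptime: {per_year:.1f} h/year, {per_year / 12:.2f} h/month")

# The "24 hours a year" figure mentioned earlier in the thread.
print(f"24 h/year of downtime is {uptime_percent(24):.2f}% uptime")
```

Under those assumptions, 99.7% works out to a bit over 26 hours a year, or a little more than 2 hours a month, and 99.9% allows a third of that.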
Part of my job is network administration of a small (~50 server) colo/hosting service. It's unacceptable for us to be down for even 30 minutes (from our perspective and our clients). We maybe top out at 5 hours of downtime a year (during a bad year) and most of that (unfortunately) is upstream from us.
We've been running on Slicehost for almost 2 years and I believe we've had two outages, one of which wasn't a real outage but a backbone provider went kaplooey in Europe. That can't really be helped.
Heroku, on the other hand, feels like it's up and down more than... something that goes up and down a lot. A friend of mine hosts his blog there, and when he launched a small product today he kept sending his customers to an error page, because Heroku was up, down, up, down, up, down.
If it's a misconfiguration of your own, you can get it fixed. But if your hosting provider has an unsound business, you can't fix that except by leaving.
I have an app running on Heroku. Interestingly, it caches itself using HTML 5 application cache, so most people won't even notice the site is down. Need to make sure the background network ops are fault tolerant though.
That is really awesome. I just got back from playing with it between Chrome on an old 800MHz P-III (very usable) and an Android (Nexus One). On the Nexus, I went off-line (airplane mode), edited, and then went back on-line. MAGIC! My edits showed up in my Chrome browser on the desktop.
My use case is that I want to use Google Docs (or equivalent) to keep notes while on-line and off-line. MotoDiary ain't quite there yet, but it has the hard part (IMHO), the on-line/off-line syncing. What is rough is text size and fixed(?) edit box size on the Android. Also (obviously), it is diary-oriented (single entry per day) rather than supporting multiple documents.
Google Docs are totally uneditable (?WTF!) on Android, never mind doing it off-line and syncing.
There are some Apps that work better, such as GDocs. GDocs has been a mixed bag, it allows me to edit off-line and sync docs, but has been iffy in terms of success rate. It definitely isn't as smooth as my brief experience with MotoDiary.
You have to give Heroku credit for selling major quantities of Kool Aid. They've been pretty flakey for the past couple of months, and people are here claiming that this is the first outage. Someone's even claiming that 99.7% is a good record.
It's worth noting that this was not universal as far as I can tell.
I have 5 minute watchdogs on all of my 3 sites in production with Heroku, and none of them pinged me. Given that I know the watchdogs work (regular testing and previous incidents) I would have to conclude that not everyone was affected.
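For anyone who doesn't have a watchdog yet, here is a rough sketch of what a 5 minute check could look like using only the Python standard library. This is not the setup described above; the function names, the example URL, and the idea of checking for a marker string (to catch a placeholder page that is served instead of your app) are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def check(url, must_contain=None, timeout=10, opener=urllib.request.urlopen):
    """Return (ok, detail) for one URL.

    ok is False on HTTP errors such as 503, on network failures, and
    (if must_contain is given) when the page loads but lacks that text.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return False, f"HTTP {err.code}"
    except OSError as err:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return False, f"unreachable: {err}"
    if must_contain is not None and must_contain not in body:
        return False, f"HTTP {status} but expected text is missing"
    return True, f"HTTP {status}"


def watch(sites, interval_seconds=300, alert=print):
    """Check every site forever; sites maps URL -> marker text (or None)."""
    while True:
        for url, marker in sites.items():
            ok, detail = check(url, must_contain=marker)
            if not ok:
                alert(f"ALERT {url}: {detail}")
        time.sleep(interval_seconds)
```

To use it you would call something like `watch({"https://store.example.com/": "My Store"})` (a hypothetical URL and marker) from a machine that is not hosted with the same provider, and swap `alert` for whatever sends you an email or a text.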
We've been monitoring a Heroku instance for the past 8 months. Our current instance uptime is 99.953% (about 200 minutes of downtime). Of the 76 services we monitor, Heroku is #64.
The magic of cloud computing: As someone running an app on Heroku, I had no idea. Luckily, I simply don't care.
Our app has a cyclic usage pattern and all is quiet right now. So rather than freaking out about it, I'll just let someone at Heroku figure it all out.
It would suck if it happened during our busy period, but then again I could say "We're working on it." and just assume the Heroku team will fix things faster than I ever could have with my limited *nix admin skills.
How exactly is the fact that you didn't know about the outage "the magic of cloud computing"?
I get that you're saying your users don't care/didn't notice, but I'm clearly missing something because if I had an app on Heroku, I'd be a little nervous. When the cyclic nature of your app swings back around and it's in regular use again, this kind of outage might not be so magical.
Well technically, I was informed of it. I got email alerts and stuff, but I was busy doing other things, so I didn't read them.
Users surely noticed, but Heroku definitely noticed before my users did. They're quietly working on a solution and I can quietly go about my day. If my users start complaining, I'll have time to talk to them; time I wouldn't have if I was neck deep in log spew.
Having run apps on my own servers before, I know what a pain in the ass it is to deal with downtime yourself. I'm not particularly good at it, so I appreciate having experts take care of it for me.
Not unique to it, but it is implicit in it. This matters. If you are at the size where you can't have a dedicated staff monitoring your uptime 24/7, then you are at the size where a cloud solution is going to be more responsive than what you can afford.
Sooner or later most people realize that it is not that safe to rely on a specific deployment system you cannot directly control. It can be dangerous to use a full stack that cannot be replaced without a decent amount of effort.
I host an app on there that I've been using all day and I didn't notice it go down. I reckon I've got some kind of unplugged-TV poltergeist action going on.
I haven't seen an explanation for this, but it could be related to the EC2 issues today. I'm a Heroku user. Downtime with any host always seems to come at a bad time; today it hit during a daily client call. However, I'm not concerned about Heroku - yet... I think they have less downtime than I would have doing it myself.