by toast0 2203 days ago
We used Softlayer (rebranded to IBM Cloud, and affected by this) at my last job. For the most part, their service pretty much just works; clearly not today. :)

We had a couple thousand bare metal servers, and barely used any of their API stuff.

As with any facility, there were occasional issues with electrical transfer switches, core router failures, fiber cuts, etc. Stuff happens, but we got pretty good communication, and things got resolved in a reasonable amount of time. Service got noticeably worse after IBM, but we were already planning to move to our acquirer's hosting, because that's what happens when you're acquired. Oh, and their load balancers had garbage uptime.

Bandwidth prices used to be pretty reasonable, but they've adopted AWS-style obscene pricing. At least they still let you use the private network for free (including to other datacenters).


HN ran on a box at Softlayer until early 2018 or so. This makes me think that the title of this post (which was submitted as "IBM Cloud down as well as their status page which looks to be hosted there") could at some point have been "IBM Cloud down as well as their status page which looks to be hosted there as well as the forum where people post these things which also looks to be hosted there".
> HN ran on a box at Softlayer

I've always imagined it as a big tower shoved under someone's desk. The side panel of the case is off because otherwise it overheats. On the screen there's a single maximized window of DrRacket. A Post-it note warns you not to quit or reboot the system.

And there's a switch set to More magic.
Maybe that's the new location :)
If it was a BGP issue, it's not a problem of where the status page is hosted but instead that you just can't get to it via that name no matter where it's hosted, right?
If by "name" you mean hostname, not really. If you have a domain with multiple nameservers in multiple countries on multiple providers, and your site is similarly globally distributed, at least a few people on the internet will always be able to pull up your site. So at least some of your clients will be able to resolve a domain address from at least some of your nameservers and connect to at least some of your web servers. Geo-IP and Anycast are also really useful here.

edit: It's possible that you could take out an entire TLD and make it impossible to resolve domains on that TLD once all the cached records expire. But that kind of targeted attack would not be possible with a BGP error, unless it was a very specifically crafted BGP error happening over a very long period of time (weeks-months-years depending on the record TTLs).

OK, that makes sense, but you'd still have this: if the site is down due to BGP, then so is a status page on the same domain.

I guess I'm just calling out the people who are making fun of them for having their status page dependent on the same hardware it's monitoring when it's not clear that's the case just because they are both down?

I suppose if the status page were on a domain under a different TLD, it would be more reasonable to conclude that.

A status page should be a static site hosted on multiple providers in multiple regions with multiple nameservers. So, Amazon S3 hosted in 2 regions, Azure Storage hosted in 2 different regions, 2 different nameserver providers in 2 different countries using two different backend colo providers. Costs probably <$150/year and that will survive BGP outages, backhaul link outages, hosting provider outages, DNS outages.

I'm not going to make fun of them for their status page being down, but it certainly doesn't reflect well on the brand/products.
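
The redundancy described above (one static status page mirrored across several providers) can be sketched roughly as follows. This is only an illustration under stated assumptions: the mirror URLs are hypothetical placeholders, and `http_probe`, `reachable_mirrors`, and `status_page_is_up` are names made up for this sketch, not anything IBM or SoftLayer is known to use.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, unreachable route, timeout, HTTP error, or malformed URL.
        return False


def reachable_mirrors(
    mirrors: Iterable[str], probe: Callable[[str], bool] = http_probe
) -> List[str]:
    """Return the mirrors that currently answer, in the order given."""
    return [url for url in mirrors if probe(url)]


def status_page_is_up(
    mirrors: Iterable[str], probe: Callable[[str], bool] = http_probe
) -> bool:
    """The status page counts as up if at least one mirror answers."""
    return any(probe(url) for url in mirrors)


# Hypothetical mirrors on independent providers and regions (placeholders).
MIRRORS = [
    "https://status-a.example.com/index.html",
    "https://status-b.example.net/index.html",
]
```

Calling `status_page_is_up(MIRRORS)` from a few vantage points outside the monitored network gives a rough check that no single provider, region, or route takes every copy down at once. The `probe` argument is injectable so the logic can be exercised without network access.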

Where does HN run now?
You guys were one of the best use cases for the SL model, which really hasn't changed in 10+ years. You had very few dependencies on the less-reliable (read: all of them) services inside the SL stack and mostly managed everything on box and in software. In a few POPs you guys were running about 50% of the total SL backbone bandwidth. There were a lot of sad panda hats when you guys started to transition away.
> There were a lot of sad panda hats when you guys started to transition away.

For us as well. It was so nice to have things work one day and the next and the next, although I guess they wouldn't have worked today.

Favorite firefighting moment was when WDC lost half the fiber in ~2014, and we had to move all of our traffic out so that there was capacity. Our guy asked why we had to move, and your guy said something like 'Because if you guys move, we only need one customer to move.' :D

Yeah the move from FreeBSD to Linux wouldn't have been fun for you guys either. And yeah, the WDC POPs were some of the most overbuilt from a bandwidth perspective and that was almost entirely because of you guys. Pretty sure there's a Cisco sales rep enjoying a nice holiday home in Connecticut as a result of the growth you guys did.
Dunno if I should realize who toast0 is, but what service were/are you running?
Probably WhatsApp, they moved from FreeBSD to Linux and were on SL before the Facebook acquisition.
I don't think you are expected to know who I am :) I omitted the service on purpose, but my email is on my profile if you want to know.

Apparently it was enough information for dsmcr to properly ID the service; not enough for nixgeek though, I think.

I won't ask that directly. It just seemed like I was missing something that everyone else knew.
Sorry, didn't mean to put you on blast like that.
Don't worry about it. Not a problem at all. Happy to interact with people on the other side of the tickets ;)
They’re talking about WhatsApp.
I had a high-bandwidth use case that SL filled when I was a kid, but we went through resellers: 10TB/UK2, later 100TB once that was a thing. Then they dropped SL, and every one of SL's products went to AWS-priced levels of insane per-GB bandwidth.

The odd thing is, for half the price I could get SL service w/10TB from a reseller, while at list price I only got 1-2TB bandwidth, and sales absolutely would not budge on that. I wonder why.

Last job for me was also a few thousand bare metal servers at SoftLayer. Acquired and moved to that infrastructure instead. Wonder if it's the same acquisition? :-)