by moreira 2615 days ago
A Slack team is a self-contained unit, however, if I understand correctly. It -should- be easily horizontally scalable (please correct me if I'm wrong). Each team could have its own database, its own app servers running on whatever region(s) was/were needed. So it's not like they would have some mammoth central database that requires strong scale engineering. Furthermore, you know in advance how big each team is because they all pay you for X users, so you can allocate resources to them appropriately.

Lyft's AWS bill (from their S-1) is much higher, but their application has very different scaling constraints from something like Slack; it's not as easily horizontally scalable. Even though their bill is high, I suppose it can be hand-waved away as "oh scaling's expensive".

And a lot of the redundancy/security comes built into AWS services. S3 has redundancy built in, there are Multi-AZ RDS instances with easy support for at-rest encryption, and there's container orchestration these days for easily handling app server redundancy and worker servers. So a company starting out, like Slack, just a few years ago, would have access to all of that without much additional overhead.

I'm seriously fascinated by what it is that makes it so expensive. I suppose the real explanation might just be that there's no incentive to optimise for costs. It's like Slack's own app: A native app -could- be built that is super efficient and light, but there's no incentive to optimise for that.

1 comments

Small Slack teams are easily horizontally scalable; for a small team, the web server, the app server, and the db could probably run on a single EC2 instance, and AWS offers some rather large instance sizes.

Let's start there, though. 70k stand-alone (paid!) Slack teams means 70k stand-alone systems. How do you operate, well, all of them, simultaneously? With one mammoth central database, there's one database to upgrade; if it goes down, there's one database to fix. With 70k small databases there are 70,000 problems! With 70,000 systems, how do your engineers deploy code, and how many times per day can they do it (it had better be well into the double digits)? How do you roll them back? What do you do if an upgrade goes wrong? With 70k different apps, one small problem quickly becomes 70k small problems, which is harder to manage than one. Some things can be (and I'm sure are) scaled horizontally, but the isolation that grants you does not come for free.

And then, what about past that? Looking at the customers listed on Slack.com, they serve some larger enterprises, who are going to need the "expensive" level of scaling. No database is going to be able to scale to that level without a team to manage it (no matter the technology), so then you need a queue as well as a db, plus a team to manage each of those, and then how do you do searching/indexing? You also can't ever take a single database node offline, so then it's a database cluster, with hot spares. Large enterprises also operate globally, so then their Slack team system needs to run multi-region hot as well, and then and then and then? I've got Slack open all the time on both my (work) phone and my (work) laptop, as do the majority of my coworkers, which means their webservers have heavier requirements compared to Lyft, which I use for a few minutes whenever I take a ride.

Slack usage will hit a lull outside of business hours, so you'd want to scale resources down to match. I'll bet a not-insignificant portion of the $4M/month goes to resources that are only used during the business day, so in some sense, Slack is paying AWS a premium to not pay them for unneeded resources at 3:30 AM.

Slack's optimized their app for development cost (much to my laptop's sadness), so it doesn't seem that far-fetched that Slack has also done some optimization of server-side costs. Future money isn't worth as much as money today, and this fact is reflected in AWS RI offerings.

>so in some sense, Slack is paying AWS a premium to not pay them for unneeded resources at 3:30 AM.

If you're running a server for 1/3 of the day (8 hours), you're probably better off using a reserved instance (60% discount with a 3-year reservation) than trying to optimize around on-demand instances (66% discount with perfect allocation, ignoring the engineering cost). The economics are even worse if you consider imperfect allocation, or consider self-hosting (probably cheaper if you're as big as Slack).
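To make that arithmetic concrete, here's a rough sketch in Python. It only uses the numbers from the comment above (8 hours of use, a 60% discount for the 3-year reservation). Everything else is an assumption for illustration: costs are normalized so that one on-demand instance running 24 hours costs 1.0, engineering cost is ignored, and the function names are made up.

```python
HOURS_PER_DAY = 24


def reserved_daily_cost(discount: float) -> float:
    """Daily cost of an always-on reserved instance.

    Normalized so one on-demand instance running 24h costs 1.0 (assumption).
    """
    return 1.0 - discount


def on_demand_daily_cost(hours_billed: float) -> float:
    """Daily cost of an on-demand instance billed only for the hours it runs."""
    return hours_billed / HOURS_PER_DAY


def break_even_hours(discount: float) -> float:
    """Billed on-demand hours per day at which both options cost the same."""
    return HOURS_PER_DAY * (1.0 - discount)


if __name__ == "__main__":
    discount = 0.60  # figure quoted in the comment above
    print(f"reserved, always on:    {reserved_daily_cost(discount):.3f}")
    print(f"on-demand, perfect 8h:  {on_demand_daily_cost(8):.3f}")
    print(f"break-even billed hours: {break_even_hours(discount):.1f}")
```

With perfect allocation, on-demand is still slightly cheaper (about 0.33 vs 0.40 of the full on-demand price), but the margin is thin: under these assumptions, once scale-up/scale-down slack pushes billed time past roughly 9.6 hours a day, the reservation wins, and that's before counting the engineering effort.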