Hacker News
The $10m Engineering Problem (segment.com)
182 points by fullung 2438 days ago
15 comments

Disclosure: I work on Google Cloud.

Awesome writeup! I’ve seen lots of customers take a similar “let the packets spray” approach on both GCP and AWS. Interestingly, it’s one of the reasons I was so excited for our “ILB as next hop” [1] feature. Routing to the Service within the same Zone, unless there’s a failure, at which point you wish to go elsewhere in the Region, is a common pattern. I’m excited to see where Traffic Director and similar service mesh patterns lead in this space. Having everyone roll this by hand seems needlessly redundant.

As an aside, another personal surprise when I got into an argument over Zone-to-Zone pricing: GCE only charges for one-way while AWS charges for send and receive. We should clearly do better marketing :).

[1] https://cloud.google.com/load-balancing/docs/internal/ilb-ne...

GCP services are always great in marketing booklets but tend to have less-than-stellar reliability, and you gotta read the fine print. A perfect example is ILB, which only supports a max of 250 backends and is basically unusable for multi-regional setups.

340 VMs is maybe, at most, 2 racks of equipment.

So figure 4 racks, 2 racks in each of 2 locations. That's not even 500k in equipment. Telecoms run 'tandem' and that is good enough even for 911 infrastructure.

10gb of decent quality internet at each location is another 2x 5k per month. Power space etc. and remote hands is 2k x 4 racks is 8k per month.

So 500k capex plus 18k per month. And how much are they paying AWS?
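The estimate above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the comment; the amortization period is an assumption that does not appear in the thread, and staffing costs are deliberately left out.

```python
# Figures quoted in the comment above. AMORTIZATION_YEARS is a hypothetical
# hardware lifetime chosen for illustration, not a number from the thread.
CAPEX_USD = 500_000                 # 4 racks of equipment across 2 locations
BANDWIDTH_MONTHLY_USD = 2 * 5_000   # 10 Gb internet at each of 2 locations
HOSTING_MONTHLY_USD = 4 * 2_000     # power, space, remote hands for 4 racks
AMORTIZATION_YEARS = 3


def monthly_opex() -> int:
    """Recurring monthly spend: bandwidth plus hosting."""
    return BANDWIDTH_MONTHLY_USD + HOSTING_MONTHLY_USD


def annual_cost(amortization_years: int = AMORTIZATION_YEARS) -> float:
    """Yearly cost with capex spread evenly over the amortization period."""
    return CAPEX_USD / amortization_years + 12 * monthly_opex()


print(monthly_opex())        # 18000
print(round(annual_cost()))  # 382667
```

Changing the amortization period moves the yearly figure noticeably, so the comparison against a cloud bill depends heavily on how long the hardware is assumed to last.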

I gave up trying to spread the good word... it's like gramps telling the youngins to build a nice house somewhere scenic, good bones, raise a family... instead they go live at a WeWork, and rent Ikea furniture by the minute... the tide may turn as money gets less and less cheap.

Still, the engineers gotta feel good about saving all that money, and the environment may feel good about the energy savings too. I hope they get a nice bonus for rolling up their sleeves.

It's not quite as simple as that. User boulos's comment below adds more to the estimate given and includes things that tech companies these days don't want to contend with / have flexibility towards.

In general, you're correct, but colo skills don't exist among AWS/GCP power users today. Management at tech companies today doesn't even know where to start when hiring for colo skills. So it's all avoided by paying the cloud tax.

Righto, and while you’re setting all of this up (which usually takes months), your business is going to the competition, who just spun up a few nodes with a couple of lines of Terraform (or just autoscaled to demand). Also, a 10G link is hilarious: you will get 30x that on GCP for this many cores.

This is awesome. I'd love to see more blog posts tying business value to engineering problems, finding ways to measure before/after outcomes, and then sharing the engineering details. What a great post.

> Then when a reader connects, instead of connecting directly to the nsqlookupd discovery service, the reader connects to a proxy. The proxy has two jobs. One is to cache lookup requests, but the other is to return only in-zone nsqd instances for zone-aware clients.

> Our forwarders that read from NSQ are then configured as one of these zone-aware clients. We run three copies of the service (one for each zone), and then have each send traffic only to the service in its zone.

Isn't this the default behavior of ELB/NLB to begin with? Why not just configure the zone-aware clients to call zonal LBs, instead of hosting your own LB? Same with Consul. I'm not understanding what benefit Segment gets from using Consul vs. calling EC2 Metadata API to discover the AZ and then calling the appropriate zonal LB endpoint...that's not hard to do and avoids many extra dimensions of operational complexity.

It's also unclear to me how all this migration to intra-AZ routing affects Segment's resilience to AZ outages.

Consul allows transparent failover to be built in easily. So it can prefer your AZ-local service, but if that becomes unavailable, it can fail over to the next-nearest service, be it in a different AZ or an entirely different region. The direct lookup you describe would not be able to handle failover in an intelligent way. Consul can also provide DNS automatically for your services, route based on network tomography, and the latest versions can provide automatic mTLS between services, and descriptive network security rules. Not to mention providing a handy place to store config state and send events.
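The "prefer AZ-local, fail over elsewhere" behavior described above can be sketched without any Consul dependency. This is an illustration of the selection logic only, not Consul's API or algorithm; the `Instance` type and `pick_backends` function are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical record for one registered service instance.
    address: str
    zone: str
    healthy: bool = True


def pick_backends(instances, local_zone):
    """Return healthy in-zone instances; fall back to any healthy instance."""
    healthy = [i for i in instances if i.healthy]
    local = [i for i in healthy if i.zone == local_zone]
    # An empty list is falsy, so this falls back only when no local
    # instance is healthy.
    return local or healthy


fleet = [
    Instance("10.0.1.5", "us-west-2a"),
    Instance("10.0.2.5", "us-west-2b"),
]
print([i.address for i in pick_backends(fleet, "us-west-2a")])  # ['10.0.1.5']
print([i.address for i in pick_backends(fleet, "us-west-2c")])  # ['10.0.1.5', '10.0.2.5']
```

A static lookup of a zonal endpoint gives only the first branch; the fallback is the part that needs live health information.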

Beyond that, ELBs have a significant cost if you are running multiple for each internal service you might have, and the API is slow and cumbersome compared to dealing with Consul's service-centric API. From an operations POV, Consul's ACL system is also a lot more flexible than what AWS IAM can provide. So you can be sure your services are limited in what they can claim to be and what gets set up on their behalf. Whereas if you want to automate creation and configuration of ELBs, you are going to have to either grant more access than you really want or you'll have to abstract that behind another service that you have to write.

As for AZ outages... in practice, a cross-AZ system is often just as vulnerable to problems from the outage of a particular AZ, especially if any autoscaling is involved. AWS's tools around this are severely lacking, despite what they tell us about resiliency best practices. But it all depends on the architecture and mostly the data layer.

> As for AZ outages... in practice, a cross-AZ system is often just as vulnerable to problems from the outage of a particular AZ, especially if any autoscaling is involved.

If a system is not resilient to an outage of a particular AZ, by definition I would not call it a 'cross-AZ system'. Maybe what you have in mind is systems that in practice _think_ they are cross-AZ resilient but are actually not when you look closer?

The EC2 Metadata API isn’t meant for high-throughput calls, so it’s possible to hit rate limits even from moderate polling once you get enough nodes involved.

Why do you have to constantly poll it? It's once on startup to discover the zone it's running in.

The AZ for an instance is fixed. Check it at startup time, and cache it in-memory.
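A minimal sketch of "check it at startup, cache it in memory" might look like the following. The metadata paths are the standard IMDSv2 token and placement endpoints; the function names, the timeout, and the injectable `fetch` parameter (added so the caching can be exercised off EC2) are assumptions made for illustration.

```python
import functools
import urllib.request

IMDS = "http://169.254.169.254/latest"


def _fetch_az_from_imds(timeout: float = 1.0) -> str:
    # IMDSv2: request a short-lived session token, then read the AZ.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    az_req = urllib.request.Request(
        f"{IMDS}/meta-data/placement/availability-zone",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(az_req, timeout=timeout) as resp:
        return resp.read().decode().strip()


@functools.lru_cache(maxsize=None)
def availability_zone(fetch=_fetch_az_from_imds) -> str:
    """Return this instance's AZ, calling `fetch` at most once per process."""
    return fetch()
```

Every later call to `availability_zone()` returns the cached string, so the metadata service sees one lookup per process regardless of request volume.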
Buying rack space at a colo costs money, but if you are spending millions of dollars on AWS you will likely end up spending a few hundred thousand including a salaried sysadmin to manage the hardware.

This does mean increased management complexity, so you have to build out an operations team. The total for salaries will be around 400-600k.

In the end you will have some setup costs and you will have to choose a subset of the features AWS offers, but you'll save millions of dollars per year and have much better performing hardware and much, much more flexibility.

AWS is extremely expensive.

Disclosure: I work on Google Cloud.

The blog post doesn’t make it as direct, but one of their biggest costs was for networking between datacenters (Availability Zones in AWS). Most comparisons for “buy a rack at a colo” assume one colo, and a static fleet of hardware.

If you wanted to compare apples-to-apples, you’d need to have (at least) three nearby colos with enough capacity to handle one going down entirely at peak load (“N+1”). Leased lines in a metro area aren’t actually all that expensive, but like the compute, you also need to purchase that with failure in mind.
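The N+1 requirement has a simple sizing consequence, sketched below. The helper names are hypothetical, and the 340-unit peak is borrowed from the container count mentioned elsewhere in the thread purely as an illustrative number.

```python
def per_site_capacity(peak_load: float, sites: int) -> float:
    """Capacity each site needs so the rest can carry peak if one site fails."""
    if sites < 2:
        raise ValueError("N+1 needs at least two sites")
    return peak_load / (sites - 1)


def overprovision_factor(sites: int) -> float:
    """Total provisioned capacity as a multiple of peak load."""
    if sites < 2:
        raise ValueError("N+1 needs at least two sites")
    return sites / (sites - 1)


print(per_site_capacity(340, 3))  # 170.0
print(overprovision_factor(3))    # 1.5
print(overprovision_factor(2))    # 2.0
```

With three sites the fleet is 1.5x peak; with only two it is 2x, which is one reason a single-colo cost estimate is not directly comparable.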

tl;dr: Maybe, but the analysis needs to assume the same(ish) reliability outcome. Otherwise, they could have avoided lots of cost by just running in a single Zone.

> If you wanted to compare apples-to-apples, you’d need to have (at least) three nearby colos with enough capacity to handle one going down entirely at peak load (“N+1”).

Not true if it's possible to fall back to cloud. That way we can have both high reliability and low cost (other than during an outage or maintenance at the colocation).

Hmm. I read the comment as saying “no cloud, because you’ll save so much by just being on-prem”. And I think an “apples-to-apples” comparison requires an N+1 setup including both compute and networking.

Hybrid could be many different setups, but before their “zonal affinity” change it would actually be worse, right? (Egress over Direct Connect is 4x higher than Zone to Zone, while “internet” egress is 8x). What are you assuming for the balance of Compute and Networking across at least three “sites”?

> Hmm. I read the comment as saying “no cloud, because you’ll save so much by just being on-prem”. And I think an “apples-to-apples” comparison requires an N+1 setup including both compute and networking.

That is a valid interpretation. I just wanted to say that if you need high availability, it might be cheaper to have one colocation site and cloud on standby.

> Hybrid could be many different setups, but before their “zonal affinity” change it would actually be worse, right? (Egress over Direct Connect is 4x higher than Zone to Zone, while “internet” egress is 8x).

Yes, in/out traffic would be one of the more problematic points of such a setup, but there should be some solutions available (BGP?).

> What are you assuming for the balance of Compute and Networking across at least three “sites”?

The least expensive option should be zero compute in the cloud unless there is an issue with the colocation. Depending on the specific scenario, some storage/databases would replicate to the cloud. I don't know how I would set up networking in such a case.

> Least expensive should be zero compute in cloud unless there is issue with collocation.

One more thing: cloud can be great to scale up in peak utility without buying servers that will idle most of the time. It's just that using only cloud might be much more costly, even if it is easier.

Always interesting to see the scale you have to hit before rewriting from one language to another saves money (relative to engineering cost).

With node.js: 800 containers, with each container processing 250 messages per second

With golang: 340 containers, with each container processing 650 messages per second

Say each one of those containers cost $0.02/hr then that's order of $100k/year saved!
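The back-of-the-envelope figure above works out as follows. The $0.02/hr price is the commenter's hypothetical, not a published rate, and the function name is made up for this sketch.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_cost(containers: int, usd_per_hour: float) -> float:
    """Yearly cost of running a fixed number of containers around the clock."""
    return containers * usd_per_hour * HOURS_PER_YEAR


# Aggregate throughput of each fleet, in messages per second.
print(800 * 250)  # 200000
print(340 * 650)  # 221000

node_cost = annual_cost(800, 0.02)
go_cost = annual_cost(340, 0.02)
print(round(node_cost))            # 140160
print(round(go_cost))              # 59568
print(round(node_cost - go_cost))  # 80592
```

So the saving is roughly $80k/year at that price, which matches the "order of $100k" claim, and the smaller Go fleet actually has slightly more aggregate throughput.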

Considering a typical HCoL junior dev costs about $100k/yr, if you can have one junior dev rewrite your entire codebase in a year, you'll break even on cost after 2 years. Considering a senior dev costs 2-3x that amount per year, as soon as you have one of those involved for an entire year (odds are, if it's business-critical software, you will), your break-even point comes out to just under a decade, worst case.

I think that just illustrates how risky rewrites are. Very few companies at that scale can just rewrite everything in that timeframe using that little resources. Many companies don't even have codebases that will survive a decade.

We considered a few options before rewriting it. I've got a draft blog post about the process laying around, but haven't gotten around to getting it over the finish line.

We definitely knew the risk going into it. Fortunately, it only took us 2 months to rewrite it. I think our strategy for the rewrite is directly responsible for the speed at which we rewrote it.

If you have one junior dev rewrite your entire codebase they'll never finish and the quality of the rewrite will be terrible. It's right in the definition of 'junior'.

The fully loaded cost of even the most junior dev in a high cost of living area is going to be well above $100k/year. Gross wages are generally only 1/3 to 1/2 the all-in cost.

Yeah, but a junior from HCoL is not the only option, so why would you consider the most expensive one?

That's all true, of course, assuming it'll take (<>?) a full year to do.

(Not disagreeing, and this is somewhat off the article's topic, but I wanted to offer a counterpoint, as engineers love to find reasons to rebuild in new stuff.)

So you could also argue that, for that savings, it may not be worth it. 100k is less than 1 engineer in most places (especially including TOTAL hire costs like equipment, office space, benefits, etc.).

You can also argue it is much harder to hire golang engineers (or to invest time/money into training engineers in Go), so the savings may not be worth it depending on the time it takes to rewrite, plus the gamble of failure.

Also, you need to ask: could those engineers doing the rewrite have been working on up-sell features or other products that could make more money?

Would the investors (especially new growth-oriented investors) care more about an increase in margins or an even larger multiple in ARR?

This was a great engineering blog post in that it did a good job of describing, in detail, a large overarching problem (excess inter-AZ bandwidth), its impact on margin (20%), and concrete steps to measure the problem and the solution. This is exactly the type of communication we should be able to use as an example of the outsize effect engineering can have on the long-term value of the company -- if there's any way that margin increase can in some way be turned into CAGR, these increased margins could double the company's valuation in 4 years (in an ideal world, of course).
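The "double in 4 years" figure is consistent with compounding 20% annually. A quick check, under the loud assumption that the 20% margin gain translates one-to-one into a 20% annual growth rate (the comment itself flags this as an ideal-world case); the function names are made up for this sketch:

```python
import math


def growth_multiple(annual_rate: float, years: float) -> float:
    """Value multiple after compounding `annual_rate` for `years` years."""
    return (1 + annual_rate) ** years


def years_to_double(annual_rate: float) -> float:
    """Years of compounding at `annual_rate` needed to reach a 2x multiple."""
    return math.log(2) / math.log(1 + annual_rate)


print(round(growth_multiple(0.20, 4), 4))  # 2.0736
print(round(years_to_double(0.20), 1))     # 3.8
```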

Very cool.

It's better to use multiple regions instead of multiple zones in a single region. The costs are very similar (and sometimes even the same) especially with the ridiculous networking fees.

Also object storage is a great way to expand capacity for queueing systems instead of oversized instances. We either write to Kafka or fall back to writing files to S3 across different buckets and providers.

Does anybody else find it funny when companies talk about typically private metrics (like gross margins) very publicly?

This looks as much like a learning problem (how to build systems for the cloud) and a management problem (how to track and establish better quality across the whole product lifecycle) as an engineering problem. Nice that they are still learning their stuff and having fun.

They have a Gross Margin Team??? Wow

Segment sounds like the kinda business that should presumably just serve up 403 errors for all EU traffic. Data laundering analytics to 300 external tools is shitting on the GDPR.

(Remember this when reading the article: all the traffic, all the VMs, all the megadollars spent on AWS here are doing nothing but tunnel (replicate) analytics data to third-parties, all of whom would be perfectly happy to receive it directly. It is the definition of waste.)

Among other reasons to architect it this way, having the client (web browser) connect to each analytics provider directly pushes the work to the least reliable, most network-constrained, and least manageable node in the network. Segment lets you have the client do de minimis work and have the heavy duty transfer (and retries, etc) happen from somewhere in AWS, where they're not connected over a 3G connection. That isn't waste, contingent on the company or the user getting value out of analytics and analytics-driven decisionmaking, which is quite plausible.

Not to mention that many businesses have multiple client platforms (web, iOS, Android, etc), so implementing anything client-side immediately multiplies the dev spend.

A $10m problem is 1,000,000,000× smaller than a $10M problem.

It's a bit silly that moving something from one app to another is so complicated.

>>> As a concrete example: a single Salesforce server supports thousands or millions of users, since each user generates a handful of requests per second. A single Segment container, on the other hand, has to process thousands of messages per second–all of which may come from a single customer.

This sounds like the basic problem with Big Data and selling advertising as a business model ... that eventually even bits aren't free.

I can see how it happens - but I think any business that has as its core "ship everything to our servers in San Francisco" is just badly architected - and if that's your business model, you have a bad business model.

no particular comment on segment but a general thought - perhaps most of the business models today are not very good ones

(I seem to remember a rap lyric startup that spun up a new single-threaded Ruby on Rails instance for the most trivial request increases)