by ris 2898 days ago
I'm quite tired of everyone wanting to build "large scale systems" and play at being Netflix. The truth of the matter is the vast vast majority of people will never need to do this with their project and instead will just end up making an expensive to maintain mess with way too many moving parts.

At least as important as designing something that can scale up is designing something that can scale down. You don't know when the organization's going to need to deprioritize this project and be able to keep it running without burning a couple of million in resources every year.

See: microservices. (as in, for the problem, not the solution)

12 comments

Exactly. The cost of maintaining a complex distributed architecture can't be overstated. It's frequently viewed in terms of technical tradeoffs, but the real killer is the legion of smart people in devops and systems architecture you'll need to support it.

Over-complicating things is endemic though. An aside to illustrate: our work straddles multiple non-tech industries. There's a common theme in software there: a thin veneer of modern tech companies on an ocean of legacy systems, mostly running off a single PHP server in a backroom somewhere.

Everyone wants to replace them, startups and users alike. But we see time and time again startups being limited by opinionated choices in their architecture, i.e. a focus on fanciness vs. providing the functionality that's needed. Not just distributed systems, but things like teams struggling with React front ends, designing apps where websites will do, custom CSS where a template will do.

It stems from a common misunderstanding. It's not your tech that makes a great product. Your great product is enabled by great tech. SaaS systems that displace legacy enterprise systems do so mostly because of business models and functionality, not amazing technology. Netflix wouldn't need their architecture if they didn't have the users and the content. Etc.

I think there are definitely lessons to be learnt about building for scale yet being nimble at the start. I.e., if you are on GCP/AWS you can build something that costs tens of dollars a month and can be scaled relatively easily to handle millions of customers if such a thing were to happen.

Kind of boils down to simple things. Put assets in S3 or GCS. Keep the DB separate from your app, and if it’s in prod and you have paying clients, preferably have at least 3 replicas, so if one goes down or you need to do some upgrades, everything goes smoothly.

You probably want to dockerize your app so you can deploy the same thing to stage and prod. It’s scary how very few companies have a proper staging environment.

But none of this matters. The first part is dead simple but hard to do: “build something that people want”. Everything else is secondary and useless if you’re building stuff that nobody wants.

>I'm quite tired of everyone wanting to build "large scale systems" and play at being Netflix. The truth of the matter is the vast vast majority of people will never need to do this with their project and instead will just end up making an expensive to maintain mess with way too many moving parts.

Few companies will take a product that actually needs large scale systems and hire someone that has no prior experience.

If you want to actually build large scale systems, you have to start somewhere.

Even if you just want to be an entry-level person on a team that builds large scale systems to learn by experience, they are likely going to ask you questions about that topic.

You may not need that many people to build large scale systems, but you still need a pathway as people leave that particular niche.

Fully agree with you here. I agree with OP of the comment that yes, most startups don't need this. But tons of companies _do_ need to scale. And those are the ones willing to pay someone who knows their shit the big bucks.

No, the company that I work for isn't Netflix, but it still has tons of customers. One of our services regularly pushes past 100k rps, and knowing much of what is covered in this guide has been incredibly helpful over my career.

When I interview people, I put as much if not more focus on being able to come up with a sane design as I do coding. Especially for a senior engineer.

> Few companies will take a product that actually needs large scale systems and hire someone that has no prior experience.

No, I think most people end up hiring those who have experience creating big complicated systems but haven't stuck around long enough for their chickens to come home to roost.

That's an important point, especially with the oft-repeated statistic of 2 years as the average tenure of an engineer.

Of course, averages (even if true) are like stereotypes.

It would be interesting to see the tenure data on the experts (consultants/implementers) of large-scale systems, other than at the iconic ones (e.g. Google, Netflix).

I think it likely that people with large-scale experience who aren't at Google would have lower tenures than average, simply because they're becoming more valuable and most companies don't pay people their replacement wage if they've been there very long.

The effect that you mention is already cited for the trend of lowering average tenure of technical professionals, in general, so, absent specific evidence that this subset's market value differential (market value less existing employers' willingness to keep up) is increasing faster than average, there's no reason to believe that's the reason for a shorter than average tenure.

We don't even know if the tenure is shorter than average.

Regardless, neither the primary motivation for a short tenure, nor even any average would be particularly meaningful with regard to what I believe to be ris's implied accusation:

Absent at least one tenure long enough to see through the consequences of the creation of the large-scale system, such a creator cannot be truly considered experienced with large-scale systems, no matter how many such creations are on the resume (even though the market values/hires the latter).

I think you underestimate how many people work at large companies that have to deal with these problems. Not everyone is in the HN startup bubble working for a company with barely any customers.

I think you highly overestimate it. Most large companies are a series of small groups that act as companies that have nearly trivial (to the point of absurdity) engineering concerns.

Sure, a few groups in each F500 need epic skills, but I think that's a vanishingly small amount of the (actual) work that is being done. The term Enterprise and what it stands for earned its laughable reputation for a reason.

I've worked at smaller companies. Even still, it's not hard to hit relatively large amounts of data, depending on the field.

As an example, any kind of analytics could generate terabytes of data a day... per customer.

A side project I am building will have to handle billions of events per day. Per customer. There are 0 customers (this is for fun, not profit), but as soon as it would hit one customer I would need to consider an approach that scales.
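For a rough sense of what "billions of events per day" means, here is a back-of-the-envelope sketch in Python. The event count and the 1 kB event size are assumed round numbers for illustration, not figures from the project described above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_second(events_per_day: float) -> float:
    """Average sustained rate implied by a daily event count."""
    return events_per_day / SECONDS_PER_DAY


def terabytes_per_day(events_per_day: float, bytes_per_event: float) -> float:
    """Raw daily volume in decimal terabytes (1 TB = 10**12 bytes)."""
    return events_per_day * bytes_per_event / 1e12


if __name__ == "__main__":
    daily_events = 1e9      # assumption: one billion events per day
    event_size = 1000       # assumption: 1 kB per event
    print(f"{events_per_second(daily_events):,.0f} events/s on average")
    print(f"{terabytes_per_day(daily_events, event_size):.1f} TB/day raw")
```

Under those assumptions, one billion events a day averages out to roughly 11.6k events per second and about 1 TB of raw data per day, before compression, replication, or peak-hour bursts are taken into account.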

How many companies have similar requirements?

But that's actually beside the point.

Microservice architecture, or any architecture that focuses on isolated, asynchronous components, adds complexity. Of course.

But it also reduces work in other areas. If you build async, isolated services, you no longer have to deal with catastrophic service failure. Cascading failures go away at the async bound.
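A minimal Python sketch of that async boundary, assuming an in-process `queue.Queue` as a stand-in for a real message broker. All names here (`outbox`, `flaky_downstream`, `handle_sync`, `handle_async`, `drain`) are hypothetical and only illustrate the idea.

```python
import queue


class DownstreamDown(Exception):
    """Raised by the (simulated) downstream service while it is unavailable."""


def flaky_downstream(event):
    # Stand-in for a dependency that is currently failing.
    raise DownstreamDown("downstream is unavailable")


def handle_sync(event):
    # Synchronous coupling: the caller fails whenever the downstream fails.
    flaky_downstream(event)
    return "ok"


# Bounded so that a long outage cannot grow the backlog without limit.
outbox = queue.Queue(maxsize=1000)


def handle_async(event):
    # Async boundary: accept the work, hand it to the queue, return at once.
    try:
        outbox.put_nowait(event)
        return "accepted"
    except queue.Full:
        return "rejected"  # shed load instead of blocking the caller


def drain(process, max_items=100):
    """Consumer side: process queued events; failed ones go back for retry."""
    done = failed = 0
    for _ in range(min(max_items, outbox.qsize())):
        event = outbox.get_nowait()
        try:
            process(event)
            done += 1
        except DownstreamDown:
            outbox.put_nowait(event)  # keep it for a later attempt
            failed += 1
    return done, failed
```

With `handle_sync`, an outage in the downstream surfaces as an exception in the caller. With `handle_async`, the caller keeps answering "accepted" and the failure stays on the consumer side of the queue. The cost is that someone now has to watch the backlog and decide what happens when it fills up.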

For many of us, I imagine we've spent a lot of time fighting fires at organizations where one service going down was a serious problem, causing other services to fail, and setting your infrastructure ablaze. Hence a bias towards solving that problem upfront.

> As an example, any kind of analytics could generate terabytes of data a day... per customer.

Wait, what? I've never worked anywhere where one customer generated terabytes of data per day, and I've worked on very large commercial enterprise software.

The only thing I have experience with that produces anything close to that kind of data per customer is in genetic sequencing, and you only do a customer once. (Even that isn't a TB in its bulkiest, raw data form, and the formats used for cross-customer analysis are orders of magnitude smaller).

> For many of us, I imagine we've spent a lot of time fighting fires at organizations where one service going down was a serious problem, causing other services to fail, and setting your infrastructure ablaze.

The reason so many of us have worked in places like that is that those places 'got stuff done' and survived and grew.

I was very confused by OP as well. I’m thinking maybe by customer he means a business using his analytics service. That would explain why it needs to scale for his first customer, and why one customer could have terabytes of data.

My comment was definitely biased towards big tech companies where I have worked, so not all of the Fortune 500. That being said, I have worked on a 2-person and a 5-person team, each managing hundreds of terabytes of data, and both companies have tens of thousands of engineers.

Both companies are just part of the large tech scene, hence my skepticism about there not being many engineers that have to manage tons and tons of data/distributed systems, as there are probably hundreds of thousands if not millions of engineers that have to think about these problems outside of the two companies I have worked for.

Building a system out of a handful of services, databases, caches, and queues is table stakes for a backend engineering intern, not "epic skills."

And knowing how to architect it so that it actually scales well is beyond most senior engineers. You underestimate the difficulty I think.

It has given me peace of mind at work a couple times when I thought or said to management: “just bring the cluster down to 1 node, you can still support X users, and your server bill will be $500/year”

Still probably designed more system than needed (cluster). But scarier than that was seeing some DC/OS apps with $5,000/month in server costs even without user load.

I don't think there is very much shortage of online tutorials and blogs showing how to create a basic Rails/Python/Node/whatever MVC monolith type web application backed by an RDBMS. Looking back at my career I can find plenty of valid use cases for needing to understand distributed computing for the most un-sexy of computing problems. One example was how to share a student's data in education software across school districts that each require hosting just their data in their own data centers. Another is making some boring CMS application highly available since your customers are big paying Fortune 500s. Knowing this stuff helps IMO.

I'm not sure I agree. The most important thing about designing large scale systems is dividing the total work flow into self contained pieces with easily inspected separation points.

In the grand scheme of things this doesn't have to mean microservices across a million hosts, only that you've decomposed the problem into its elemental parts. Those parts can now be considered separately as their own elements rather than having to contend with the entire architecture in your head when a problem arises.

I never thought about scaling down as a skill until just now. I kind of assumed "scaling up" implied up && down, or maybe "scaling out" implied out && in. Interesting thought.

Being able to whittle down and simplify is an excellent skill to learn as a developer. It's my favorite and the one I find most fun.

It allows everyone to focus on their specific components without leaping ahead in assumptions about how each developer will use each piece in the future. Lots of those kinds of problems are more easily solved in a room together, planned out, and done together. At least, that's what I've learned from how NASA developed their most important, complex parts.

It's very easy to get ahead of oneself. Complexity grows by factors that are incredibly difficult to manage. Being able to simplify down to a context of parts that are moving and parts that are stable is a serene state of coding. Everything flows much easier that way.

There will likely always be bugs and issues, but minimizing them to the smallest number there can be is an ideal value to maintain in software development.

It is indeed interesting to consider things like connection draining and playing nicely with the LB. Even in scenarios where machines are just removed for non-scheduled reasons.
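A rough Python sketch of what connection draining involves on the server side, assuming the load balancer polls a health endpoint and stops routing to a node that returns 503. `DrainableServer` and its methods are hypothetical names, not the API of any particular framework or load balancer.

```python
import threading


class DrainableServer:
    """Tracks in-flight requests so a node can leave the pool gracefully."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False
        self._idle = threading.Event()
        self._idle.set()  # no requests yet, so the server starts idle

    def health_status(self) -> int:
        # Polled by the load balancer; 503 asks it to stop sending traffic.
        return 503 if self._draining else 200

    def begin_request(self) -> bool:
        # Returns False once draining has started. A real server might
        # instead keep serving stragglers for a short grace period.
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def start_draining(self) -> None:
        # A plain attribute write; it does not take the lock.
        self._draining = True

    def wait_until_idle(self, timeout: float) -> bool:
        # True if every in-flight request finished within the timeout.
        return self._idle.wait(timeout)
```

The intended shutdown order is: call `start_draining()` when the node is told to go away (on SIGTERM, for instance), let the health check fail so the LB routes elsewhere, then exit once `wait_until_idle()` returns or the timeout expires. An unscheduled removal skips all of this, which is why clients and the LB still need retries.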

It's odd because there seem to be two conflicting trends. On the one hand, you have people embracing (say) JavaScript as a server platform because it's easy to get something done, and simultaneously have people designing for outlandish scale.

In general, the 'get it done' mentality is the one that makes economic sense, because once you've added together the pile of software that doesn't need to scale to the other piles where this long-view doesn't matter, you have almost everything built. The other piles, for the record, include software:

- that is designed wrong so it needs to be re-written

- is obsoleted by changes in business direction (a project canceled, for example)

- gets replaced by something off-the-shelf or open-source

- is built for a startup that won't survive, or that gets acqui-hired, or that pivots to a wildly different thing

On the other hand, I sometimes see the opposite thing in heavily analytical work, where data science work is done in Python because it's "easy", and then a team of engineers builds a crazily complex pipeline to make the Python perform in some reasonable time frame. (Hi, Spark!). In my workplace, one example allocates bits of a job to roughly 100 machines, moving data to each, in a cloud environment where the data movement overhead is constantly fighting the benefits of distribution.
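The tug-of-war between data movement and distribution can be sketched with a toy cost model in Python. It assumes the work splits evenly, all input data is pushed from a single source over one network link, and every number below is invented for illustration; `job_time` is a hypothetical helper, not a measurement of any real pipeline.

```python
def job_time(n_machines, compute_s, data_gb, link_gb_per_s,
             per_node_overhead_s=0.0):
    """Estimated wall time in seconds for a job spread over n_machines.

    compute_s           total single-machine compute time
    data_gb             input data that must reach the workers
    link_gb_per_s       bandwidth of the source's network link (GB/s)
    per_node_overhead_s scheduling/coordination cost added per machine
    """
    compute = compute_s / n_machines
    if n_machines == 1:
        return compute  # data is assumed to be local already
    transfer = data_gb / link_gb_per_s  # the source link is the bottleneck
    return compute + transfer + per_node_overhead_s * n_machines


if __name__ == "__main__":
    # Assumed: 1 hour of compute, 500 GB of input, a 10 Gb/s (1.25 GB/s) link.
    for n in (1, 10, 100):
        print(n, "machines:", job_time(n, 3600, 500, 1.25), "s")
```

With these made-up inputs, 100 machines finish in about 436 s against 3600 s on one machine: roughly an 8x speedup rather than 100x, with the 400 s transfer dominating the run. A larger single machine that avoids the transfer entirely may be competitive well before the cluster is.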

> then a team of engineers builds a crazily complex pipeline to make the python perform in some reasonable time frame. (Hi, Spark!).

Having seen at least a couple of similar setups, I remain skeptical that this isn't, at its core, just a problem of ignorance of how "big" one can make/get a single server, before even paying a premium.

However, even for the "largest" commodity servers, last I looked, the premium at the highest end (over linear price:performance) was only something like 4x.

There was some relevant discussion of single server versus distributed in subthreads of https://news.ycombinator.com/item?id=17492234 a few days ago.

> In my workplace, one example allocates bits of a job to roughly 100 machines, moving data to each, in a cloud environment where the data movement overhead is constantly fighting the benefits of distribution.

I'm confident that cloud environments contribute to hardware ignorance, since cloud providers offer a very limited choice of options, and I have yet to see anything high end.

This is especially a frustration for me with networking options, where high bandwidth (beyond 10Gb/s on AWS, until recently, and still only 40Gb/s max, AFAIK) is nonexistent and, otherwise, expensive, and low latency options like InfiniBand don't seem to exist, either, even at the now low/obsolete bandwidths of 16 or 32Gb/s.

So true. A transparently priced PaaS service would have fixed this problem long ago. Most of the users can then simply trust the service to automagically scale up if and when required, and pay along the same lines as a custom configured IaaS solution.

AFAIK App Engine is mostly like that. Cloud Functions are too, sort of, but only for the HTTP traffic (not the supporting infra like the DB, etc.).

There is a discrepancy between joestextile.com, which will get its first 500 regular customers in 3 years, and a Unity plugin that needs to scale to 10 MB/s when it gets its first 100 regular customers in a week. Both kinds of projects are pretty common. The first one can be done in a week with RoR and run on a $10/mo hosting solution for years; the other requires pretty much all of the above.

It's worth understanding how to build something compact while having a clear roadmap for growth: knowing where to monitor for bottlenecks, and how and when to tease out functions to mitigate them. In that context, understanding a large system can provide insights. A curious technology manager (who pays the bills) can ask informed questions.

Some of us want to eventually work at companies that do use large scale systems like this, so we like learning about it.