by LamaOfRuin 4052 days ago
The idea that Google was industry leading on non-batch loads in 2013 seems wrong to me. They were not selling those services then, so they did not have a positive profit motive to optimize that usage (only a motivation to cut costs, which I'm told is not nearly as effective). Amazon has had that motivation (and necessity with their non-existent margins in every other part of their business) for long enough to actually accomplish something.

At Google's scale, one doesn't need a lot of incentive to improve utilization. Every IT shop has wanted the cost reduction of improved utilization since the dawn of the PC era.

The difference is in process. Google's approach to workload placement is automated by software, driven by engineering decisions and data.

Many IT shops' placement is political (new servers = new capital = power).

At Google's scale you need much more incentive to get anything done. This is even more true when it is something that will touch every division, product, and service.

What every IT shop wants doesn't necessarily relate in any straightforward way to what any IT shop invests resources in getting. Every IT shop prioritizes many other things above utilization (and is right to do so).

All decisions, engineering or otherwise, are political. Different environments involve different politics, but it's all still politics.

All decisions are political (i.e., about power interests), but not all organizations are configured to be primarily driven by power. This is especially true for young organizations, or those that have gone through a cycle of renewal.

Google decided early on to drive towards an operational architecture that allows individuals to act at scale on their infrastructure. When a developer deploys into production, the deploy launches thousands of new containers and disposes of thousands of old ones. When a batch job is run, the same thing happens. Deploying services is uniform across the board. Thus, optimizing utilization through improved container scheduling is something that the core site reliability engineering team could do independently of individual services.
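
To make the scheduling point concrete, here is a minimal sketch of why uniform containers let a central team raise utilization without touching any service. It is a toy first-fit-decreasing bin packer, not a description of Google's actual scheduler; the `Machine` class, the function names, and the CPU numbers are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """A hypothetical machine with a single resource dimension (CPU cores)."""
    capacity: int
    used: int = 0
    containers: list = field(default_factory=list)

    def fits(self, demand: int) -> bool:
        return self.used + demand <= self.capacity

    def place(self, name: str, demand: int) -> None:
        self.used += demand
        self.containers.append(name)


def first_fit_decreasing(demands: dict, capacity: int) -> list:
    """Place containers (name -> CPU demand) largest first, each on the
    first machine with room, opening a new machine only when none fits."""
    machines = []
    ordered = sorted(demands.items(), key=lambda kv: kv[1], reverse=True)
    for name, demand in ordered:
        if demand > capacity:
            raise ValueError(f"{name} needs {demand}, machine has {capacity}")
        target = next((m for m in machines if m.fits(demand)), None)
        if target is None:
            target = Machine(capacity)
            machines.append(target)
        target.place(name, demand)
    return machines


def utilization(machines: list) -> float:
    """Fraction of total machine capacity that is actually in use."""
    total = sum(m.capacity for m in machines)
    return sum(m.used for m in machines) / total if total else 0.0


# Illustrative workload: service and batch containers look the same to the packer.
demands = {"web": 5, "batch-a": 3, "batch-b": 4, "cache": 4}
machines = first_fit_decreasing(demands, capacity=8)
```

With these made-up numbers, the four containers end up on two fully used machines instead of four half-empty ones. The packer never needs to know what a container does, which is the sense in which scheduling can be improved independently of the services themselves.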

Google's early adoption of data-center-sized computing by Hölzle & team was unusual, as were Amazon's CEO-diktat move to a decentralized service-oriented architecture and Netflix's rewrite and move to the cloud. That is why you get articles like this one, written by a VC, that try to repackage this thinking and sell it back to old-school IT.

> Thus, optimizing utilization through improved container scheduling is something that the core site reliability engineering team could do independently of individual services.

But is that something it is known they prioritized, or was there perhaps more interest in optimizing the efficiency of deploying thousands of containers on every deploy, across data centers, with reliable testing, without killing in flight processing, and scaling for subsecond response to bursty demand? Who sets the priorities for what is most important, and how much of one they're willing to sacrifice to improve physical utilization?

I have absolutely no doubt they had as many resources as any other company dedicated to finely tuning their data centers and related infrastructure. I question whether they had the same motivation as a company like Amazon (who was deriving direct profit from selling this resource) to prioritize the optimization of utilization.