Hacker News
by dunk010 817 days ago
I remember this paper, and was at a company at the time that was one of the first to use MapReduce, so saw this all play out first hand. I appreciated the paper. Then and now developers rush to grab new technologies, especially those that stroke their ego-driven fantasies of “working at scale” without considering their underlying constraints or applicability. At the time this was published every company and startup under the sun was rushing to use MapReduce, most often in places where it wasn’t warranted. I’m glad someone surfaced this paper again; people still need to learn the lessons that it outlines. Microservices and k8s: I’m looking straight at you.
5 comments

Map reduce came to my radar around the same time the Trough of Disillusionment hit for some other things, including design patterns. We still believed in the 8 Fallacies of Distributed Computing back then, before cloud providers came along and started selling Fallacies as a Service.

I can’t wait for that hangover to hit us. It’s likely to be the best one of my career.

“Wow when we get rid of all this on premise stuff and IT and outsource it all to cloud we are going to save so much money…”

LOL

Companies are now paying tens of thousands a month for compute that could run on a 10 year old physical server. They also have no control and don’t even own their data anymore in some cases.

They’re starting to realize it, but now that they’ve divested themselves of in house IT they no longer have the in house expertise to do differently so they are kinda locked in.

I saw this train coming miles away…

Well, allow me to retort. I have done 2 unicorns in the last 10 years and I have 6 new projects going on. None of them would even have been possible without cloud; almost all are under $100 a month on Vercel.

Here's a new one I'd love feedback on: https://github.com/webtimemachine/wtm2/

I think the same lack of discipline that causes costs to overrun in the cloud causes costs to overrun in owned hardware. The only real benefit is that you can depreciate the owned hardware for tax benefit but it won't really make owned hardware help you control costs necessarily. The costs will just surface elsewhere instead.
It’s kinda the same problem, isn’t it? If I had a million dollar budget last year and I don’t have one this year, I’ve lost status. Unless the company is flaming out and then lowering my budget is a good thing for a little bit, until it isn’t.

I’ve had maybe three bosses who gave me the recognition I felt I deserved for saving us huge amounts of time and money by chasing something down that other people couldn’t even picture in their heads.

And every single one of those went on to work for a company I absolutely despise and asked me to follow them. As you can guess I have some feelings about that.

I get kudos for this work when the word comes down that we need to cut costs or else. That is the only time I'll attempt it. If that word hasn't been delivered, then any attempt to manage cost in that way is usually met with resistance and communication about "other priorities".

There are a few places I've worked where cost cutting was valued but they were few and far between really.

Generally I appreciate this post, as yeah the bandwagon effect is real.

I'd characterize mapreduce as a very very specific narrow architectural pattern. Trying to apply it contorts the code you write. I don't see anything remotely like that that's true about Kubernetes or containers (microservices much more so in creating constraints).

We had to reset the Days Since Kubernetes Whinge counter again yesterday: https://news.ycombinator.com/item?id=39868586 . And a couple of people spoke to how you might not need containers, but I still haven't heard anyone say what not having containers could win you. What types of code can you only write without containers? We can convince ourselves that Kubernetes is hard, but lots of people also say it's easy/not bad, so there's some difficulty-factor that's unknown/variable. But I strongly struggle to see a parallel between a strong code architecture choice like MapReduce and a generic platform like Kubernetes or containers.

The platform seems pleasantly neutral in shaping what you do with it, in my view; if that wasn't true it would never have been a success.

> I still haven't heard anyone say what not having containers could win you. What types of code can you only write without containers?

Code with one less layer of abstraction. If that layer is buying you something you need, it's great. But abstraction isn't a positive in & of itself, it's why we get upset about GenericAbstractFactoryBeanSingleton(s).

Containers on Kubernetes are a major part of what allows me to put fewer abstractions into code directly, if only because I can reduce bits that would be built-in or in messy companion scripts into more or less standardised patterns in k8s.
Containers are not code abstraction, they are environment encapsulation.
What would you change about your coding because of deploying in a container? What do you tell your junior & senior engineers to do differently?

I call bull frelling shit. It's not an abstraction. It's a deployment pattern. It doesn't affect how or what we code.

Endless shitty pointless bullshit grousing, over nothing. It doesn't actually matter. It's just hip, you get to feel good, by pretending you are dunking on sheeples. Frivolous counter-cultural motions that people use to make themselves feel advanced & intelligent. But actually, being free of this pattern doesn't really buy you anything new or different. The same code & the same approaches are viable with or without. It just feels good to be shitty to the mainstream.

If there were abstractions maybe there would be some justification for the slippery slope "oh no!" panic. And K8s is definitely some kind of abstraction. But the sloppy slippery slope "oh no abstractions" shit still pulls no weight for me, fails to acknowledge that sometimes abstraction can & is useful, allows good things.

> What do you tell your junior & senior engineers to do differently?

I tell junior & senior engineers to be careful about what assumptions they make of their environments, and to consider everything they depend on a potential liability.

> frelling

damn I had to dust off some cobwebs for that one

> What types of code can you only write without containers

Lots. 3D, DAW, GUI IDEs, Facetime/MSTeams/Zoom, etc.

Yep, also GraphQL and all those data lakes.

That reminds me "XML is the future": https://www.bitecode.dev/p/hype-cycles

Data lakes are still the standard in the industry.

And that won't change so long as object stores like S3 are so much cheaper.

And what exactly is wrong with GraphQL ? It is better than REST for a number of use cases.

> And what exactly is wrong with GraphQL ? It is better than REST for a number of use cases.

When a response is a representation of a single resource, that response has a definitive caching lifetime. When a response represents a synthesis of a bunch of random crap the user asked for, the response has no clear caching lifetime, and so is uncacheable.

Most of the scalability the web has achieved, rides on the back of response caching at one level or another. Even low-level business-backend vendor APIs can—and are often designed under the assumption and requirement of—being cached. GraphQL throws that property away.

GraphQL seems like a great idea if you're a massive social media company looking for ways to empower devs to stamp out flavor of the week social media widgets without being tied to existing APIs. I'd bet most companies doing GraphQL are just doing uncacheable things that REST APIs could do fine.
Almost all graphQL clients implement a custom caching layer (for example using indexeddb in the browser) that can cache resources with different timelines, also when they’re returned in a single response.
Yeah, but that's client-side caching, which doesn't get you much in terms of scalability. Scalability comes from transparent backend reverse-proxy caching — Varnish and the like. You can cache the resources that go into a GraphQL response, but you still have to build the GraphQL response each time — and that can become a problem, as various should-be-trivial parts of each request's backend lifecycle start living long enough during high concurrency request loads to destroy request-queue clearance rates.
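A minimal sketch of the reverse-proxy point above, assuming a toy in-memory cache keyed by request (the `TTLCache` class, the key functions, and the sample queries are all hypothetical, not Varnish's or any GraphQL server's API). It only illustrates the cache-key side of the argument: three clients fetching one REST resource share a single entry, while three differently shaped GraphQL queries touching the same user each miss.

```python
import hashlib


class TTLCache:
    """Toy in-memory stand-in for a reverse-proxy cache (hypothetical, not Varnish's API)."""

    def __init__(self):
        self._store = {}  # cache key -> (expiry time, body)
        self.hits = 0
        self.misses = 0

    def fetch(self, key, ttl, now, build):
        """Return the cached body for key, or call build() and cache it for ttl seconds."""
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        body = build()
        self._store[key] = (now + ttl, body)
        return body


def rest_key(path):
    # One resource, one URL: every client asking for it shares a key.
    return "GET " + path


def graphql_key(query):
    # The query text decides the response, so the key has to cover all of it.
    return "POST /graphql " + hashlib.sha256(query.encode("utf-8")).hexdigest()


def demo():
    rest = TTLCache()
    for _ in range(3):
        rest.fetch(rest_key("/users/1"), ttl=60, now=0, build=lambda: {"id": 1})

    gql = TTLCache()
    queries = [
        "{ user(id: 1) { name } }",
        "{ user(id: 1) { name posts { title } } }",
        "{ user(id: 1) { name friends { name } } }",
    ]
    for query in queries:
        gql.fetch(graphql_key(query), ttl=60, now=0, build=lambda: {"data": {}})

    # (hits, misses) for the REST cache, then for the GraphQL cache
    return (rest.hits, rest.misses), (gql.hits, gql.misses)
```

Calling `demo()` gives two hits and one miss on the REST side and three misses on the GraphQL side. Real deployments can narrow the gap with persisted queries or per-field caching, which this sketch does not model.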
Yes in the same way micro services are a standard in the industry.

What's wrong with GraphQL is what's wrong with k8s: people use it in the wrong context, and the right context is not most projects.

The future was better, but also worse: JSON.
Kubernetes is not like MapReduce. It does not need microservices at all. It is a scheduling and deployment framework, which you will implement yourself anyway (hopefully you do) or you use a PaaS. It’s not even that hard to work with. Of course it is complex, but a lot of these tools are, even the lower-level ones like Terraform.
Between monoliths and microservices you have services and sidecars. If you don’t at least have sidecars I really don’t see the point of kubernetes, because most of the rest of the services will follow Conway’s Law and can reasonably do their own thing for less than 125% of the cost of full bin packing.
Things like CertManager and ExternalDNS take away so much operational time dealing with those two things alone. There’s a lot of good infra automation. That being said, I’m a lot bigger on the Fly Machine, Compute-at-Edge trend. If only they had a good IaC solution (after abandoning Terraform and the lack of features in fly.toml).
I've worked on a lot of very large Kubernetes projects and none used sidecars widely.

The major benefit of Kubernetes for them was that you could use lots of cheap, ephemeral cloud instances to run your platform whilst still having high availability. It ended up saving a ridiculous amount of money.

How many services are you running?
Hundreds, in my case. We have a sidecar here or there, but essentially our entire operation is run in distroless containers that consume config maps. We have one source of truth for 5 baremetal cloud regions, a number of private on-prem cloud regions, our build and test infrastructure, and nearly everything else: our Argo repo and the auto-generated operator manifests from our operator mono-repo. We have a common client library that abstracts our CRDs into easy-to-consume functions, and in the end using Kubernetes as an API for infrastructure operations does exactly what it should: it allows full consistency and visibility on configuration.
Yeah, so you are not in that in between place I was talking about, but you’re arguing with me anyway.

You do you, I guess.

You don't even understand what k8s does. Trust me, you want something like Nomad or k8s. Otherwise you will write your own simple k8s thing anyway, and it will just be harder to understand since you wrote your own solution. k8s even works at small scale; heck, it didn't even run in really big deployments at first.
Which problem is more serious? 1) your small company has an over-complex system that could have been postgres; 2) your medium-sized company has a postgres that's on fire at the bottom of the ocean every day despite the forty people you hired to stabilize postgres, and your scalable replacement system is still six months away?
#1 is more serious. #2 limits the growth of your already successful company. #1 sinks your struggling small business. You have to be successful to be a victim of your own success, after all. Not to mention the fact that #1 is way more common. Do you know how far Postgres scales? Because it's way past almost any medium-scale business.
Exactly. A lot of us work at #2 so we wish our predecessors saved us our current pain. But if they went that route we wouldn't be employed at that company because it wouldn't exist
Exactly, if a medium-sized company is struggling with Postgres, either they have very niche requirements or the scalability problems are in their own code.
What about #1b: you have an overly-complex "system", but most of that "system" is serverless (i.e. managed architecture that's Somebody Else's Problem), with your own business-logic only being exposed to a rather simple API?

I'm thinking here of engineering teams who, due to worries about scaling their query IOPS, turn not to running a Hadoop cluster, but rather to using something like Google's BigTable.

Sounds like a best practice to me?
Probably 3) the system you overengineered too early solved the wrong problem, and your replacement is six months away, but you've paid for it twice.
I have very rarely seen the second scenario, but the first seems more common.
Isn't the second example representative of all tech debt / neglect ever? If so, it's very common.
In the second scenario, they can't do math. They could have bought themselves 6-18 months by getting the most powerful machine available using probably at most 1-2 salaries worth of those 40 people.

Less than a single-digit percentage of workloads needs massive, hard-to-use horizontal scale-out (as opposed to things that can be solved on a single machine, or a single database).

MR is useful as an adhoc scheduler over data. Need to OCR 10k files, MR it.

Hadoop was the worst possible implementation of MR, wasted so much of everything. That was its primary strength.
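The "ad hoc scheduler over data" use mentioned above can be sketched in a few lines, assuming a thread pool stands in for the cluster and word counting stands in for the per-file work such as OCR. The `map_doc`, `reduce_counts`, and `map_reduce` names and the sample documents are hypothetical; this is not Hadoop's or Disco's API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_doc(doc):
    """Map step: stand-in for per-file work such as OCR (here it just counts words)."""
    _name, text = doc
    return Counter(text.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def map_reduce(docs, workers=4):
    # The pool plays the role of the scheduler: each document is mapped independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_doc, docs))
    # Fold the partial results together, starting from an empty Counter.
    return reduce(reduce_counts, partials, Counter())


# Hypothetical in-memory "files" so the sketch is self-contained.
docs = [
    ("a.txt", "map reduce map"),
    ("b.txt", "reduce it"),
    ("c.txt", "map it"),
]
totals = map_reduce(docs)
```

The point of the pattern is that the map step needs no knowledge of the other files, so the same two functions work whether the pool is four threads or a rack of machines.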

Very early on in my enterprise career, in a continuance of a discussion where it was mentioned that our customer was contemplating a terabyte disk array (that would fill an entire server rack, so very fucking early) I learned about the great grandfather of NVMe drives: battery-backed RAM disks that cost $40k inflation adjusted.

“Why on earth would you spend the cost of a brand new sedan on a drive like this?” I asked. Answer: to put the Oracle or DB2 WAL data on so you could vertically scale your database just that much higher while you tried to solve the throughput problems you were having another way. It was either the bargaining phase of loss or a Hail Mary you could throw in to help a behind-schedule rearchitecture. Last resort vertical scaling.

Reminds me of when I had a 3-machine Hadoop cluster in my home lab and 2 nodes were turned off, but I was submitting jobs and getting results just fine.

I remember all the people pushing erasure code based distributed file systems pointing out how crazy it is to have three copies of something but Hadoop could run in a degraded condition without degraded performance.

I agree. I used Disco MR to do amazing things. Trivial to use, like anyone could be productive in under an hour.

Erasure codes are awesome, but so is just having 3 copies. When you have skin in the game, simplicity is the most important driver of good outcomes. Look at the dimensions that Netezza optimized: they saw a technological window and they took it. Right now we have workstations that can push 100GB/s from flash. We are talking about being able to sort 1TB of data in 20 seconds (from flash); the same machine could do it from RAM in 10.

https://github.com/discoproject/disco

I need to give Ray and Dask a try.
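A back-of-envelope check of the 100GB/s figure above, as a rough sketch. It assumes decimal units and that the sort is purely bandwidth-bound; reading the 20-second number as two full passes over the data (for example, read then write back) is an assumption here, not something the comment states. The `scan_seconds` helper is hypothetical.

```python
def scan_seconds(data_bytes, bandwidth_bytes_per_s, passes=1):
    """Time to stream the data the given number of times at a fixed bandwidth."""
    return passes * data_bytes / bandwidth_bytes_per_s


TB = 10**12  # decimal terabyte (assumption)
GB = 10**9   # decimal gigabyte (assumption)

one_pass = scan_seconds(1 * TB, 100 * GB)             # a single pass over 1TB
two_pass = scan_seconds(1 * TB, 100 * GB, passes=2)   # e.g. read, then write back
```

Under those assumptions one pass takes 10 seconds and two passes take 20, which is at least in the same range as the numbers quoted.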

I don't know where to put this comment so I'll put it here. DeWitt and Stonebraker are right, but also wrong. Everyone is talking past each other there. Both are geniuses, this essay wasn't super strong.

If I was their editor, I would say, reframe it as MapReduce is an implementation detail, we also need these other things for this to be usable by the masses. Their point about indexes proves my point about talking past each other. If you are scanning the data basically once, building an index is a waste.
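The "scan once, so an index is a waste" argument can be put in rough numbers. This is a sketch under a deliberately crude cost model (comparison counts only, a sort-based index, binary-search lookups); `scan_cost` and `index_cost` are hypothetical helpers, not anything from the essay.

```python
import math


def scan_cost(n_rows, n_queries):
    """Rough comparison count: each query reads every row once."""
    return n_rows * n_queries


def index_cost(n_rows, n_queries):
    """Rough comparison count: sort once to build the index, then binary-search per query."""
    build = n_rows * math.log2(n_rows)
    lookups = n_queries * math.log2(n_rows)
    return build + lookups


n = 1_000_000
one_pass = (scan_cost(n, 1), index_cost(n, 1))          # a single full scan vs. build + 1 lookup
many_queries = (scan_cost(n, 100), index_cost(n, 100))  # 100 queries over the same data
```

With one pass over a million rows the plain scan is cheaper by roughly a factor of 20 in this model; with a hundred selective queries over the same data the index wins. Which case a workload falls into is exactly the disagreement in the thread.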

No, plenty of tech debt is caused by over-engineering or pre-maturely optimizing for the wrong thing.

I'm not sure if the second outcome is meant to blame Postgres specifically or under-engineering in general, but neither seems to me like it should be a concern for an early-stage startup.

I generally classify tech debt more as a long todo/wish list that we'll never get a chance to work on rather than a server or service being on fire.
I have found that these fires become uncontrollable because of tech debt. While rarely the spark, it’s a latent fuel source.

It’s like our modern forests; unless something clears out the brush, we see wildfires start from the smallest spark. Once it starts, it’s almost impossible to do anything but try to limit the extent of the disaster.

This was true in 2009. Since then, multiple PostgreSQL-compatible databases have launched.