| HN Mirror


by sitkack 817 days ago

In the second scenario, they can't do math. They could have bought themselves 6-18 months by getting the most powerful machine available, for probably at most 1-2 salaries' worth of those 40 people.

Less than a single-digit percentage of workloads need massive, hard-to-use horizontal scale-out; the rest can be solved on a single machine or a single database.

MR is useful as an ad hoc scheduler over data. Need to OCR 10k files? MR it.
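To make the "ad hoc scheduler" idea concrete, here is a minimal single-machine sketch of the map/reduce pattern. Everything in it is an assumption for illustration: `fake_ocr` is a hypothetical stand-in for a real OCR call, the file names are made up, and a thread pool stands in for a cluster of workers. It is not how Hadoop or Disco implement it.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def fake_ocr(path):
    """Hypothetical stand-in for a real OCR call: returns word counts for one file."""
    text = f"scanned text of {path}"  # made-up content, no real OCR happens here
    return Counter(text.split())


def merge_counts(acc, part):
    """Reduce step: fold one file's word counts into the running total."""
    acc.update(part)
    return acc


def map_reduce(inputs, mapper, reducer, initial, workers=8):
    """Run mapper over inputs on a worker pool, then fold the results with reducer."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(mapper, inputs)  # the pool schedules the work items
        return reduce(reducer, mapped, initial)


# 10k hypothetical file names, matching the "OCR 10k files" example.
files = [f"doc_{i:05d}.png" for i in range(10_000)]
totals = map_reduce(files, fake_ocr, merge_counts, Counter())
print(totals["scanned"])
```

Real OCR is CPU-bound, so in practice the pool would more likely be processes or machines than threads; the shape of the job stays the same.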

Hadoop was the worst possible implementation of MR; it wasted so much of everything. That was its primary strength.

2 comments

hinkley 817 days ago

Very early on in my enterprise career, in a continuance of a discussion where it was mentioned that our customer was contemplating a terabyte disk array (that would fill an entire server rack, so very fucking early), I learned about the great-grandfather of NVMe drives: battery-backed RAM disks that cost $40k, inflation adjusted.

“Why on earth would you spend the cost of a brand new sedan on a drive like this?” I asked. Answer: to put the Oracle or DB2 WAL data on so you could vertically scale your database just that much higher while you tried to solve the throughput problems you were having another way. It was either the bargaining phase of loss or a Hail Mary you could throw in to help a behind-schedule rearchitecture. Last resort vertical scaling.


PaulHoule 817 days ago

Reminds me of when I had a 3-machine Hadoop cluster in my home lab and 2 nodes were turned off, but I was submitting jobs to it and getting results just fine.

I remember all the people pushing erasure code based distributed file systems pointing out how crazy it is to have three copies of something but Hadoop could run in a degraded condition without degraded performance.


sitkack 817 days ago

I agree. I used Disco MR to do amazing things. Trivial to use, like anyone could be productive in under an hour.

Erasure codes are awesome, but so is just having 3 copies. When you have skin in the game, simplicity is the most important driver of good outcomes. Look at the dimensions that Netezza optimized: they saw a technological window and they took it. Right now we have workstations that can push 100GB/s from flash. We are talking about being able to sort 1TB of data in 20 seconds (from flash); the same machine could do it from RAM in 10.
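A back-of-envelope check of the flash figure, as a sketch only. The assumption here (mine, not stated in the comment) is that the sort is bandwidth-bound and needs one full read pass plus one full write pass at the quoted 100GB/s, with 1TB taken as 1000GB.

```python
def pass_seconds(data_gb, bandwidth_gb_s):
    """Time for one sequential pass over the data at a given bandwidth."""
    return data_gb / bandwidth_gb_s


DATA_GB = 1000      # 1TB, decimal units
FLASH_GB_S = 100    # flash bandwidth quoted in the comment

read_s = pass_seconds(DATA_GB, FLASH_GB_S)   # read every record once
write_s = pass_seconds(DATA_GB, FLASH_GB_S)  # write the sorted output once
flash_sort_s = read_s + write_s

print(f"read {read_s:.0f}s + write {write_s:.0f}s = {flash_sort_s:.0f}s")
```

Under that assumption the 20-second figure falls out directly. A sort that needs extra merge passes, or hardware whose read and write bandwidths differ, would land somewhere else.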

https://github.com/discoproject/disco

I need to give Ray and Dask a try.

I don't know where to put this comment so I'll put it here. DeWitt and Stonebraker are right, but also wrong. Everyone is talking past each other there. Both are geniuses, but this essay wasn't super strong.

If I was their editor, I would say, reframe it as MapReduce is an implementation detail, we also need these other things for this to be usable by the masses. Their point about indexes proves my point about talking past each other. If you are scanning the data basically once, building an index is a waste.
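The index point can be put as a toy cost model. This is a deliberately crude sketch under my own assumptions: cost is counted in records touched, building an index takes one full pass, and each indexed lookup touches one record. Real systems also pay for sorting, storage, and index maintenance.

```python
def scan_cost(n_records, n_queries):
    """Records touched when every query is answered by a full scan."""
    return n_records * n_queries


def index_cost(n_records, n_queries, lookup_cost=1):
    """Records touched when we build an index once (one full pass), then look up."""
    build = n_records
    return build + n_queries * lookup_cost


N = 1_000_000  # hypothetical dataset size

# Scan the data basically once: the index build alone already costs a full pass.
print(scan_cost(N, 1), index_cost(N, 1))

# Query the same data many times: the index pays for itself.
print(scan_cost(N, 100), index_cost(N, 100))
```

With a single pass over the data the index buys nothing, which is the batch case MapReduce was aimed at; with repeated queries the database side of the argument wins.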
