

by discardorama 4275 days ago

As per the specification of this test, the data has to be committed on disk before it is considered sorted. So even if it all fits in memory, it has to be on disk before the end.

So you have 100TB of disk reads, followed by 100TB of disk writes, all on HDDs. That's about 100GB/node, and since Hadoop nodes are typically in RAID-6, each write has an associated read and write too.

This does not even include the intermediate files, which (depending on how the kernel parameters have been set) could have been written to disk. A typical dirty_background_ratio is 10, so after 6GB of dirty pages, pdflush will kick in and start writing to the spinning disk.
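To make the arithmetic behind that 6GB figure explicit, here is a rough sketch. It assumes about 60GB of memory counted toward the threshold per node, which is only inferred from the numbers above and not a stated spec; the function name is made up for illustration. The kernel applies dirty_background_ratio as a percentage of available (free plus reclaimable) memory, so the real threshold moves around with load.

```python
def background_flush_threshold_gb(available_mem_gb: float,
                                  dirty_background_ratio: int = 10) -> float:
    """Approximate GB of dirty pages at which background writeback starts.

    dirty_background_ratio is a percentage, as in vm.dirty_background_ratio.
    """
    return available_mem_gb * dirty_background_ratio / 100


# 60 GB is an assumption inferred from the 6GB figure, not a measured value.
threshold = background_flush_threshold_gb(60)
print(f"background writeback starts at roughly {threshold:.1f} GB of dirty pages")
```

With a lower ratio the flusher would start earlier, so how much of the intermediate data actually reaches the spinning disk depends heavily on this setting.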

1 comment

rxin 4273 days ago

Yes, but the final data is written sequentially only. We were discussing random access, which only applies to the intermediate shuffle files.

Maybe you can email me offline. I can tell you more about the setup and how Spark / MapReduce works w.r.t. it.
