by mozumder 3578 days ago
Couldn't they have just used Postgres on one 4/8-socket server with RAID?

A 60TB dataset can fit in one server. Isn't Spark intended for massive clusters, like dozens or hundreds of servers, over petabytes of data?

2 comments

Note that the 60TB figure is the compressed size.

Mostly it's about speed. With multiple machines you can bring more cores to bear for the processing, and have more RAM to cache partial results. Postgres could certainly do the job, but I'd be surprised if it would run within an order of magnitude of these results.

IO is pretty huge as well. You can spend lots and lots of money to give one machine storage that can read 60 TB fast. Or you can have 100 machines, each with the cheapest possible hard drive, and beat it easily on total IO.
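
To put rough numbers on the IO point, here is a back-of-the-envelope sketch. The throughput figures (2000 MB/s for an expensive single-server array, 100 MB/s for a cheap disk per node) are assumptions picked for illustration, not measurements from the thread, and the helper `scan_hours` is hypothetical. It also ignores network transfer, decompression and stragglers, so treat the result as a best case for both sides.

```python
# Back-of-the-envelope scan-time comparison.
# All throughput figures below are assumed, not measured.
DATASET_TB = 60  # compressed size mentioned in the thread


def scan_hours(dataset_tb: float, machines: int, mb_per_s_per_machine: float) -> float:
    """Hours to read the dataset once, assuming perfectly parallel sequential reads."""
    total_mb = dataset_tb * 1_000_000  # decimal units: 1 TB = 10^6 MB
    aggregate_mb_per_s = machines * mb_per_s_per_machine
    return total_mb / aggregate_mb_per_s / 3600


# Hypothetical expensive storage array on a single server
single = scan_hours(DATASET_TB, machines=1, mb_per_s_per_machine=2000)
# Hypothetical cheap disk on each of 100 nodes
cluster = scan_hours(DATASET_TB, machines=100, mb_per_s_per_machine=100)

print(f"single server: {single:.1f} h, 100-node cluster: {cluster:.1f} h")
```

Under these assumptions the cluster has five times the aggregate read bandwidth even though each individual disk is twenty times slower.
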
Keep in mind that the systems required depend on the actual tasks, not just whether you can fit the data on a disk. I don't think the FB Newsfeed (show more photos from your cousin because you liked their text post last week...) can be built using only SQL.