| HN Mirror



	by mozumder 3578 days ago
	Couldn't they have just used Postgres on one 4/8-socket server with RAID? A 60TB dataset can fit in one server. Isn't Spark intended for massive clusters, like dozens or hundreds of servers, over petabytes of data?

2 comments

openasocket 3578 days ago

Note that it's 60TB compressed.

Mostly it's about speed. With multiple machines you can bring more cores to bear for the processing, and have more RAM to cache partial results. Postgres could certainly do the job, but I'd be surprised if it would run within an order of magnitude of these results.


brianwawok 3578 days ago

IO is pretty huge as well. You can spend lots and lots of money on a single machine with storage that can read 60 TB fast. Or you can have 100 machines, each with the cheapest possible hard drive, and smoke it on total IO.
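
As a rough illustration of the aggregate-IO point, here is a back-of-the-envelope sketch in Python. The throughput figures (2000 MB/s for one fast array, 150 MB/s per commodity disk) and the helper name `scan_hours` are assumptions picked for illustration, not numbers from the thread. It also assumes the reads split evenly and ignores network, CPU, and coordination overhead.

```python
# Back-of-the-envelope scan time: one big machine vs. many cheap ones.
# All throughput figures below are illustrative assumptions, not measurements.

def scan_hours(dataset_tb: float, machines: int, mb_per_sec_per_machine: float) -> float:
    """Hours to read the dataset once, assuming evenly split, fully parallel reads."""
    total_mb = dataset_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    aggregate_mb_per_sec = machines * mb_per_sec_per_machine
    return total_mb / aggregate_mb_per_sec / 3600

# Assumed: a single server with a fast RAID array sustaining 2000 MB/s.
single = scan_hours(60, machines=1, mb_per_sec_per_machine=2000)

# Assumed: 100 machines, each with one cheap disk sustaining 150 MB/s.
cluster = scan_hours(60, machines=100, mb_per_sec_per_machine=150)

print(f"single server: {single:.1f} h, 100-node cluster: {cluster:.1f} h")
```

Under these assumed numbers the cluster reads the data several times faster even though each individual disk is more than ten times slower. Real ratios depend heavily on the hardware and on how well the work parallelizes.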


firasd 3578 days ago

Keep in mind that the systems required depend on the actual tasks, not just whether you can fit the data on a disk. I don't think the FB Newsfeed (show more photos from your cousin because you liked their text post last week...) can be built using only SQL.
