| HN Mirror



	by praseodym 4047 days ago
	And if your data doesn't fit in a single server's RAM, just buy more servers and run Apache Spark [1] on them. It's an in-memory computation engine that's really nice to program for: you don't have to worry about low-level clustering details, as you do with MapReduce. And it's way faster (10-100x) than Hadoop MapReduce. [1] https://spark.apache.org
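
For a feel of what "nice to program for" means, here is a word count written in the chained style that Spark's RDD API encourages. This is only a sketch: `ToyRDD` is a hypothetical, single-machine stand-in written in plain Python, not Spark itself. The method names mirror those of PySpark's RDD (`flatMap`, `map`, `reduceByKey`, `collect`), but nothing here is distributed.

```python
from collections import defaultdict
from functools import reduce


class ToyRDD:
    """Hypothetical single-machine stand-in for a Spark RDD (not the real API)."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, fn):
        # Apply fn to each item and flatten the resulting iterables.
        return ToyRDD(out for item in self.items for out in fn(item))

    def map(self, fn):
        # Apply fn to each item, one output per input.
        return ToyRDD(fn(item) for item in self.items)

    def reduceByKey(self, fn):
        # Group (key, value) pairs by key, then fold each group's values with fn.
        groups = defaultdict(list)
        for key, value in self.items:
            groups[key].append(value)
        return ToyRDD((key, reduce(fn, values)) for key, values in groups.items())

    def collect(self):
        # Return the contents as a plain list.
        return list(self.items)


lines = ToyRDD(["spark is fast", "spark is in memory"])
counts = dict(
    lines.flatMap(str.split)              # split each line into words
         .map(lambda word: (word, 1))     # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)  # sum the counts per word
         .collect()
)
print(counts)
```

The whole job is one expression with no explicit mapper or reducer classes, which is roughly the contrast with hand-written MapReduce that the comment is pointing at.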

1 comments

threeseed 4047 days ago

Spark is fast becoming the default tool for big data.

The recent addition of SparkR in 1.4 means that data scientists can now work with in-memory data in the cluster that was put there by Scala or DW developers.

Combine it with Tachyon (http://tachyon-project.org) and it's not hard to imagine petabytes of data all processed in memory.


studentrob 4047 days ago

Can you explain what Tachyon does that's different from what Spark already provides?

I haven't used either Spark or Tachyon. I thought the Spark solution was to just put my dataset in memory. But the Tachyon page seems to say the same thing.


nl 4047 days ago

There's a slide deck[1] that explains it rather well.

Basically, Tachyon acts as a distributed, reliable, in-memory file system.

To generalise enormously, programs have problems sharing data in RAM. Tachyon lets you share data between (say) your Spark jobs and your Hadoop Map/Reduce jobs at RAM speed, even across machines (it understands data locality, so it will attempt to keep data close to where it is being used).

[1] http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2014-10-16...
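
As a rough illustration of the idea (not of Tachyon's actual API), here is a toy model in plain Python: one job writes data into a shared in-memory store, another reads it back, and reads prefer a copy on the reader's own node. Everything in it is hypothetical, including the `ToyMemoryFS` class, the node names, and the path. It also runs in a single process, whereas the real system is distributed.

```python
class ToyMemoryFS:
    """Hypothetical in-process model of a shared in-memory file store."""

    def __init__(self):
        # path -> {node_name: bytes}; a path may have copies on several nodes
        self._blocks = {}

    def write(self, path, data, node):
        # Keep the data in memory, tagged with the node that holds it.
        self._blocks.setdefault(path, {})[node] = data

    def read(self, path, node):
        """Return (data, served_from), preferring a copy on the reader's node."""
        copies = self._blocks[path]
        if node in copies:
            return copies[node], node
        # No local copy: fall back to any node that holds one.
        remote = next(iter(copies))
        return copies[remote], remote


fs = ToyMemoryFS()

# "Job A" (imagine a Spark job running on node-1) writes its output once.
fs.write("/shared/events", b"a,b,c", node="node-1")

# "Job B" (imagine a Map/Reduce job) reads it back without touching disk.
local = fs.read("/shared/events", node="node-1")   # served from the same node
remote = fs.read("/shared/events", node="node-2")  # served from node-1
print(local, remote)
```

The point of the model is only that the data outlives the job that wrote it and is addressed by path, so a second program can pick it up from memory.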


studentrob 4047 days ago

Neat, thanks for the link. The code examples towards the end make it clear that this is pretty simple to use.


nl 4047 days ago

Yeah, most things coming from the Spark team are excellent in that respect.

I've never used Tachyon, but based on the wonderful "getting started" experience Spark gives, I'd be confident it would be similarly well thought out.


threeseed 4047 days ago

As others have explained, Tachyon does the "put dataset in memory" part.

Spark started off as the "in-memory MapReduce" but has now become a platform for running Scala, Java, Python, HiveQL, SQL and R code. It is the most active Apache project and is getting more and more powerful by the day.

Given how easy it is to get running, it wouldn't surprise me to see it being used in the years to come as the primary front end for all data needs.
