by qwertyuiop924 3565 days ago
But, as I pointed out in another comment, what about systems like Manta, which make transitioning from this sort of script to a full-on mapreduce cluster trivial?

Mind, I don't know the performance metrics for Manta vs Hadoop, but it's something to consider...

2 comments

Totally agree. It'd be relatively trivial to automate converting this script into a distributed application. Haven't checked Manta out, but I will. For ultimate performance, though, right now you could go for something like OpenMP + MPI, which gets you scalability/fault-tolerance. In a few months you'll also be able to use something like RaftLib as a dataflow/stream processing API for distributed computation (almost ready to roll out the distributed back-end). MPI, though, has decades of HPC research behind it, making it the most robust distributed compute platform in existence (though not the easiest to use). You think your big data problems are big...nah, supercomputers were doing today's big data back in the late '90s. Just a totally different crowd with slightly different solutions. MPI is hard to use; Spark/Storm are much easier...but much slower.
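As a rough illustration of why that conversion is mostly mechanical: a counting pipeline of the `sort | uniq -c` kind splits into an independent per-chunk step and a merge step. The sketch below uses only Python's standard library. The function names, the chunking, and the sample lines are made up for illustration and have nothing to do with the actual APIs of Manta, MPI, or RaftLib.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count each non-empty line in one chunk (roughly sort | uniq -c on one file)."""
    return Counter(line.strip() for line in lines if line.strip())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_lines(chunks):
    # Each map_chunk call is independent of the others, so in a real
    # deployment each one could run on a separate worker or node.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, partials, Counter())


# Hypothetical input: two chunks standing in for two input files.
chunks = [["win\n", "loss\n", "win\n"], ["draw\n", "win\n"]]
totals = count_lines(chunks)
```

Only the merge step needs to see more than one chunk, which is the part a distributed runtime would have to coordinate.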
In my experience, organizations have adopted Hive/Presto/Spark on top of Hadoop, which actually solves a whole bunch of problems that the "script" approach would not, with several added benefits. Executing scripts (cat, grep, uniq, sort) does not provide similar benefits, though they might be faster. A dedicated solution such as Presto by Facebook will provide similar if not even faster results.

https://prestodb.io/

Ah, so it doesn't solve data storage, and runs SQL queries, which are less capable than UNIX commands. If your data's stuck inside 15 SQL DBs, then that'd make sense, but a lot of data is just stored in flat files. And you know what's really good at analyzing flat files? Unix commands.
Did you even read it? Presto reads directly from HDFS, which is as close to distributed "flat files" as you can get. As for "SQL being less capable than UNIX commands", you have got to be kidding me. SQL allows type checking, conversion, and joins, all of which are difficult if not impossible with grep | uniq | sort etc.
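For a concrete sense of what is being argued over, here is a minimal sketch of a typed join across two flat files, written in plain Python rather than SQL or a shell pipeline. The file contents, column names, and the `total_by_user` helper are hypothetical, and nothing here is Presto's API.

```python
import csv
import io

# Hypothetical flat-file contents; in practice these would be files on disk.
USERS = "user_id,name\n1,ann\n2,bob\n"
ORDERS = "user_id,amount\n1,9.50\n1,0.50\n2,3.00\n"


def total_by_user(users_csv, orders_csv):
    """Inner-join orders to users on user_id and sum the amounts per name."""
    names = {
        int(row["user_id"]): row["name"]
        for row in csv.DictReader(io.StringIO(users_csv))
    }
    totals = {}
    for row in csv.DictReader(io.StringIO(orders_csv)):
        # Explicit conversions: a malformed value raises ValueError
        # instead of being silently treated as text.
        uid = int(row["user_id"])
        amount = float(row["amount"])
        if uid in names:  # inner join: skip orders with no matching user
            name = names[uid]
            totals[name] = totals.get(name, 0.0) + amount
    return totals


result = total_by_user(USERS, ORDERS)
```

In SQL the same thing is a single `JOIN` with a `GROUP BY`, with the conversions handled by the column types.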
I read it.

>Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

That doesn't sound like HDFS to me. I mean, I assume it can read from HDFS, but Presto is backend agnostic. You could probably write code to run it on Manta. That would be neat for people who like Presto, I guess.

Type checking and conversions, no, and table joins only matter when you're handling relational data.

Also, how many formats can Presto handle? Unix utilities can handle just about any tabular data, and you can run them against non-tabular data in a pinch (although nobody recommends it). I doubt Presto is that versatile.

Hive operates on top of HDFS.

Presto absolutely runs directly on HDFS.

Huh. Well then, I don't understand HDFS, or Facebook needs to fix Presto's front page. Both are reasonably likely.