by bitcointicker 3830 days ago
My recommendations...

For automated cluster building - https://ambari.apache.org/

For analysing your data, dynamically building queries and sharing this with other people in your company - https://zeppelin.incubator.apache.org/

And coming soon - https://www.zeppelinhub.com/

5 comments

Ambari looks nice from the outside, but it's some kind of zombie that is full of OS-specific stuff, opaque Puppet stuff, Python, Java, and it even brings Nagios and Ganglia to the party. If it works it's probably fine; otherwise, have fun debugging your way through that stuff.

We are happy with https://github.com/saltstack-formulas/hadoop-formula and PXE booting an image.

You can configure all aspects out of a single pillar file.

Still full of warts but at least you have full control.

If you haven't had the chance yet, I suggest you try Cloudera Manager & CDH instead of Ambari. I use both with clients, and CM is years ahead of Ambari in terms of functionality and stability.

I looked pretty hard at Zeppelin around 6 months ago, comparing it to IPython/Jupyter for use with Spark.

I found Zeppelin hard to install (I'm a Java programmer and Zeppelin is in Scala/Java so I expected the opposite). It was also extremely buggy.

Jupyter OTOH worked straight away, and even getting Spark integration working was straightforward compared to getting Zeppelin to work at all.

Zeppelin looks nicer, and some of the features look great. It just isn't there for production use atm though.

What about provisioning clusters that don't require Hadoop? I suppose this could be akin to the comment about Redis: we're working on deploying Kafka, Storm, and ZooKeeper (none of which need Hadoop), and provisioning and node management (membership, leader election) in a dynamic environment (e.g. AWS autoscaling) is not at all obvious. There's also a paucity of substantive information about scaling these clusters dynamically.

I'm especially excited about Zeppelin. Using IPython for SciPy and smaller datasets is great. I would love for the big data space I work in and Python's tooling to come together more.

IPython/Jupyter works well against Spark. We have it working in production like that, and both Google[1] and IBM[2] do the same.

[1] https://cloud.google.com/datalab/overview

[2] https://www.ng.bluemix.net/docs/services/AnalyticsforApacheS...
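
For anyone wanting to try the Jupyter route, here is a rough sketch of one common way to wire it up, not necessarily the setup described above. It assumes a Spark release whose `bin/pyspark` launcher honours the `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` environment variables. The helper names and the `/opt/spark` path are made up for illustration.

```python
import os


def jupyter_pyspark_env(spark_home, base_env=None):
    """Return an environment in which bin/pyspark starts a Jupyter notebook.

    spark_home is an assumed install location, e.g. "/opt/spark".
    """
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    # The pyspark launcher reads these two variables to pick the driver front end.
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


def pyspark_command(spark_home, master="local[*]"):
    """Build the argv list for launching pyspark against the given master URL."""
    return [os.path.join(spark_home, "bin", "pyspark"), "--master", master]
```

Passing both results to `subprocess.call(pyspark_command("/opt/spark"), env=jupyter_pyspark_env("/opt/spark"))` should bring up a notebook server whose kernels have Spark available, provided Jupyter and Spark are installed where assumed.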

I recently looked into notebooks and found Beaker (http://beakernotebook.com) to be especially interesting in its support for passing data across languages.