by frak_your_couch 3801 days ago
Hey, sorry, got busy yesterday and forgot to respond to this.

Well, I work for a Hadoop distribution, so some of my biases may show through in my setup. I like to use Spark in conjunction with Hadoop; honestly, I've never used it standalone. For relational data, I ingest into Hive, since that lets me pivot to the right tool for whatever analysis I need: simple SQL via Hive, something better suited to Pig via HCatalog, or Spark via SparkSQL. For small data like this, I'll often do my analysis on a Hortonworks Sandbox.

For larger data and a more professional setting, I like to do prototyping, ad hoc investigation, etc. in Python with PySpark inside Jupyter. That generally transitions to either Java or plain Python, depending on how difficult the transition is.
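
To make that a bit more concrete, here's a rough sketch of what a notebook-style prototype against a Hive table might look like. It's only an illustration under some assumptions: the table and column names (`sales`, `region`, `amount`) and both function names are made up, and it uses the `SparkSession` API from Spark 2.0 and later rather than whatever a particular cluster happens to run.

```python
def build_summary_query(table, group_col, value_col):
    """Build a simple aggregate query; all names passed in are hypothetical."""
    return (
        f"SELECT {group_col}, COUNT(*) AS n, AVG({value_col}) AS avg_value "
        f"FROM {table} GROUP BY {group_col} ORDER BY n DESC"
    )


def run_summary(table, group_col, value_col):
    """Run the query through SparkSQL against a table in the Hive metastore."""
    # Imported lazily so build_summary_query() works without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("adhoc-prototype")  # hypothetical application name
        .enableHiveSupport()         # lets spark.sql() see Hive metastore tables
        .getOrCreate()
    )
    # Returns a DataFrame; nothing is computed until an action such as .show().
    return spark.sql(build_summary_query(table, group_col, value_col))
```

In a Jupyter cell, something like `run_summary("sales", "region", "amount").show()` would print the result, assuming a Hive table with those columns actually exists. Keeping the query in a plain function is one way to make the later move to a standalone Python job easier, though that's a design preference rather than a requirement.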

Anyway, hope that helps! Happy to answer any other questions you might have too. :)

1 comment

No worries! Gotcha, that is indeed very helpful! I appreciate it, and I'll be in touch if I have more questions in the future :)