by iamjochem 3225 days ago
even if node-to-node communication in a cluster (hadoop or otherwise) itself is not secured, is it not reasonable to secure external access to the cluster itself (i.e. with a firewall)?

from an outsider perspective (I've never used/run hadoop) I cannot see much reason for exposing the cluster to the outside world - either a web-app acts as an intermediary or access can be provided via VPN/ssh-tunnel/etc

... just curious why a fully/publicly exposed cluster would be a "requirement"? or does it come down to the fact that firewalling an AWS environment is as painful as (if not more painful than) "kerberizing" a [hadoop] cluster? (I kind of assumed AWS has firewalling functionality that is fairly plug'n'play ... a quick search doesn't really back that up though)

2 comments

I used to work at a big data consulting company and dealt with hadoop clusters at a bunch of different companies. What you described was absolutely the norm: the entire cluster was closed to the outside world, except for one gateway machine that allowed ssh access, and everything within the cluster was totally open. Sometimes some web services were open to the company VPN.

Kerberizing is a pain but not usually needed. You're correct that AWS firewall rules are very easy.

What you're seeing in this article is the exception: people doing it totally wrong.

In my experience these instances usually are either test/play clusters or just set up by people who don't know their way around Hadoop and its security features.

So: Yes, a Hadoop cluster should always be firewalled (whether or not the cluster itself is secured) and have well-defined access points via edge nodes, plus holes in the firewall only for the specific services that need to be exposed.
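
To make the layout described in this thread concrete, here is a rough sketch of the rule set as a default-deny check in plain Python. It is only a model, not real AWS or Hadoop configuration: the CIDR ranges, the role names ("edge", "master", "worker") and the choice of port 8088 for a VPN-only web UI are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional

# Hypothetical address ranges, chosen only for this example.
ANYWHERE = "0.0.0.0/0"
CLUSTER_CIDR = "10.0.0.0/24"   # assumed private subnet holding the cluster
VPN_CIDR = "10.8.0.0/16"       # assumed company VPN range


@dataclass(frozen=True)
class IngressRule:
    source_cidr: str      # who may connect
    port: Optional[int]   # None means "all ports"
    target_role: str      # "edge", "master", "worker", or "any"


RULES = [
    # ssh from the outside world, but only to the gateway/edge node.
    IngressRule(ANYWHERE, 22, "edge"),
    # One web UI exposed to the company VPN only (port is illustrative).
    IngressRule(VPN_CIDR, 8088, "master"),
    # Inside the cluster everything is open, as described above.
    IngressRule(CLUSTER_CIDR, None, "any"),
]


def is_allowed(source_ip: str, port: int, target_role: str, rules=RULES) -> bool:
    """Return True if any rule admits the connection; deny by default."""
    src = ip_address(source_ip)
    for rule in rules:
        if rule.target_role not in (target_role, "any"):
            continue
        if rule.port is not None and rule.port != port:
            continue
        if src in ip_network(rule.source_cidr):
            return True
    return False
```

With these assumed rules, `is_allowed("203.0.113.7", 22, "edge")` is admitted, while the same outside address is refused on any worker port. In a real AWS setup the equivalent would be expressed as security group ingress rules rather than application code.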