by jamesblonde 3694 days ago
In Hadoop, people are mostly using formats like Parquet and ORC, and if not, compression libs like LZO or Snappy. If you believe the Berkeley people (I don't, but the sheeple do), most Spark workloads are CPU-bound, not IO-bound. But irrespective of that, if most of your data is in a columnar storage format that already compresses it, there's no gain (only cost) in having your FS also try to compress it. JBOD is considered best practice for Hadoop. That's why we're looking at RAID0 and RAID5 - we're researchers :) Actually, MapR recommends using 3 disks in RAID0 as volumes.
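
If you want to see the "no gain, only cost" point for yourself, here's a rough sketch. It uses Python's zlib as a stand-in for both the file format's codec and the filesystem's compression, and the event names and row count are made up for illustration, so treat it as a toy rather than a benchmark of Parquet or any real FS.

```python
import random
import zlib

# Hypothetical low-cardinality column, standing in for raw table data.
rng = random.Random(0)  # fixed seed so runs are repeatable
events = [b"click", b"view", b"purchase", b"scroll",
          b"login", b"logout", b"search", b"share"]
raw = b"\n".join(rng.choice(events) for _ in range(50_000))

# First pass: stands in for the compression the file format already applies.
once = zlib.compress(raw)

# Second pass: stands in for the filesystem compressing the same bytes again.
twice = zlib.compress(once)

print(f"raw size:            {len(raw)} bytes")
print(f"after format codec:  {len(once)} bytes ({len(once) / len(raw):.2f}x)")
print(f"after FS compression: {len(twice)} bytes ({len(twice) / len(once):.2f}x)")
```

The first pass should shrink the data a lot, while the second pass should leave the size roughly where it was, and you still pay the CPU for it on every read and write.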