|
|
|
|
|
by bunderbunder
668 days ago
|
|
To your last point, it's been interesting to watch people struggle to effectively use technologies like Hadoop and Spark now that we've all moved to the cloud. Originally, the whole point of the Hadoop architecture was that the data were pre-chunked and already sitting on the local storage of your compute nodes, so that the overhead to transfer at least that first map task was effectively zero, and your big data transfer cost was collecting all the (hopefully much smaller than your input data) results of that into one place in the reduce step. Now we're in the cloud and the original data's all sitting in object storage. So shoving all your raw data through a tiny small slow network interface is an essential first step of any job, and it's not nearly so easy to get speedups that were as impressive as what people were doing 15 years ago. That said I wouldn't want to go back. HDFS clusters were such a PITA to work with and I'm not the one paying the monthly AWS bill. |
|