by Loic
3410 days ago
The question is more: what do you want as a result? Suppose you search your 8TB database of molecules for the 1000 molecules most similar to a given one. You have 16 cores, so you cut the 8TB into 500GB chunks, continuously preload 1GB of molecules per core, accumulate 16*1000 candidate molecules, and merge at the end. You can do it on a single system, and you are still working with a TB-size dataset. This means the size of the dataset is not the only factor: you also need to take into account the operations performed on each "element/document", the size of the intermediate datasets, the size of the final results, and some more stuff (encoding, etc.).
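
To make the chunk-and-merge idea concrete, here is a minimal Python sketch of it. Everything named in it is an assumption for illustration: molecules are stood in for by integer bit fingerprints, similarity is a Tanimoto coefficient on those bits, and `search`, `top_k_in_chunk` and `merge_top_k` are hypothetical helper names. Threads are used only to keep the sketch short; real scoring over 500GB chunks would stream from disk and likely use processes.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as ints (assumed encoding)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def top_k_in_chunk(chunk, query, k):
    """Score one chunk and keep only its k best (score, fingerprint) pairs."""
    scored = ((tanimoto(query, fp), fp) for fp in chunk)
    return heapq.nlargest(k, scored)


def merge_top_k(partials, k):
    """Merge the per-chunk winners (at most workers * k items) into the global top k."""
    return heapq.nlargest(k, chain.from_iterable(partials))


def search(chunks, query, k, workers=16):
    """One chunk per worker; only the small partial results are kept in memory."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: top_k_in_chunk(c, query, k), chunks))
    return merge_top_k(partials, k)


if __name__ == "__main__":
    fingerprints = list(range(1, 200))             # toy stand-in for the database
    chunks = [fingerprints[i::4] for i in range(4)]
    for score, fp in search(chunks, query=0b1011, k=5):
        print(fp, round(score, 3))
```

The point is that the intermediate state is only `workers * k` candidates, no matter how large the input is, which is why the operation and the result size matter as much as the raw dataset size.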
But how do things change when the dataset grows to 9TB? Now we need more than one HD. Hadoop + Spark are built for this exact use case...