Ask HN: Fast, In-Memory, Distributed data analysis and machine learning?
5 points by henrythe9th 4724 days ago
We're looking to implement a new data pipeline architecture at work. The primary goal is speed (the data is small enough to fit entirely in memory, sharded across multiple machines if needed). The main bottleneck is feature extraction, transformation, and iteration, which are both CPU- and read/write-intensive. Model building is not too slow, so there's no need to distribute training/testing yet. I've heard good things about Spark/Shark and Storm. Does anyone have experience with them, or recommendations? Maybe we don't even need a super sophisticated system, and a Riak/Redis K-V store cluster would do? Thanks in advance.
Having said that, Spark is really great for running iterative algorithms and will definitely fit what you have described. I suggest staying away from building it on your own using Riak/Redis (at least until you have ruled out Spark), as you will run into lots of operational issues like handling failures, resource allocation, retries, etc.
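
To give a feel for the operational work mentioned above, here is a rough, hand-rolled sketch of sharded in-memory feature extraction with per-shard retries. Everything in it is an assumption made for illustration: the function names (`shard`, `extract_features`, `process_shard`, `run_pipeline`), the toy text features, and the use of a local thread pool as a stand-in for multiple machines. It is not Spark's API and not a Riak/Redis client. It only shows the kind of bookkeeping you take on when you build this yourself.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(records, num_shards):
    """Split records round-robin into num_shards in-memory partitions."""
    return [records[i::num_shards] for i in range(num_shards)]


def extract_features(record):
    """Hypothetical feature extraction: character length and word count."""
    return {"length": len(record), "words": len(record.split())}


def process_shard(shard_records, extract, max_retries=3):
    """Apply extract to every record in one shard.

    On any exception the whole shard is retried from scratch, up to
    max_retries attempts, after which a RuntimeError is raised.
    """
    last_error = None
    for _ in range(max_retries):
        try:
            return [extract(r) for r in shard_records]
        except Exception as exc:  # deliberately broad: this is only a sketch
            last_error = exc
    raise RuntimeError(
        f"shard failed after {max_retries} attempts"
    ) from last_error


def run_pipeline(records, num_shards=4, extract=extract_features):
    """Shard the records, process shards concurrently, flatten the results.

    Threads stand in for worker machines here. Output order follows shard
    order, so it is not the original record order.
    """
    shards = shard(records, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        results = list(pool.map(lambda s: process_shard(s, extract), shards))
    return [features for shard_result in results for features in shard_result]


if __name__ == "__main__":
    print(run_pipeline(["a b", "c", "d e f"], num_shards=2))
```

Even this toy version leaves out most of the hard parts: detecting a dead worker, re-assigning its shard, deciding how much memory and CPU each job gets, and avoiding duplicate writes when a retried shard had partially succeeded. That is the gap a framework like Spark is meant to close.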