by henrythe9th
4724 days ago
Thanks for your input. We're talking roughly 5 GB of data, and growth should be linear over the next 6 months. Money is not a big concern; speed of iteration is key. We frequently run different processing algorithms over the entire stored dataset (the stored data doesn't change) and update the calculated features each time. Not sure if this helps narrow things down. Thanks.
However, 5 GB of data is literally nothing, and that statement holds until your data size is at least 50-60 GB. Given that machines with 64 GB of RAM are now commodity, I would just load the entire thing into RAM and write a multi-threaded program. It sounds old school, but regardless of how well documented Hadoop, Spark and Storm are, there is still a learning curve and a maintenance cost, both of which are worth paying only if you see your data rapidly growing into the X TB range. Otherwise, it might be easier to just stick it on a single machine and get stuff done.
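To make the "load it in RAM and go multi-threaded" idea concrete, here is a rough Java sketch. The class name `InMemoryFeatures`, the `feature` function (a plain mean) and the `double[][]` layout are all made up for illustration; your records and algorithms will look different. It assumes the whole dataset is already loaded into an array and never changes, so the worker threads can read it without any locking.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class InMemoryFeatures {
    // Hypothetical feature: the mean of one record's values.
    // Swap in whatever algorithm you are currently iterating on.
    public static double feature(double[] record) {
        return Arrays.stream(record).average().orElse(0.0);
    }

    // Recompute the feature for every record. The parallel stream spreads
    // the work over the JVM's common fork-join pool; the dataset is only
    // read, never written, so no synchronization is needed.
    public static double[] computeAll(double[][] dataset) {
        return IntStream.range(0, dataset.length)
                .parallel()
                .mapToDouble(i -> feature(dataset[i]))
                .toArray();
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real dataset held in RAM.
        double[][] dataset = { {1.0, 2.0, 3.0}, {10.0, 20.0} };
        System.out.println(Arrays.toString(computeAll(dataset)));
    }
}
```

Each new algorithm is then just a different `feature` function run over the same in-memory array, which keeps the edit-run loop short.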
You can stick to Scala/Java, and so long as you develop good abstractions around your core algorithms, you can always move to Spark/Hadoop when you need it. Feel free to send me an email if you want to talk more (email in profile).
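One possible shape for such an abstraction, sketched in Java: keep each algorithm a pure function from record to feature, and hide how it gets executed behind a small runner interface. The names `FeatureAlgorithm`, `Runner` and `LocalParallelRunner` are invented for this sketch and do not come from any library. The assumption is that a cluster-backed runner could later implement the same interface, so the algorithms themselves would not need to change.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FeaturePipeline {
    // Hypothetical: an algorithm is a pure function from a record to a
    // feature, with no knowledge of threads or clusters.
    public interface FeatureAlgorithm<R, F> extends Function<R, F> {}

    // Hypothetical: the only place that knows how the work is executed.
    public interface Runner {
        <R, F> List<F> run(List<R> records, FeatureAlgorithm<R, F> algorithm);
    }

    // Single-machine runner: applies the algorithm on multiple threads,
    // returning features in the same order as the input records.
    public static final class LocalParallelRunner implements Runner {
        @Override
        public <R, F> List<F> run(List<R> records, FeatureAlgorithm<R, F> algorithm) {
            return records.parallelStream().map(algorithm).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        // Toy algorithm for illustration: the length of each record.
        FeatureAlgorithm<String, Integer> length = String::length;
        Runner runner = new LocalParallelRunner();
        System.out.println(runner.run(Arrays.asList("a", "bcd"), length));
    }
}
```

The design choice here is that algorithms never touch threads or I/O directly, so moving off a single machine means writing one new `Runner` rather than rewriting every algorithm.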