by karterk 4724 days ago
Hard to offer suggestions without knowing the rough size of the data. Depending on how much money you're willing to spend, even 1 TB falls into "fits in memory" territory.

Having said that, Spark is really great for running iterative algorithms and will definitely fit what you have described. I suggest staying away from building it on your own using Riak/Redis (at least until you have ruled out Spark), as you will run into lots of operational issues like handling failures, resource allocation, retries, etc.


Thanks for your input. We're talking roughly 5GB of data. Data growth should be linear over the next 6 months. Money is not a big concern. Speed of iteration is key.

We frequently run different processing algorithms over the entire stored dataset (stored data doesn't change) and update the calculated features each time. Not sure if this helps narrow things down. Thanks

A little bit of context: I have done a lot of Hadoop, and am also well aware of Spark and Storm. Storm is mostly suited to handling a stream of real-time data. Spark is built for running iterative algorithms: it can read from HDFS, and with the expressiveness of Scala, it's great for building machine-learning related stuff.

However, 5GB of data is literally nothing, and that statement holds until your data size is at least 50-60 GB. Given that machines with 64 GB of RAM are now commodity, I would just load the entire thing in RAM and write a multi-threaded program. Sounds old school, but regardless of how well documented Hadoop, Spark and Storm are, there is still a learning curve and a maintenance cost. Both are worth paying only if you see your data rapidly growing to the X TB range. Otherwise, it might be easier to just stick it on a single machine and get stuff done.
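To make the "load it in RAM and use threads" idea concrete, here is a minimal sketch in Java. The record layout (a plain `double[]`) and the feature (squaring each value) are placeholders I made up for illustration; the point is only that a full pass over an in-memory array parallelizes with a few lines of standard library code.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class InMemoryPass {
    // Hypothetical feature: the square of each record. Swap in the real computation.
    static double[] computeFeatures(double[] records) {
        double[] features = new double[records.length];
        // The parallel stream splits the index range across the common fork-join pool.
        // Each index is written by exactly one task, so no locking is needed.
        IntStream.range(0, records.length)
                 .parallel()
                 .forEach(i -> features[i] = records[i] * records[i]);
        return features;
    }

    // A parallel reduction over the computed features.
    static double sum(double[] values) {
        return Arrays.stream(values).parallel().sum();
    }

    public static void main(String[] args) {
        double[] records = {1.0, 2.0, 3.0, 4.0};
        double[] features = computeFeatures(records);
        System.out.println(Arrays.toString(features)); // [1.0, 4.0, 9.0, 16.0]
        System.out.println(sum(features));             // 30.0
    }
}
```

Rerunning a different algorithm over the same stored data is then just another call against the array that is already resident in memory.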

You can stick to Scala/Java, and so long as you develop good abstractions around your core algorithms, you can always move to Spark/Hadoop when you need it. Feel free to send me an email if you want to talk more (email in profile).
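One way to read "good abstractions" is to describe each algorithm as a per-record step plus an associative merge, and keep the execution strategy separate. The sketch below is an assumption about what that could look like, not a prescribed design: `Job` and `LocalRunner` are hypothetical names, and the runner here simply uses a parallel stream on one machine. A job written in this map-then-reduce shape should be comparatively easy to port to a cluster framework later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class LocalRunner {
    /** Hypothetical job description: what to compute, not where it runs. */
    static final class Job<R, A> {
        final A zero;                  // identity value for merge
        final Function<R, A> perRecord; // step applied to every record
        final BinaryOperator<A> merge;  // must be associative

        Job(A zero, Function<R, A> perRecord, BinaryOperator<A> merge) {
            this.zero = zero;
            this.perRecord = perRecord;
            this.merge = merge;
        }
    }

    // Single-machine execution; a distributed runner could accept the same Job.
    static <R, A> A run(Job<R, A> job, List<R> records) {
        return records.parallelStream()
                      .map(job.perRecord)
                      .reduce(job.zero, job.merge);
    }

    public static void main(String[] args) {
        // Example job: total character count across all records.
        Job<String, Integer> totalLength = new Job<>(0, String::length, Integer::sum);
        System.out.println(run(totalLength, Arrays.asList("ab", "cde", ""))); // 5
    }
}
```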

Thanks for the suggestion. We've actually thought about just writing a multithreaded system on a single machine. What type of in-memory storage would you recommend in this case? (which hopefully may be extended to a distributed cluster of machines if 1 really large machine becomes expensive)

Thanks

I suggest storing your data in files and just memory mapping them during start-up. The JVM can't map more than 2GB in a single mapping (a MappedByteBuffer is indexed with an int), so just create logical shards and map them independently.

Since you will mostly be iterating over all records in your iterative algorithms, storing them in a separate in-memory DB makes no sense (you would have to call an external process over a socket).
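Here is a rough sketch of the sharded mapping using `java.nio`. It assumes the shards are fixed-size windows over one file; separate files per shard would work just as well. The class name, `shardBytes` parameter and `byteAt` helper are mine, not from any library.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ShardedMap {
    // Map a file as consecutive read-only windows of at most shardBytes each.
    static MappedByteBuffer[] mapShards(Path file, long shardBytes) throws IOException {
        if (shardBytes <= 0 || shardBytes > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("shardBytes must be in 1..Integer.MAX_VALUE");
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            int count = (int) ((size + shardBytes - 1) / shardBytes); // ceiling division
            MappedByteBuffer[] shards = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long offset = i * shardBytes;
                long length = Math.min(shardBytes, size - offset);
                shards[i] = ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
            }
            return shards; // a mapping stays valid after its channel is closed
        }
    }

    // Read one byte at a global file position by picking the right shard.
    static byte byteAt(MappedByteBuffer[] shards, long shardBytes, long pos) {
        return shards[(int) (pos / shardBytes)].get((int) (pos % shardBytes));
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("shards", ".bin");
        Files.write(p, new byte[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        MappedByteBuffer[] shards = mapShards(p, 4); // tiny shard size just for the demo
        System.out.println(shards.length);           // 3
        System.out.println(byteAt(shards, 4, 9));    // 9
    }
}
```

In practice you would pick a shard size near 1GB and align it to your record size so that no record straddles two shards.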

You can then use something like ZooKeeper/Akka for managing nodes in the event that you have to scale out. Even a simple master/slave set-up using Thrift services will do.