| HN Mirror



	by henrythe9th 4772 days ago
	Thanks for your input. We're talking roughly 5GB of data. Data growth should be linear over the next 6 months. Money is not a big concern. Speed of iteration is key. We frequently run different processing algorithms over the entire stored dataset (stored data doesn't change) and update the calculated features each time. Not sure if this helps narrow things down. Thanks

1 comments

karterk 4772 days ago

A little bit of context: I have done a lot of hadoop work, and I'm also well aware of spark and storm. Storm is mostly suited to handling a stream of real-time data. Spark is specifically for running iterative algorithms - it can read from HDFS, and with the expressiveness of Scala, it's great for building machine-learning related stuff.

However, 5GB of data is literally nothing, and that statement holds until your data size is at least 50-60 GB. Given that 64 GB RAM machines are now commodity, I would just load the entire thing in RAM and write a multi-threaded program. Sounds old school, but regardless of how well documented hadoop, spark and storm are, there is still a learning curve and a maintenance cost, both of which are only worth it if you see your data rapidly growing to the X TB range. Otherwise, it might just be easier to stick it on a single machine and get stuff done.
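
To make the "load it in RAM and go multi-threaded" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the records are a plain `double[]`, the "feature" is just a sum of squares, and `InMemoryScan` / `sumOfSquares` are made-up names. The array is split into one contiguous chunk per thread and the partial results are added up at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: one full pass over an in-memory dataset using a fixed thread pool.
class InMemoryScan {

    // Splits the array into contiguous chunks (one task per thread) and sums the partial results.
    static double sumOfSquares(double[] records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (records.length + threads - 1) / threads; // ceiling division
            List<Future<Double>> partials = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(records.length, t * chunk);
                final int to = Math.min(records.length, from + chunk);
                partials.add(pool.submit(() -> {
                    double acc = 0.0;
                    for (int i = from; i < to; i++) {
                        acc += records[i] * records[i];
                    }
                    return acc;
                }));
            }
            double total = 0.0;
            for (Future<Double> f : partials) {
                total += f.get(); // blocks until that chunk is done
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] records = new double[1_000_000];
        for (int i = 0; i < records.length; i++) {
            records[i] = i % 10;
        }
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(sumOfSquares(records, threads));
    }
}
```

Since the stored data never changes, the worker threads only read the shared array and need no locking; only the per-chunk accumulators are written.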

You can stick to Scala/Java, and so long as you develop good abstractions around your core algorithms, you can always move to spark/hadoop when you need it. Feel free to send me an email if you want to talk more (email in profile).
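
One possible shape for such an abstraction, as a rough sketch only: express each algorithm as "fold one record into a partial result" plus "merge two partial results", and keep the code that decides where it runs separate. The names `FeatureJob`, `Stats`, `MeanJob` and `LocalRunner` are hypothetical, and the mean is just a stand-in for a real feature computation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical abstraction: the algorithm knows nothing about threads or clusters.
interface FeatureJob<R, A> {
    A zero();                   // empty partial result
    A accumulate(A acc, R rec); // fold one record in; returns a new value, does not mutate acc
    A merge(A left, A right);   // combine two partial results
}

// Immutable partial result for computing a mean.
final class Stats {
    final long count;
    final double sum;

    Stats(long count, double sum) {
        this.count = count;
        this.sum = sum;
    }

    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }
}

// Example job: mean of all records.
final class MeanJob implements FeatureJob<Double, Stats> {
    public Stats zero() {
        return new Stats(0, 0.0);
    }

    public Stats accumulate(Stats acc, Double rec) {
        return new Stats(acc.count + 1, acc.sum + rec);
    }

    public Stats merge(Stats left, Stats right) {
        return new Stats(left.count + right.count, left.sum + right.sum);
    }
}

// Single-machine runner; a cluster-backed runner could later accept the same FeatureJob.
final class LocalRunner {
    static <R, A> A run(FeatureJob<R, A> job, List<R> records) {
        return records.parallelStream().reduce(job.zero(), job::accumulate, job::merge);
    }

    public static void main(String[] args) {
        List<Double> records = Arrays.asList(1.0, 2.0, 3.0, 4.0);
        System.out.println(LocalRunner.run(new MeanJob(), records).mean());
    }
}
```

The fold-plus-merge split is the same shape that map/reduce style systems expect, so under this assumption only the runner would need replacing on a move to spark/hadoop, not the jobs themselves.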


henrythe9th 4771 days ago

Thanks for the suggestion. We've actually thought about just writing a multithreaded system on a single machine. What type of in-memory storage would you recommend in this case? (which hopefully may be extended to a distributed cluster of machines if 1 really large machine becomes expensive)

Thanks


karterk 4771 days ago

I suggest storing your data in files and just memory mapping them during start-up. On the JVM a single mapping can't exceed 2GB (a MappedByteBuffer is indexed by int), so just create logical shards and map them independently.
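
A minimal sketch of the shard-and-map approach, with some loud assumptions: each shard is its own file of at most `Integer.MAX_VALUE` bytes, records are fixed-width 8-byte longs, and `ShardedMmap`, `mapShards`, `sumLongs` and `demo` are made-up names. `demo` only exists to write two tiny temp shards so the example runs on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example: map several shard files read-only and iterate over all records.
class ShardedMmap {

    // Maps every shard file in full. Each file must be at most Integer.MAX_VALUE bytes.
    static List<MappedByteBuffer> mapShards(List<Path> shardFiles) throws IOException {
        List<MappedByteBuffer> shards = new ArrayList<>();
        for (Path file : shardFiles) {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                // The mapping stays valid after the channel is closed.
                shards.add(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
            }
        }
        return shards;
    }

    // One full pass over all shards, treating each as a sequence of big-endian longs.
    static long sumLongs(List<MappedByteBuffer> shards) {
        long total = 0;
        for (MappedByteBuffer shard : shards) {
            ByteBuffer view = shard.duplicate(); // independent position, same mapped memory
            while (view.remaining() >= Long.BYTES) {
                total += view.getLong();
            }
        }
        return total;
    }

    // Writes the given longs to a file in big-endian order (ByteBuffer's default).
    static void writeLongs(Path file, long... values) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            buf.putLong(v);
        }
        Files.write(file, buf.array());
    }

    // Creates two small temporary shards, maps them, and sums their contents.
    static long demo() {
        try {
            Path a = Files.createTempFile("shard-0-", ".bin");
            Path b = Files.createTempFile("shard-1-", ".bin");
            a.toFile().deleteOnExit();
            b.toFile().deleteOnExit();
            writeLongs(a, 1L, 2L);
            writeLongs(b, 3L, 4L);
            return sumLongs(mapShards(Arrays.asList(a, b)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Each shard can be handed to a different worker thread as long as every thread works on its own `duplicate()` of the buffer, since buffer positions are not thread-safe.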

Since your iterative algorithms will mostly be iterating over all records, storing them in a separate in-memory DB makes no sense (you would have to call an external process over a socket for every read).

You can then use a framework like zookeeper/akka for managing nodes in the event that you have to scale out. Even a simple master/slave set-up using thrift services will do.
