by filereaper 3410 days ago
I apologize in advance, but whenever people claim to use an in-memory big-data system, how exactly does this end up working?

You can only stuff so much into memory. You can scale up vertically in terms of memory, but unless you buy a massive big-iron POWER box, you end up scaling out horizontally. But with each of these in-memory appliances, what happens when you need to spill out to disk?

In essence, why should one bother with these in-memory appliances as opposed to buying boxes with fast SSDs instead? Sure, you spill out to disk, but do you take that big of a hit compared to the enormous cost of keeping everything in memory?
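
One way to put a number on "that big of a hit" is a back-of-the-envelope average access time: a weighted mix of memory and SSD latency, depending on how often a lookup is served from RAM. This is only a sketch. The function name and the default latencies (roughly 100 ns for memory, roughly 100 µs for an SSD read) are round placeholder assumptions, not measurements from any particular system, so plug in your own.

```python
def effective_latency_ns(hit_ratio, mem_ns=100.0, ssd_ns=100_000.0):
    """Average access time when a fraction `hit_ratio` of lookups is served
    from memory and the rest spills to SSD.

    mem_ns and ssd_ns are assumed order-of-magnitude placeholders,
    not benchmarks of real hardware.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * mem_ns + (1.0 - hit_ratio) * ssd_ns


def slowdown_vs_memory(hit_ratio, mem_ns=100.0, ssd_ns=100_000.0):
    """How many times slower the mixed setup is than pure in-memory access."""
    return effective_latency_ns(hit_ratio, mem_ns, ssd_ns) / mem_ns
```

With these placeholder numbers, even a small share of lookups going to SSD dominates the average, which is the argument for keeping latency-sensitive working sets fully in memory. For throughput-oriented batch work the picture can look very different.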

3 comments

I think there are many use cases. Fraud detection, risk analysis in finance, weather simulations, etc. These don't need to spill out to disk and are a perfect use case for these systems.

A friend of mine works for a company that does high-speed weather analysis to make predictions for energy brokers, to predict prices of wind / solar energy on the market. They use these kinds of systems extensively, because of the speed and volatility of the data. Fascinating stuff.

You can also measure cloud oktas from satellite imagery if you want to get fancy in terms of solar energy supply-side forecasting: https://axibase.com/calculating-cloud-oktas/

Maybe I'm misunderstanding the problem, but why can't you scale out horizontally?

If the problem is that queries or sets of data might have to jump nodes, couldn't the data be laid out at write time based on an assumption about what sorts of queries will happen?

Optimize so that node spanning is rare, eat the cost when it does happen, and let those 1/n queries disappear into the average.
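
A minimal sketch of that idea, assuming the common query is "everything for one key" (say, one customer): pick the node from a stable hash of that key at write time, so the common query touches one node and only multi-key queries span nodes. The `Cluster` class, `node_for`, and the key names are hypothetical, and the nodes are plain in-process dicts standing in for real machines.

```python
import hashlib


def node_for(partition_key, num_nodes):
    """Map a partition key to a node index using a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class Cluster:
    """Toy cluster: each node is just a dict living in this process."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def write(self, partition_key, record_id, record):
        # The placement decision is made here, at write time.
        node = self.nodes[node_for(partition_key, len(self.nodes))]
        node.setdefault(partition_key, {})[record_id] = record

    def query(self, partition_keys):
        """Return (rows, number_of_nodes_touched) for the given keys."""
        touched = set()
        rows = []
        for key in partition_keys:
            index = node_for(key, len(self.nodes))
            touched.add(index)
            rows.extend(self.nodes[index].get(key, {}).values())
        return rows, len(touched)
```

A single-key query always reports one node touched; a query over several keys may fan out, which is the cost you eat when it happens.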

My lab works with multi-terabyte datasets on a regular basis. We have big machines to do machine learning on, but when they're not in use, I can tell you that it's way easier to provision and write a single- or multi-threaded script that just loads everything into memory rather than deal with networking and partitioned data.

Imagine the difference between setting up a Spark cluster and writing a for loop. For instance, for reasons, someone created a 1TB HDF5 file. Luckily, we had a computer with 500GB+ of RAM and lots of swap, so instead of having to hack the file apart and figure out how to chunk or parallelize it, we loaded it into memory for a one-time batch job and did other useful things in the meantime.
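
To make the "for loop" side of that comparison concrete, here is a rough sketch of the two styles on a single machine. It is not the lab's actual job: a flat binary file of 64-bit floats stands in for the HDF5 file, and all function names are made up for illustration. The first version loads everything and loops; the second does the same work in fixed-size chunks, which is the extra bookkeeping you take on when the data does not fit.

```python
import array
import os
import tempfile

ITEM_SIZE = array.array("d").itemsize  # bytes per stored float


def sum_in_memory(path):
    """Load the whole file, then loop. Needs RAM (or swap) for all of it."""
    values = array.array("d")
    with open(path, "rb") as f:
        values.frombytes(f.read())
    total = 0.0
    for v in values:
        total += v
    return total


def sum_chunked(path, chunk_items=1024):
    """Same result, holding only one chunk in memory at a time."""
    total = 0.0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_items * ITEM_SIZE)
            if not buf:
                break
            chunk = array.array("d")
            chunk.frombytes(buf)
            for v in chunk:
                total += v
    return total


def demo(n=10_000):
    """Write n floats (0, 1, ..., n-1) to a temp file and sum them both ways."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            array.array("d", (float(i) for i in range(n))).tofile(f)
        return sum_in_memory(path), sum_chunked(path)
    finally:
        os.remove(path)
```

Even in this toy case the chunked version carries more moving parts (chunk size, end-of-file handling), and that is before any networking or partitioning enters the picture.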

It's not big data if it fits in memory... This article is demonstrating an architecture that may scale well with big data.