by mjpt777 4987 days ago
How would object pooling work if I wanted to query a large table of data? That is a common requirement in real big-data applications.

For that kind of problem I'd probably use Hadoop, which does object pooling internally for the objects it passes into your mappers and reducers.

For a non-Hadoop data source you could do the same thing by hand: stream the data in from the table, copy each row into an object taken from your pool, and pass the objects to your reducer function in small batches.
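A rough sketch of what "by hand" might look like, in case it helps. Everything here is made up for illustration: the `Row` schema (an id and an amount), the `RowPool` class, and the `sumInBatches` reducer are hypothetical names, the pool is single-threaded, and the table is stood in for by a plain `Iterator<double[]>` rather than a real database cursor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class PooledBatchReducer {
    /** Mutable, reusable row holder (hypothetical schema: id and amount). */
    static final class Row {
        long id;
        double amount;
    }

    /** Minimal single-threaded pool that recycles Row instances. */
    static final class RowPool {
        private final ArrayDeque<Row> free = new ArrayDeque<>();
        int allocated = 0; // counts how many Row objects were ever created

        Row acquire() {
            Row r = free.poll();
            if (r == null) {
                r = new Row();
                allocated++;
            }
            return r;
        }

        void release(Row r) {
            free.push(r);
        }
    }

    /**
     * Streams raw rows from the source, copies each one into a pooled Row,
     * and reduces (here: sums the amounts) one small batch at a time.
     * Rows go back to the pool as soon as their batch has been reduced.
     */
    static double sumInBatches(Iterator<double[]> source, int batchSize, RowPool pool) {
        List<Row> batch = new ArrayList<>(batchSize);
        double total = 0.0;
        while (source.hasNext()) {
            double[] raw = source.next();
            Row r = pool.acquire();
            r.id = (long) raw[0];
            r.amount = raw[1];
            batch.add(r);

            // Reduce when the batch is full or the source is exhausted.
            if (batch.size() == batchSize || !source.hasNext()) {
                for (Row row : batch) {
                    total += row.amount;
                }
                for (Row row : batch) {
                    pool.release(row);
                }
                batch.clear();
            }
        }
        return total;
    }
}
```

With a batch size of 4, the pool never holds more than 4 `Row` objects no matter how many rows stream through, so the allocation rate stays flat as the table grows.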

Interesting. It sounds like your workloads are I/O-dominated, since you do not mind paying Hadoop's JVM startup cost for each query on each node. I'm more often looking at large datasets that are entirely memory resident, which tends to drive the design this way. In finance, queries need latencies well under a second, which Hadoop cannot come close to satisfying. This is really a comparison of batch versus real-time analytics.
You're right that most of my big-data experience is batch work, and outside of finance. I guess I'm finding it hard to envision the kind of data where you'd want to work on the whole set, yet the set is small enough to fit into memory. For real-time analytics, wouldn't you want to stream the data and reduce it to the representation you need as it comes in?
In finance you may be re-evaluating a whole portfolio of assets, or doing a value at risk (VaR) calculation across everything. More often you want low-latency access to the entire dataset without going to disk. For this, the entire dataset must be memory resident and compact.
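To make "memory resident and compact" a bit more concrete, here is a minimal sketch of one common way to get there on the JVM: keep each field in its own primitive array instead of allocating one object per position. The class name `PortfolioColumns` and its fields are hypothetical, and the full-scan revaluation is a deliberately simple stand-in, not a VaR calculation.

```java
final class PortfolioColumns {
    // One primitive array per field: no per-row object headers or pointers.
    private final long[] instrumentIds;
    private final double[] quantities;
    private final double[] prices;
    private int size = 0;

    PortfolioColumns(int capacity) {
        instrumentIds = new long[capacity];
        quantities = new double[capacity];
        prices = new double[capacity];
    }

    /** Appends one position; throws if the fixed capacity is exceeded. */
    void add(long instrumentId, double quantity, double price) {
        if (size == instrumentIds.length) {
            throw new IllegalStateException("capacity exceeded");
        }
        instrumentIds[size] = instrumentId;
        quantities[size] = quantity;
        prices[size] = price;
        size++;
    }

    /** Overwrites the price of every position in the given instrument. */
    void reprice(long instrumentId, double newPrice) {
        for (int i = 0; i < size; i++) {
            if (instrumentIds[i] == instrumentId) {
                prices[i] = newPrice;
            }
        }
    }

    /** Revalues the whole portfolio with a sequential scan, allocating nothing. */
    double totalMarketValue() {
        double total = 0.0;
        for (int i = 0; i < size; i++) {
            total += quantities[i] * prices[i];
        }
        return total;
    }
}
```

Each position costs 24 bytes across the three arrays, and the scan walks contiguous memory, which is the property that makes whole-dataset queries cheap enough to run in-process.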