by arrowleaf 690 days ago
Truly, 100B nodes need some sort of aggregation to have a chance at being useful. On a side project I've worked on normalizing >300GB semi-structured datasets that I could load into dataframe libraries, and I still can't imagine working with a _graph_ of that size. I thought I was a genius when I figured out I could rent cloud computing resources with nearly a terabyte of RAM for less per hour than the federal minimum wage. At scale you quickly realize that your approach to data analysis is really bound by CPU, not RAM. This is where you'd need to dust off your data structures and algorithms books. OP better be good at graph algorithms.
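To make "some sort of aggregation" concrete, here is a rough sketch of collapsing an edge list into weighted edges between groups of nodes (supernodes). The names `coarsen` and `bucket` are made up for illustration, and the modulo bucketing is only a placeholder assumption; a real pipeline would more likely group by community detection or a domain attribute.

```python
from collections import Counter

def coarsen(edges, group_of):
    """Collapse an edge list into weighted, undirected edges between groups."""
    weights = Counter()
    for u, v in edges:
        gu, gv = group_of(u), group_of(v)
        if gu != gv:  # edges inside a single group are dropped
            key = (gu, gv) if gu <= gv else (gv, gu)
            weights[key] += 1
    return dict(weights)

# Hypothetical grouping: bucket integer node ids into n_groups supernodes.
def bucket(node_id, n_groups=1000):
    return node_id % n_groups

edges = [(0, 1), (1000, 1), (0, 1000), (2, 3)]
print(coarsen(edges, bucket))
```

This streams over the edges once and only keeps the supernode edge counts in memory, so the cost is mostly CPU time per edge rather than RAM.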

1) 100B? Try a thousand. Of course context matters, but I think it is common to overestimate the amount of information that can be visually conveyed at once. But it is also common to make errors in aggregation, or errors in how one interprets aggregation.

2) You may be interested in the large body of open source HPC visualization work. LLNL and ORNL are the two dominant labs in that space. Your issue might also be I/O, since you can generate data faster than you can visualize it. One paradigm that HPC people use is "in situ" visualization, where you visualize at runtime so that you do not hold back computation. At this scale, if you're not massively parallelizing your work, then it isn't the CPU that's the bottleneck, but the thing between the chair and keyboard. The downside of in situ is that you have to hope you are visualizing the right data at the right time. But this paradigm also includes pushing data to another machine that performs the processing/visualization or even storage (e.g. compute on the fast machine and push data to a machine with lots of memory, which handles storage; or, more advanced, send one stream to a visualization machine and another to storage). Check out ADIOS2 for the I/O kind of stuff.

https://github.com/ornladios/ADIOS2
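
As a toy illustration of the in situ idea (this is not ADIOS2 code, just Python's standard library, and every function name here is hypothetical), the compute loop below hands snapshots to a consumer thread through a bounded queue and drops a frame rather than waiting when the consumer falls behind. In a real setup the consumer would be a renderer or a separate machine.

```python
import queue
import threading

def simulate(steps, out_q):
    """Stand-in compute loop: offers one small snapshot per step."""
    state = 0
    for step in range(steps):
        state += step  # placeholder for the real computation
        try:
            out_q.put_nowait((step, state))  # never stall compute on the consumer
        except queue.Full:
            pass  # consumer is behind, so this frame is dropped
    out_q.put(None)  # sentinel; a blocking put so the consumer always sees it

def consume(in_q, frames):
    """Stand-in visualizer: records whichever snapshots it manages to receive."""
    while True:
        item = in_q.get()
        if item is None:
            break
        frames.append(item)

def run(steps=100, maxsize=8):
    q = queue.Queue(maxsize=maxsize)
    frames = []
    worker = threading.Thread(target=consume, args=(q, frames))
    worker.start()
    simulate(steps, q)
    worker.join()
    return frames

frames = run()
print(f"consumer saw {len(frames)} of 100 steps")
```

The dropped frames are the downside mentioned above: the computation never waits, but you only see whatever the consumer happened to catch.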