by aynyc 1875 days ago
Got ya. We are sticking with Parquet/ORC for now, but we are running into the scenario where the size of the data is going up while the query SLA is going down. At some point, we will need to look at other technology to reduce cost without sacrificing performance.

Yep. I may have been unclear: they work well together. We'll do a GPU Parquet reader that returns an Arrow dataframe, which our ETL pipeline then transforms into visual depictions of the correlations and relationships in people's datasets. Stuff on disk is in nice stable formats; stuff across our API boundaries and compute frameworks is Arrow.
Interesting design! How big is your data per scan?
It varies. A lot of our users look at, say, 50 KB files for quick, small, targeted visual sessions, but when doing something like a log dump analysis, we are working on TB files, and 1-2 GB per streaming part is good. CPU Arrow people like to do, say, 10 KB-1 MB per record batch, but GPU land is a lot faster when you think in terms of bandwidth, so 500 MB-10 GB per contiguous part, depending on GPU memory and working set size. Likewise, it depends on how compressed the data is, as you ultimately care how much it uncompresses into for the downstream memory pressure. Hope that makes sense!
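To make the sizing arithmetic above concrete, here is a rough sketch of how one might pick a part size from GPU memory, working set size, and compression ratio. Everything in it is an assumption for illustration: the function name `plan_parts`, the `working_set_factor` parameter (how many times a part's uncompressed size the downstream work is assumed to need), and the 10 GB cap taken from the upper end of the range mentioned above. It is not how any particular reader actually chooses its parts.

```python
GB = 10**9  # decimal units, to keep the arithmetic easy to follow


def plan_parts(file_bytes, gpu_mem_bytes, compression_ratio,
               working_set_factor=4, max_part_bytes=10 * GB):
    """Hypothetical helper: size streaming parts by their uncompressed footprint.

    compression_ratio is uncompressed size / on-disk size (assumed known).
    working_set_factor is an assumed multiplier for downstream memory pressure.
    Returns (uncompressed bytes per part, on-disk bytes per part, part count).
    """
    # Uncompressed bytes one part may occupy once the working set is accounted for.
    budget = int(gpu_mem_bytes / working_set_factor)
    # Cap at the upper end of the range quoted in the thread.
    uncompressed_part = min(max_part_bytes, budget)
    # What matters on disk is how much the part decompresses into.
    on_disk_part = max(1, int(uncompressed_part / compression_ratio))
    # Ceiling division: number of parts needed to cover the whole file.
    n_parts = -(-file_bytes // on_disk_part)
    return uncompressed_part, on_disk_part, n_parts


if __name__ == "__main__":
    # 1 TB on disk, 16 GB GPU, data assumed to expand 4x when decompressed.
    print(plan_parts(10**12, 16 * GB, 4.0))
```

Under these assumed numbers, a 1 TB file compressed 4:1 on a 16 GB GPU comes out to 1000 parts of about 1 GB on disk, each expanding to about 4 GB, which lands in the 1-2 GB per streaming part range mentioned above. A 50 KB file fits in a single part.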
You run TB files against GPUs? Hmm... that's something I've never thought of. Interesting. Any idea where I can read more about it?
rapids.ai
Should have added: Graphistry talk @ https://pavilion.io/nvidia