| HN Mirror



	by aynyc 1875 days ago
	Got ya. We're sticking with Parquet/ORC for now, but we're running into a scenario where the data size keeps going up while the query SLA keeps getting tighter. At some point we'll need to look at other technology to reduce cost without sacrificing performance.

1 comment

lmeyerov 1875 days ago

Yep. I may have been unclear: they work well together. We'll do a GPU Parquet reader that returns an Arrow dataframe, which our ETL pipeline then transforms into visual depictions of the correlations+relationships in people's datasets. Stuff on disk is in nice stable formats; stuff across our API boundaries & compute frameworks is Arrow.


aynyc 1875 days ago

Interesting design! How big is your data per scan?


lmeyerov 1875 days ago

It varies. A lot of our users look at, say, 50 KB files for quick, small, targeted visual sessions, but when doing something like a log dump analysis we're working on TB-scale files, and 1-2 GB per streaming part is good. CPU Arrow people like to do, say, 10 KB-1 MB per record batch, but GPU land is a lot faster when you think in terms of bandwidth, so 500 MB-10 GB per contiguous part, depending on GPU memory and working set size. Likewise, it depends on how compressed the data is, since what you ultimately care about is how much it uncompresses into, for the downstream memory pressure. Hope that makes sense!
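
The sizing rule of thumb above can be sketched roughly as follows. This is only an illustration, not Graphistry's or RAPIDS' actual logic: the `plan_parts` function, the `PartPlan` type, the `working_set_fraction` default of 0.25, and the decimal byte units are all assumptions made for the example. The only numbers taken from the comment are the 500 MB-10 GB range per contiguous part and the point that parts should be sized by what they uncompress into.

```python
from typing import NamedTuple

# Decimal units, chosen for readability (an assumption, not from the thread).
MB = 10**6
GB = 10**9
TB = 10**12


class PartPlan(NamedTuple):
    uncompressed_part_bytes: int  # what one part expands to in GPU memory
    on_disk_part_bytes: int       # how much compressed input to read per part
    num_parts: int                # parts needed to cover the whole file


def plan_parts(total_on_disk_bytes, compression_ratio, gpu_memory_bytes,
               working_set_fraction=0.25):
    """Hypothetical helper: pick a per-part read size for GPU processing.

    compression_ratio is uncompressed size / on-disk size (e.g. 4.0).
    working_set_fraction is the share of GPU memory one part may occupy;
    0.25 is an arbitrary illustrative default.
    """
    if total_on_disk_bytes <= 0 or compression_ratio <= 0 or gpu_memory_bytes <= 0:
        raise ValueError("sizes and compression ratio must be positive")
    # Budget a slice of GPU memory per part; the rest is left for intermediates.
    budget = int(gpu_memory_bytes * working_set_fraction)
    # Clamp to the 500 MB - 10 GB range mentioned in the comment above.
    # Note: on a very small GPU the 500 MB floor can exceed the budget.
    uncompressed = max(500 * MB, min(budget, 10 * GB))
    # Size the read by what it expands to, not by what it occupies on disk.
    on_disk = max(1, int(uncompressed / compression_ratio))
    num_parts = -(-total_on_disk_bytes // on_disk)  # ceiling division
    return PartPlan(uncompressed, on_disk, num_parts)


# Assumed inputs: a 1 TB compressed log dump, 4x compression, a 16 GB GPU.
plan = plan_parts(1 * TB, compression_ratio=4.0, gpu_memory_bytes=16 * GB)
print(plan)
```

With those assumed inputs, each part reads about 1 GB from disk and expands to 4 GB on the GPU, for 1000 parts in total, which lands in the 1-2 GB per streaming part range mentioned above.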


aynyc 1875 days ago

You run TB files against GPUs? Hmm... that's something I've never thought of. Interesting, any idea where I can read up on this?


lmeyerov 1875 days ago

rapids.ai


lmeyerov 1873 days ago

Should have added: Graphistry talk @ https://pavilion.io/nvidia
