
by karbarcca 2081 days ago

It can make a huge difference in a production system. Where I work, we process terabytes of CSV data every day, and saving minutes per file adds up to enormous savings in CPU cost and time for a system running 24/7.

I agree that for a data scientist doing exploratory analysis locally on their computer, it doesn't make nearly as much of a difference (also because they're usually not working with very large files).

The performance work in the CSV.jl package (that the article is about) was very much geared towards these kinds of production scenarios.
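
A back-of-envelope sketch of how per-file savings compound in a pipeline that runs around the clock. The file count and minutes saved below are hypothetical placeholders, not figures from this comment, and the helper name is made up for illustration.

```python
def cpu_hours_saved_per_day(files_per_day: int, minutes_saved_per_file: float) -> float:
    """Convert a per-file saving in minutes into CPU-hours saved per day."""
    return files_per_day * minutes_saved_per_file / 60.0


# Hypothetical workload: both numbers are assumptions for illustration only.
daily = cpu_hours_saved_per_day(files_per_day=5_000, minutes_saved_per_file=2.0)
print(f"{daily:.1f} CPU-hours/day, {daily * 365:.0f} CPU-hours/year")
```

The saving scales linearly with file count, so a parser speedup that is barely noticeable on one local file can still dominate the bill at production volume.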


mcrad 2081 days ago

Right - sounds like you have more of a production support role than a data analysis workflow kind of task. Tacking on "exploratory" is helpful but I'm still concerned that you misuse the overall concept of analysis. It's a decision-making task, which is practically the opposite of production support.


ChrisRackauckas 2080 days ago

> Tacking on "exploratory" is helpful but I'm still concerned that you misuse the overall concept of analysis. It's a decision-making task, which is practically the opposite of production support.

Why should the exploratory and production teams be using completely different tools? That seems like it would cause friction and open gaps where translation errors creep in. I would venture to say that having the exploratory and production teams work from the same code base is a very strong productivity gain, and we've seen this hold true at many companies.


mcrad 2070 days ago

One is R&D, and one is operations. Obviously there are situations where it makes sense to combine them, but most successful companies with any real technology to sell will not treat R&D this way.


mr_toad 2081 days ago

> process terabytes of csv data every day

All stored on NVMe SSDs? Because unless you have really fast IO, the CSV parser isn’t going to be the bottleneck.
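
One way to check this claim for a given setup is to measure parser throughput in isolation and compare it with the storage or network bandwidth actually available. Below is a minimal sketch using Python's standard `csv` module on synthetic in-memory data. It is not CSV.jl, the column layout is made up, and it times tokenization only (no type conversion), so treat the result as a rough indication rather than a benchmark.

```python
import csv
import io
import time


def csv_parse_throughput_mb_s(n_rows: int = 200_000) -> tuple[int, float]:
    """Parse a synthetic in-memory CSV and return (rows_parsed, MB per second)."""
    # Build the data in memory so disk speed does not affect the measurement.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "name", "value"])  # hypothetical columns
    for i in range(n_rows):
        writer.writerow([i, f"item{i}", i * 0.5])
    text = buf.getvalue()
    size_mb = len(text.encode("utf-8")) / 1e6

    # Time only the parsing pass over the already-loaded text.
    start = time.perf_counter()
    rows = sum(1 for _ in csv.reader(io.StringIO(text)))
    elapsed = time.perf_counter() - start
    return rows, size_mb / elapsed


rows, mb_s = csv_parse_throughput_mb_s()
print(f"parsed {rows} rows at {mb_s:.0f} MB/s")
```

If the measured rate is well below what the disk or network can deliver, the parser is the limiting step. If it is well above, IO is.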


andi999 2081 days ago

I am curious: how do you transfer this much data quickly?


karbarcca 2081 days ago

Unfortunately, almost exclusively via HTTP REST APIs. It's not great, but it's the lowest common denominator across the vast "ingestion" service we've built (connectors to web APIs, a local application for local file upload, raw API endpoints, etc.).

We've started exploring the Apache Arrow format as a compressible binary format with a dedicated wire format, just to cut down on parsing costs.
