| HN Mirror


by mattewong 1204 days ago

Yes, that is exactly my point. You cannot start threads at 0/25/50/75 if your data is in CSV format. But what I am saying is that, even if you could do that, the performance difference would be negligible compared to using a single thread that parses the CSV into rows and passes chunks of rows to 4 separate threads.

In fact, the single-thread parser approach (with multi-thread processing) might even be better, because it is not trying to access your hard disk in 4 places at the same time. Then again, if your threads are doing some non-trivial task with each row, then IO will not be your bottleneck either way.

Obviously, this starts to break down if you aren't reading the whole file and want to start some meaningful portion of the way in without ever processing what comes before it. The point is that the benefit of being able to implicitly shard a file, without saving it as separate files, might not be as impactful in practice as in theory.
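
To make the single-parser idea concrete, here is a rough Python sketch of what I mean, not a tuned implementation. One thread runs the CSV parser and hands chunks of already-parsed rows to 4 worker threads. The names `chunks`, `process_chunk`, and `parse_and_dispatch` are made up for illustration, and the per-row work (summing the second column) is just a stand-in. Note that in CPython, threads only help if the per-row work releases the GIL, so treat this as a picture of the structure rather than a benchmark.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunks(rows, size):
    """Yield lists of up to `size` rows from an iterator of parsed rows."""
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    # Hypothetical per-row work: sum the integer in the second column.
    return sum(int(row[1]) for row in chunk)


def parse_and_dispatch(fileobj, chunk_size=2, workers=4):
    # Only the calling thread touches the file and runs the CSV parser,
    # so quoted fields containing newlines are handled correctly.
    reader = csv.reader(fileobj)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map pulls chunks from the generator in the calling
        # thread, submits each to the pool, and yields results in order.
        return list(pool.map(process_chunk, chunks(reader, chunk_size)))


if __name__ == "__main__":
    # The first record contains a newline inside a quoted field, which is
    # why you cannot simply seek to a byte offset and start parsing there.
    data = io.StringIO('"a\nb",1\nc,2\nd,3\ne,4\nf,5\n')
    print(parse_and_dispatch(data))  # [3, 7, 5]
```

This version queues every chunk up front, so for a very large file you would want to bound the number of in-flight chunks; I left that out to keep the sketch short.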


hermitcrab 1204 days ago

>Yes, that is exactly my point. You cannot start threads at 0/25/50/75 if your data is in CSV format.

My mistake, I misread your answer!
