by TheChaplain 2524 days ago
I've worked with gigabyte-sized CSV files without issues myself, so the problem was likely your implementation rather than the Java language.

Googles "parse csv Java"

Copies and pastes the top answer, which buffers the whole file into the heap before parsing

Blames Java/hardware when this doesn't scale

Bad string allocations can easily swamp the JVM: a loop that appends a single character to a string allocates a new string on every iteration, and other naive approaches turn what should be a small reused buffer into gigabytes of heap.
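To make the allocation pattern concrete, here is a minimal sketch contrasting the two approaches. The class and method names are made up for illustration. The naive version builds a fresh `String` on every pass through the loop, while the second reuses one `StringBuilder` buffer and allocates the final `String` once.

```java
public class StringAppendSketch {
    // Naive: each += creates a new String and copies every character
    // accumulated so far, so n iterations produce n short-lived Strings.
    static String repeatNaive(char c, int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s += c;
        }
        return s;
    }

    // Reuses one growable buffer, pre-sized to n characters;
    // only toString() allocates the resulting String.
    static String repeatWithBuilder(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Both return the same text; the difference is how much garbage the loop leaves behind for the collector.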

The JVM does a remarkable job of just-good-enough, don't-worry-about-GC memory management, but anyone who is paid to program needs to understand the rudimentary aspects of GC, or their code will go sideways fast in production.

<quote>Bad string allocations can easily swamp the JVM</quote>

Like String.split() and whatever BufferedReader does, I guess. It's not like I was doing byte-by-byte manipulation. The GC just triggered too often.

Did you do it in 128 megs of RAM?
Depending on what needs to be done with the CSV files, it's very possible to do it in 128MB of RAM. For example, if we need to read the rows, transform them a bit, and then write to another file, we can read up to N rows, transform them, and then write them. That should result in bounded memory consumption because only up to N rows need to be kept at any time. Similar strategies work if the rows feed an ETL job, are sent to a Web service as they are parsed, etc.
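The read-N, transform, write loop can be sketched roughly like this. `BatchedCsvCopy`, `copyInBatches`, and the `transform` parameter are hypothetical names, and the sketch assumes one row per line (no quoted fields containing newlines) and a positive `batchSize`, which a real CSV parser would not need to assume.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class BatchedCsvCopy {
    // Streams rows from in to out, holding at most batchSize rows in memory.
    // Returns the number of rows written.
    static long copyInBatches(Reader in, Writer out, int batchSize,
                              UnaryOperator<String> transform) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        List<String> batch = new ArrayList<>(batchSize);
        long rows = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(transform.apply(line));
            if (batch.size() == batchSize) {
                rows += writeBatch(batch, writer);
            }
        }
        rows += writeBatch(batch, writer); // leftover rows from the last partial batch
        writer.flush();
        return rows;
    }

    // Writes the batch, then clears it so the same list is reused.
    private static int writeBatch(List<String> batch, BufferedWriter writer)
            throws IOException {
        for (String row : batch) {
            writer.write(row);
            writer.write('\n');
        }
        int written = batch.size();
        batch.clear();
        return written;
    }
}
```

Passing a `FileReader` and `FileWriter` (or the `Files.newBufferedReader` / `Files.newBufferedWriter` equivalents) gives the file-to-file case; heap use then depends on `batchSize` and row length, not on the size of the file.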

Editing a file gets trickier, though it's not impossible. Maybe an editor using a [piece table](https://en.wikipedia.org/wiki/Piece_table) plus some smart buffering could keep memory consumption below some constant, letting it work on large files, with the downside of lower performance for files larger than that constant?
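For anyone unfamiliar with the idea, here is a toy piece table supporting only insertion, with hypothetical names throughout. The original text is never modified; inserted text goes into an append-only buffer, and the document is a list of pieces pointing into one buffer or the other. This sketch keeps the original in memory as a `CharSequence`; the bounded-memory variant would need it backed by the file on disk, which is assumed here rather than shown.

```java
import java.util.ArrayList;
import java.util.List;

public class PieceTableSketch {
    // A piece is a run of characters in either the original or the add buffer.
    private static final class Piece {
        final boolean fromAdd;
        final int start;
        final int length;

        Piece(boolean fromAdd, int start, int length) {
            this.fromAdd = fromAdd;
            this.start = start;
            this.length = length;
        }
    }

    private final CharSequence original;                   // read-only source text
    private final StringBuilder added = new StringBuilder(); // append-only inserts
    private final List<Piece> pieces = new ArrayList<>();

    PieceTableSketch(CharSequence original) {
        this.original = original;
        if (original.length() > 0) {
            pieces.add(new Piece(false, 0, original.length()));
        }
    }

    // Inserts text at a document offset by splitting the piece that covers it.
    void insert(int offset, String text) {
        if (offset < 0) throw new IndexOutOfBoundsException("offset " + offset);
        if (text.isEmpty()) return;
        Piece inserted = new Piece(true, added.length(), text.length());
        added.append(text);
        int pos = 0; // document offset where the current piece begins
        for (int i = 0; i < pieces.size(); i++) {
            Piece p = pieces.get(i);
            if (offset < pos + p.length) {
                int split = offset - pos;
                if (split == 0) {
                    pieces.add(i, inserted);
                } else {
                    pieces.set(i, new Piece(p.fromAdd, p.start, split));
                    pieces.add(i + 1, inserted);
                    pieces.add(i + 2,
                        new Piece(p.fromAdd, p.start + split, p.length - split));
                }
                return;
            }
            pos += p.length;
        }
        if (offset != pos) throw new IndexOutOfBoundsException("offset " + offset);
        pieces.add(inserted); // insertion at the very end
    }

    // Materializes the whole document; a real editor would render only a window.
    String text() {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            CharSequence src = p.fromAdd ? added : original;
            sb.append(src, p.start, p.start + p.length);
        }
        return sb.toString();
    }
}
```

Each insert costs one small `Piece` plus the inserted characters, regardless of how large the original is, which is what makes the approach attractive for big files.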