by nottorp 2524 days ago
Java is... java.

I was once working on an Android app on a cheap custom board with 128 M ram (don't ask why Android on a single function custom board, wasn't my decision).

Among other things, I had to parse an 80000-line CSV file. Splitting and the rest of the processing created so many temporary strings the system ran out of RAM. We eventually gave up.

6 comments

Myself I've worked with gigabyte sized CSV files without issues, so it was likely your implementation rather than the fault of the Java language.
Googles "parse csv Java"

Copies and pastes top answer that buffers the whole thing into the heap before parsing

Blames Java/hardware when this doesn't scale

Bad string allocations can easily swamp the JVM, with things like loops adding a single character to a string with a string allocation each loop, and other naive approaches turning what should be a small reused buffer into gigabytes of heap.
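
To illustrate the loop pattern described above, here is a minimal sketch (the class and method names are made up for this example). `joinNaive` allocates a fresh `String` on every iteration and copies everything accumulated so far, while `joinWithBuilder` appends into one growing buffer and creates a single `String` at the end.

```java
final class JoinDemo {
    // Naive: each += builds a new String and copies all previous characters,
    // so the loop produces one piece of garbage per iteration.
    static String joinNaive(String[] parts) {
        String out = "";
        for (String p : parts) {
            out += p;
        }
        return out;
    }

    // One reusable buffer; only the final toString() allocates the result.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }
}
```

Both methods return the same text; the difference is only in how many intermediate objects the garbage collector has to clean up.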

The JVM does a remarkable job of just-good-enough don't-worry-about-GC, but anyone who is a paid programmer needs to understand rudimentary aspects of GC, or your stuff will go sideways fast in production.

> Bad string allocations can easily swamp the JVM

Like String.split() and whatever BufferedReader does, I guess. It's not like I was doing byte by byte manipulation. The GC just triggered too often.

Did you do it in 128 megs of RAM?
Depending on what needs to be done with the CSV files, it's very possible to do it in 128MB of RAM. For example, if we need to read the rows, transform them a bit, and then write to another file, we can read up to N rows, transform them, and then write them. That should result in bounded memory consumption because only up to N rows need to be kept. Similar strategies are possible if the rows are used as input to an ETL job, calling a Web service with the results of parsing the file, etc.
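
A rough sketch of the read-N-rows, transform, write loop described above. The class name, the `transform` step (upper-casing the row) and the batch size are placeholders, not anything from the original app; the point is only that at most `batchSize` rows are held at once.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BatchCsvCopy {
    // Hypothetical per-row transformation; replace with the real work.
    static String transform(String row) {
        return row.toUpperCase(Locale.ROOT);
    }

    // Copies rows from in to out, keeping at most batchSize rows in memory.
    // Returns the number of rows written.
    static long copyInBatches(Reader in, Writer out, int batchSize) throws IOException {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        List<String> batch = new ArrayList<>(batchSize);
        long rows = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(line);
            if (batch.size() == batchSize) {
                rows += writeBatch(batch, writer);
            }
        }
        rows += writeBatch(batch, writer); // leftover rows of the last, partial batch
        writer.flush();
        return rows;
    }

    // Transforms and writes every row in the batch, then empties it for reuse.
    private static int writeBatch(List<String> batch, BufferedWriter writer) throws IOException {
        int n = batch.size();
        for (String row : batch) {
            writer.write(transform(row));
            writer.write('\n');
        }
        batch.clear();
        return n;
    }
}
```

This still allocates one `String` per row, so it bounds the live heap rather than the garbage rate; on a very small heap the GC may still run often.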

Editing a file gets trickier, though it's not impossible. Maybe, using a [piece table](https://en.wikipedia.org/wiki/Piece_table) plus some smart buffering, the editor can keep memory consumption below some constant, letting it function for large files, but with the downside of lower performance for files larger than whatever the constant is?

> Among other things, I had to parse a 80000 line csv file. Splitting and the rest of the processing created so many temporary strings the system ran out of ram. We eventually gave up

Let us all hope that Project Valhalla, the effort to add value types to Java, comes to a swift completion. It would be very helpful in these sorts of scenarios.

Though at this point I'd wonder (as you did) why I'm using Java in the first place given the resource constraints of the system, I've had some success using byte[] arrays and using `yourInputStreamHere.read(someBuffer, yourOffset, YOUR_PAGE_SIZE)` in a few less constrained but relatively more performance critical areas.

Depending on your exact needs, this can sometimes result in more or less constant memory consumption; it's possible to reuse the array on each read. You can also take it a step further by finding the indexes where you would split the array (e.g. the index of the next comma for a CSV) and using methods that operate on an array plus start and end indexes. That sort of strategy can allow you to avoid additional copies of the array, at the cost of somewhat annoying and less maintainable code that is definitely not Java-esque.
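
Here is one way the reused-buffer, index-based approach could look. It is only a sketch under loud assumptions: every field is an unsigned ASCII decimal integer, records end with `\n`, there is no quoting, and no record is longer than the buffer. The class and method names are invented for the example, and summing the fields stands in for whatever the real processing would be.

```java
import java.io.IOException;
import java.io.InputStream;

final class ByteCsvSum {
    // Parses an unsigned decimal integer from buf[start, end) without creating a String.
    static long parseLong(byte[] buf, int start, int end) {
        long value = 0;
        for (int i = start; i < end; i++) {
            value = value * 10 + (buf[i] - '0');
        }
        return value;
    }

    // Sums all comma-separated fields of one record stored in buf[start, end).
    static long sumFields(byte[] buf, int start, int end) {
        long sum = 0;
        int fieldStart = start;
        for (int i = start; i <= end; i++) {
            if (i == end || buf[i] == ',') {
                sum += parseLong(buf, fieldStart, i);
                fieldStart = i + 1;
            }
        }
        return sum;
    }

    // Streams the input through a single reused byte[] and sums every field.
    static long sumAll(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        int filled = 0; // number of valid bytes currently in buf
        long total = 0;
        int n;
        while ((n = in.read(buf, filled, buf.length - filled)) > 0) {
            filled += n;
            int lineStart = 0;
            for (int i = 0; i < filled; i++) {
                if (buf[i] == '\n') {
                    total += sumFields(buf, lineStart, i);
                    lineStart = i + 1;
                }
            }
            // Move the incomplete trailing record to the front and keep reading.
            filled -= lineStart;
            System.arraycopy(buf, lineStart, buf, 0, filled);
            if (filled == buf.length) {
                throw new IOException("record longer than buffer");
            }
        }
        if (filled > 0) {
            total += sumFields(buf, 0, filled); // last record without a trailing newline
        }
        return total;
    }
}
```

No per-line or per-field objects are created here, which is where the savings come from; the cost, as noted above, is code that is fiddlier than `readLine()` plus `split()`.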

There are also less severe solutions that won't buffer the whole file in memory, using StringBuilder or other techniques, etc. depending on what you're doing.

This all might be for naught though; I'm assuming the records have to be parsed into some structure, so tricks like the above might not be good enough given Java's seemingly insatiable hunger for heap space.

While I'm not the language's greatest fan, this is one area where Go has been able to shine. Having value types as a possibility makes solving this sort of problem a typically less expensive proposition.

I'm a little confused. Trying to parse an entire file in memory instead of streaming it is one of the mistakes I might make in my first year or two of development. Parsing 80k lines of CSV in Java is pretty easy as long as you write your code to be efficient and release memory line by line.
Interesting comments so let's have some answers:

1. 128 M ram total memory, 20-30 M free at best. CPU was slow and we were running off a SD card. Not exactly an enterprise monster.

2. Of course I was parsing it line by line. But running OOM triggered the garbage collector so many times that it took like 3-4 minutes to finish. I didn't mean that the system actually stopped because of running out of ram.

3. Why didn't you do it with the NDK/some other solution? Because the PM just gave up on the feature when it wasn't done in a day. It was just real time search completion which we didn't really need there, it was just "nice to have".

I’m parsing much larger datasets than those on android, in less than 18M RAM usage.

You just have to use NIO and use direct buffers instead of naively reading and allocating strings.
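
For readers who haven't used them, this is roughly what reading through a direct buffer looks like. It is not the poster's actual code, just a minimal sketch with made-up names that counts newline bytes; a real parser would do its field handling where the counter is incremented.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

final class DirectBufferLineCount {
    // Counts '\n' bytes read from the channel using one off-heap buffer.
    static long countLines(ReadableByteChannel channel, int bufferSize) throws IOException {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must be at least 1");
        }
        // Allocated once, outside the Java heap, and reused for every read.
        ByteBuffer buf = ByteBuffer.allocateDirect(bufferSize);
        long lines = 0;
        while (channel.read(buf) != -1) {
            buf.flip(); // switch from filling the buffer to draining it
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    lines++;
                }
            }
            buf.clear(); // make the whole buffer available for the next read
        }
        return lines;
    }
}
```

For a file, the channel could come from `FileChannel.open(path, StandardOpenOption.READ)`. The loop assumes a blocking channel, so `read` does not keep returning 0.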

So what's the point of using Java then? Of course I can do it in a non GC language using a buffer that holds one line and as much memory as needed for the data I want to retain...
Not having to worry about memory for 99% of the time.

Having to do research to optimize I/O is not just a java thing. As stated before in this thread, the C stdlib isn't historically great either, and a lot of languages piggyback on that.

Modern I/O is almost always a different library and you need to do research.

Direct Buffers are a Java API to directly use C buffers and access them in a fast way. They're the fastest way to handle memory in Java.
Android gives Java a bad name with its Dalvik and ART partial implementations.
Why didn't you just write it using the NDK?