
by krnaveen14 1756 days ago

A few years back we were doing heavy file processing in Java, with most of the time being spent in file IO. The initial design was: receive a zip file, extract it to a directory, then read and process the extracted files one by one. If the zip file is 1GB and expands to 10GB when extracted, the amount of IO is significant: 1GB read -> 10GB write -> 10GB read, 21GB in total. Supposing our AWS instance type sustains 50MB/s, that is 21,000MB / 50MB/s = 420 sec at minimum spent on IO alone.

This limited our throughput, i.e. the number of files we could process before the next set of zip files arrived at its fixed interval. Since we pass through each file only once during processing, we had to eliminate the zip extraction and read the files inside the zip one by one as a decompressed byte stream. This was possible with zipInputStream.getNextEntry() and reading the bytes, but it meant a major refactoring and the inconvenience of dealing with byte[] instead of File everywhere.
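A minimal sketch of that streaming approach, not our actual code: the class and method names are made up for illustration, the byte counting stands in for the real processing, and the sample zip is built in memory only so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical names; only the java.util.zip calls are real API.
class ZipStreamSketch {

    // Walks the zip sequentially and returns entry name -> decompressed byte count.
    // Nothing is extracted to disk; bytes are decompressed as they are read.
    static Map<String, Long> entrySizes(InputStream zipped) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        byte[] buf = new byte[8192];
        try (ZipInputStream zin = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                long total = 0;
                int n;
                // read() returns -1 at the end of the current entry, not of the whole zip.
                while ((n = zin.read(buf)) != -1) {
                    total += n; // stand-in for the real per-file processing
                }
                sizes.put(entry.getName(), total);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }

    // Builds a tiny in-memory zip so the sketch runs without any input file.
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("a.txt"));
            zout.write("hello".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("dir/b.txt"));
            zout.write("hello world".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(entrySizes(new ByteArrayInputStream(sampleZip())));
    }
}
```

The inconvenience shows up right here: the processing code receives raw bytes from one shared stream rather than a File it can open on its own.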

Then came the more advanced NIO and filesystem provider features (they were already available, but we only learned about their benefits then). All we had to do was replace File with Path and new FileInputStream() with Files.newInputStream(). For zip decompression, we replaced ZipInputStream and ZipEntry with FileSystems.newFileSystem(). Almost instantly we reduced 21GB of IO to just 1GB, cutting the total processing time drastically.
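Roughly what that swap looks like, as a sketch under assumptions rather than our production code: the class name, countBytes, processZip and sampleZip are hypothetical, and counting bytes stands in for the real work. The point is that countBytes only sees a Path and does not care whether it lives on disk or inside a zip.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical names; the java.nio.file calls are real API.
class ZipFsReadSketch {

    // Per-file processing written against Path only (stand-in: count the bytes).
    static long countBytes(Path file) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    // Mounts the zip as a file system and processes every regular file inside it.
    static long processZip(Path zip) {
        // The (ClassLoader) cast selects the Java 7+ overload unambiguously on newer JDKs.
        try (FileSystem zipFs = FileSystems.newFileSystem(zip, (ClassLoader) null)) {
            List<Path> files;
            try (Stream<Path> walk = Files.walk(zipFs.getPath("/"))) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            long total = 0;
            for (Path file : files) {
                total += countBytes(file); // decompressed on the fly, no extraction to disk
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a tiny zip to a temp file so the sketch runs on its own.
    static Path sampleZip() {
        try {
            Path zip = Files.createTempFile("zipfs-read-sketch", ".zip");
            try (OutputStream out = Files.newOutputStream(zip);
                 ZipOutputStream zout = new ZipOutputStream(out)) {
                zout.putNextEntry(new ZipEntry("a.txt"));
                zout.write("hello".getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
                zout.putNextEntry(new ZipEntry("dir/b.txt"));
                zout.write("hello world".getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(processZip(sampleZip()));
    }
}
```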

Building on this, we applied a similar approach to zip file creation, where files are written directly into the zip as a compressed stream, again using only Path everywhere.
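The write side can be sketched the same way. This assumes the JDK's built-in zip file system provider with its "create" property; the class and method names (writeZip, readBack, freshZipPath) are hypothetical, and the target path must not already exist as an empty file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical names; the java.nio.file calls are real API.
class ZipFsWriteSketch {

    // Writes each (name -> bytes) pair straight into the zip through ordinary Path calls.
    static void writeZip(Path zip, Map<String, byte[]> contents) {
        URI uri = URI.create("jar:" + zip.toUri());
        // "create" asks the zip provider to create the archive if it does not exist yet.
        Map<String, String> env = Collections.singletonMap("create", "true");
        try (FileSystem zipFs = FileSystems.newFileSystem(uri, env)) {
            for (Map.Entry<String, byte[]> e : contents.entrySet()) {
                Path target = zipFs.getPath("/", e.getKey());
                if (target.getParent() != null) {
                    Files.createDirectories(target.getParent());
                }
                // Bytes written here go into the zip entry; no intermediate file on disk.
                try (OutputStream out = Files.newOutputStream(target)) {
                    out.write(e.getValue());
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Reads one entry back, to check the round trip.
    static byte[] readBack(Path zip, String name) {
        try (FileSystem zipFs = FileSystems.newFileSystem(zip, (ClassLoader) null)) {
            return Files.readAllBytes(zipFs.getPath("/", name));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Returns a not-yet-existing zip path inside a fresh temp directory.
    static Path freshZipPath() {
        try {
            return Files.createTempDirectory("zipfs-write-sketch").resolve("out.zip");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Path zip = freshZipPath();
        Map<String, byte[]> contents = new LinkedHashMap<>();
        contents.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        contents.put("dir/b.txt", "hello world".getBytes(StandardCharsets.UTF_8));
        writeZip(zip, contents);
        System.out.println(new String(readBack(zip, "dir/b.txt"), StandardCharsets.UTF_8));
    }
}
```

The archive is finalized when the FileSystem is closed, so the try-with-resources block matters here.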

Developers need a mental model not just of memory constraints but also of the volume of disk IO, where latency will instantly kill application performance once the page cache can no longer hold the files in memory.