by e12e 720 days ago
I'm currently working with 150MB worth of gzipped JSON - marshalling the full file from JSON to a Ruby hash eats up a lot of memory. One tweak that allows for easier lazy iteration over the file (while keeping temporary disk I/O reasonable) is to pipe it through zcat, then jq in stream mode to convert it to ndjson, then gzip again. The result is a temp file that Ruby's Zlib can wrap as a stream, which is convenient for lazy iteration one line at a time.

Generally marshalling a gig or more of JSON (non-lazily) takes a lot of resources in ruby.


Hmm. I don't typically mind throwing memory at a problem like that, but I can certainly see the issue.

Is lazy marshalling something that other languages handle better?

Some do, some don't. JSON is a special case, as a valid JSON file is a single top-level value (usually one array or object literal), so event-driven (SAX-style) parsing needs to be a hack (like jq's stream mode). In theory json_streamer or yajl should help, but I couldn't get a combination to return a proper lazy iterator.
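
To make the "single top-level value" point concrete, here is a minimal sketch using only Ruby's standard `json` library. The sample documents are made up for illustration. It shows that a prefix of a JSON array cannot be parsed on its own, while every line of an ndjson document is a complete JSON text.

```ruby
require 'json'

# Made-up sample data: the same three records as one array and as ndjson.
array_doc  = '[{"id":1},{"id":2},{"id":3}]'
ndjson_doc = %({"id":1}\n{"id":2}\n{"id":3}\n)

# A truncated prefix of the array is not valid JSON, so a plain parser
# has to see the closing bracket before it can return anything.
begin
  JSON.parse(array_doc[0, 10]) # the prefix is: [{"id":1},
  prefix_ok = true
rescue JSON::ParserError
  prefix_ok = false
end

# Each ndjson line is a full JSON document and parses independently.
first_record = JSON.parse(ndjson_doc.each_line.first)
```

This is why converting to ndjson first sidesteps the need for an event-driven parser: the line boundary does the record splitting.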

With the file as ndjson it was easier, if a little sparsely documented (Zlib::GzipReader.new or .wrap?):

    require 'json'
    require 'zlib'

    # some_ndfile is an open IO on the gzipped ndjson file
    gz = Zlib::GzipReader.wrap(some_ndfile)
    obs = gz.each_line.lazy.map do |line|
      JSON.parse line
    end.first(4)

When we can get a line at a time, marshalling the whole line isn't an issue.

My issue is more that it is tricky to nest ruby IO objects and return a lazy iterator - especially nesting custom filters along the way - at least more tricky than it should be.
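
As a rough sketch of the kind of nesting described here (not the poster's code), one option is to hide the IO objects inside an `Enumerator` and hand back only a lazy enumerator. The helper name `lazy_ndjson` and the file name in the usage comment are hypothetical. The sketch assumes the input is a gzipped ndjson file and uses only the standard `zlib` and `json` libraries.

```ruby
require 'json'
require 'zlib'

# Hypothetical helper: returns a lazy enumerator over the records in a
# gzipped ndjson file. The file is opened only when iteration starts, and
# the block form of GzipReader.open closes it when iteration ends,
# including when the consumer stops early (e.g. via first(n)).
def lazy_ndjson(path)
  Enumerator.new do |yielder|
    Zlib::GzipReader.open(path) do |gz|
      gz.each_line { |line| yielder << JSON.parse(line) }
    end
  end.lazy
end

# Usage (file name is an assumption): custom filters chain lazily, so only
# as many lines are decompressed and parsed as the consumer asks for.
#   lazy_ndjson('file.ndjson.gz')
#     .select { |rec| rec['ok'] }
#     .map    { |rec| rec['id'] }
#     .first(4)
```

The design choice is to keep resource handling inside the enumerator block, so callers only ever see an `Enumerator::Lazy` and cannot forget to close the underlying file.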

Apparently there's a third-party framework that does seem promising:

https://iostreams.rocketjob.io/tutorial

Or manual lifting:

https://dev.to/bajena/streaming-gzipped-csv-files-from-ftp-i...

Or:

https://medium.com/smartly-io/streaming-data-with-ruby-enume...

https://github.com/lautis/piperator

I think something more like this should probably be built in, and readily available (for gzip, http, files etc). Maybe I'm greedy.

Btw the shell pipeline to convert a file would be something like this, and is fully streaming:

    # gzipped JSON to gzipped ndjson, stripping top level array
    # (gzcat is the macOS/BSD name; on most Linux systems use zcat):
    gzcat file.json.gz \
      | jq -cn --stream 'fromstream(inputs | (.[0] |= .[1:]) | select(. != [[]]))' \
      | gzip -9 \
      > file.ndjson.gzip
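
For checking the pipeline's output on a small sample, the same transformation can be sketched in plain Ruby. Note that this version is deliberately NOT streaming: it parses the whole top-level array in memory, which is exactly what the pipeline avoids, so it is only suitable for small test files. The helper name `json_gz_to_ndjson_gz` is hypothetical.

```ruby
require 'json'
require 'zlib'

# Hypothetical reference helper, NOT streaming: loads the entire top-level
# array into memory, then writes one compact JSON document per line.
# Returns the number of records written.
def json_gz_to_ndjson_gz(src, dst)
  records = nil
  Zlib::GzipReader.open(src) { |gz| records = JSON.parse(gz.read) }
  raise ArgumentError, 'expected a top-level JSON array' unless records.is_a?(Array)

  # BEST_COMPRESSION corresponds to gzip -9 in the shell pipeline.
  Zlib::GzipWriter.open(dst, Zlib::BEST_COMPRESSION) do |gz|
    records.each { |record| gz.puts(JSON.generate(record)) }
  end
  records.size
end
```

Running both on the same small input and comparing the decompressed lines is a cheap way to confirm the jq filter does what is expected before trusting it on the large file.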