by e12e 720 days ago
Some do, some don't. JSON is a special case: a valid JSON file is a single top-level value, in practice usually one array or object literal, so event-driven (SAX-style) parsing ends up being a hack (like jq's stream mode). In theory json_streamer or yajl should help, but I couldn't get any combination to return a proper lazy iterator.

With the file as ndjson it was easier, if a little sparsely documented (Zlib::GzipReader.new or .wrap?):

    require 'zlib'
    require 'json'

    # some_ndfile: an open IO on the gzipped ndjson file
    gz = Zlib::GzipReader.wrap(some_ndfile)
    obs = gz.each_line.lazy.map do |line|
      JSON.parse(line)
    end.first(4)

When we can get a line at a time, marshalling the whole line isn't an issue.

My issue is more that it is tricky to nest Ruby IO objects and return a lazy iterator - especially when nesting custom filters along the way - at least trickier than it should be.
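
To make the nesting concrete, here is a rough sketch of what I mean, using only the standard library. The helper names (lazy_ndjson, only_with_key) and the sample records are made up for illustration, and the gzipped input is built in memory just so the snippet runs on its own; a real file or socket IO should slot in the same way, but I haven't tested that here.

```ruby
require 'zlib'
require 'json'
require 'stringio'

# Hypothetical helper: takes any IO-like object holding gzipped ndjson and
# returns a lazy enumerator of parsed objects. Nothing is read until consumed.
def lazy_ndjson(io)
  Zlib::GzipReader.new(io).each_line.lazy.map { |line| JSON.parse(line) }
end

# Hypothetical custom filter: just another lazy stage layered on top.
def only_with_key(records, key)
  records.select { |rec| rec.key?(key) }
end

# Build a small gzipped ndjson payload in memory (sample data, not from a real file).
raw = StringIO.new
gz = Zlib::GzipWriter.new(raw)
gz.puts('{"id": 1, "name": "a"}')
gz.puts('{"id": 2}')
gz.puts('{"id": 3, "name": "c"}')
gz.finish # write the gzip trailer without closing the StringIO
raw.rewind

records = only_with_key(lazy_ndjson(raw), 'name')
first_two = records.first(2) # only now are lines decompressed and parsed
p first_two
```

Each stage is just an Enumerator::Lazy wrapping the previous one, so filters compose by plain method calls. What this doesn't solve is the IO side: every new source (http, ftp, another compression format) still needs its own glue to get to something that responds to each_line.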

Apparently there's a third-party framework that does seem promising:

https://iostreams.rocketjob.io/tutorial

Or manual lifting:

https://dev.to/bajena/streaming-gzipped-csv-files-from-ftp-i...

Or:

https://medium.com/smartly-io/streaming-data-with-ruby-enume...

https://github.com/lautis/piperator

I think something more like this should probably be built in, and readily available (for gzip, http, files etc). Maybe I'm greedy.

Btw the shell pipeline to convert a file would be something like this, and is fully streaming:

    # gzipped JSON to gzipped ndjson, stripping top level array:
    gzcat file.json.gz \
      | jq -cn --stream 'fromstream(inputs | (.[0] |= .[1:]) | select(. != [[]]))' \
      | gzip -9 \
      > file.ndjson.gzip