by copirate 720 days ago
You can build pipelines with the `pipeline*` methods of open3, which is part of the stdlib. For example:

    require "open3"
    last_stdout, wait_threads = Open3.pipeline_r("cat /etc/passwd", ["grep", "root"])
    last_stdout.read # => "root:x:0:0::/root:/bin/bash\n"
    wait_threads.map(&:value).map(&:success?) # => [true, true]
https://ruby-doc.org/3.2.2/stdlibs/open3/Open3.html

Can you easily chain these, though? (gzcat some.txt|grep foo|sort -u|head -10 etc?). Especially lazily, if the uncompressed stream is of modest size, like a couple of gigabytes?
Of course, you can easily chain many commands in the pipeline:

    last_stdout, wait_threads = Open3.pipeline_r(
      ["gzcat", "some.txt"],
      ["grep", "foo"],
      ["sort", "-u"],
      ["head", "-10"],
    )
I'm not sure what you mean by lazily here, but internally[0] it creates real anonymous pipes[1] between the spawned processes, so the data does not go through the ruby process at all.

[0] https://github.com/ruby/open3/blob/b8909222051b4103a19eba195...

[1] https://en.wikipedia.org/wiki/Anonymous_pipe
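If "lazily" means consuming the final output incrementally from Ruby, here is a minimal sketch using the block form of `pipeline_r`. It assumes `printf`, `grep`, and `sort` are on the PATH; the helper name `first_lines` and the sample input are made up for illustration:

```ruby
require "open3"

# Hypothetical helper: run the commands as one pipeline and return at most
# `count` lines of the last command's stdout, reading one line at a time.
def first_lines(commands, count)
  Open3.pipeline_r(*commands) do |last_stdout, _wait_threads|
    # each_line pulls data on demand; first(count) stops reading early.
    last_stdout.each_line.lazy.map(&:chomp).first(count)
  end
  # The block form closes last_stdout and joins the wait threads on exit.
end

# Made-up stand-in for a large input stream.
commands = [
  ["printf", "foo 2\nbar\nfoo 1\nfoo 2\n"],
  ["grep", "foo"],
  ["sort", "-u"],
]

lines = first_lines(commands, 1)
p lines # => ["foo 1"]
```

Only the lines you actually ask for are read into the Ruby process; everything upstream stays in the kernel pipes between the child processes.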

I'd suspect you could do that with Open3, but if you are, why not just read the file and process it with Ruby instead?
I'm currently working with 150MB worth of gzipped JSON, and marshalling the full file from JSON to a Ruby hash eats up a lot of memory. One tweak that allows for easier lazy iteration over the file (while keeping temporary disk I/O reasonable) is to pipe it through zcat, then jq in stream mode to convert to ndjson, then gzip again. That gives a temp file that Ruby's zlib can wrap as a stream that is convenient for lazy, line-by-line iteration.

Generally, marshalling a gig or more of JSON (non-lazily) takes a lot of resources in Ruby.

Hmm. I don't typically mind throwing memory at a problem like that, but I can certainly see the issue.

Is lazy marshalling something that other languages handle better?

Some do, some don't. JSON is a special case, as a valid JSON file is a single top-level value (typically one array or object literal), so event-driven (SAX-style) parsing needs to be a hack (like jq's stream mode). In theory json_streamer or yajl should help, but I couldn't get a combination to return a proper lazy iterator.

With the file as ndjson it was easier, if a little sparsely documented (Zlib::GzipReader.new or .wrap?):

    require "json"
    require "zlib"

    # some_ndfile is an already-open IO on the gzipped ndjson file
    gz = Zlib::GzipReader.wrap(some_ndfile)
    obs = gz.each_line.lazy.map do |line|
      JSON.parse(line)
    end.first(4)
When we can get a line at a time, marshalling each line isn't an issue.

My issue is more that it is tricky to nest ruby IO objects and return a lazy iterator - especially nesting custom filters along the way - at least more tricky than it should be.
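For what it's worth, one way to hand-roll this with only the stdlib is to hide the nested IO objects inside an `Enumerator` and return its lazy form. This is only a sketch: the helper name `each_record`, the temp file, and the sample records are all made up for illustration.

```ruby
require "json"
require "tmpdir"
require "zlib"

# Hypothetical helper: returns a lazy enumerator of parsed records from a
# gzipped ndjson file. The file is opened when iteration starts and closed
# by GzipReader.open's block form when iteration ends or is cut short.
def each_record(path)
  Enumerator.new do |yielder|
    Zlib::GzipReader.open(path) do |gz|
      gz.each_line { |line| yielder << JSON.parse(line) }
    end
  end.lazy
end

# Made-up sample data: ten small records, one JSON object per line.
path = File.join(Dir.mktmpdir, "sample.ndjson.gz")
Zlib::GzipWriter.open(path) do |gz|
  10.times { |i| gz.puts JSON.generate({ "id" => i, "even" => i.even? }) }
end

# Custom filters are ordinary lazy select/map steps chained on the result.
ids = each_record(path)
  .select { |rec| rec["even"] }
  .map { |rec| rec["id"] }
  .first(3)
p ids # => [0, 2, 4]
```

Only as many lines as needed to produce three matches are decompressed and parsed. It works, but it is still more ceremony than composing streams ought to need.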

Apparently there's a third-party framework that seems promising:

https://iostreams.rocketjob.io/tutorial

Or manual lifting:

https://dev.to/bajena/streaming-gzipped-csv-files-from-ftp-i...

Or:

https://medium.com/smartly-io/streaming-data-with-ruby-enume...

https://github.com/lautis/piperator

I think something more like this should probably be built in, and readily available (for gzip, http, files etc). Maybe I'm greedy.

Btw the shell pipeline to convert a file would be something like this, and is fully streaming:

    # gzipped JSON to gzipped ndjson, stripping top level array:
    gzcat file.json.gz \
      | jq -cn --stream 'fromstream(inputs|(.[0]  |= .[1:]) | select(. != [[]]) )' \
      | gzip -9 \
      > file.ndjson.gzip
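The same kind of pipeline could also be driven from Ruby with `Open3.pipeline` and an `:out` redirect. A rough sketch, where `pipeline_to_file` is a hypothetical helper name and the demo commands are simple stand-ins (assuming `printf` and `sort` are on the PATH) so that it runs without jq installed:

```ruby
require "open3"
require "tmpdir"

# Hypothetical helper: run the commands as one pipeline and write the last
# command's stdout to out_path. Raises if any stage exits non-zero.
def pipeline_to_file(out_path, *commands)
  statuses = Open3.pipeline(*commands, out: out_path)
  failed = statuses.reject(&:success?)
  raise "pipeline failed: #{failed.inspect}" unless failed.empty?
  out_path
end

# Stand-in commands; the data never passes through the Ruby process.
out = File.join(Dir.mktmpdir, "sorted.txt")
pipeline_to_file(out, ["printf", "b\na\nb\n"], ["sort", "-u"])
content = File.read(out)
p content # => "a\nb\n"
```

For the conversion above you would pass `["gzcat", "file.json.gz"]`, the jq invocation as an argument array, and `["gzip", "-9"]`, with `file.ndjson.gzip` as the output path.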