Can you easily chain these, though? (gzcat some.txt|grep foo|sort -u|head -10 etc?). Especially lazily, if the uncompressed stream is of modest size, like a couple of gigabytes?
I'm not sure what you mean by lazily here, but internally[0] it creates real anonymous pipes[1] between the spawned processes, so the data does not go through the ruby process at all.
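For comparison, here is a minimal sketch of the same idea using only the standard library's Open3.pipeline_r. This is not necessarily what the library discussed above does internally. It connects the commands with OS pipes and hands Ruby only the last command's stdout. The printf command and its sample lines are hypothetical stand-ins for `gzcat some.txt`.

```ruby
require "open3"

# Each inner array is one command's argv; no shell is involved.
# printf is a hypothetical stand-in for `gzcat some.txt`.
commands = [
  ["printf", 'foo 2\nbar\nfoo 1\nfoo 2\n'],
  ["grep", "foo"],
  ["sort", "-u"],
  ["head", "-n", "10"]
]

# Only the output of the last command is read by the Ruby process;
# the intermediate data flows directly between the child processes.
lines = Open3.pipeline_r(*commands) do |last_stdout, _wait_threads|
  last_stdout.each_line.map(&:chomp)
end

puts lines
```

Reading last_stdout with each_line pulls the output a line at a time, so the Ruby side never has to hold the whole stream.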
I'm currently working with 150MB worth of gzipped JSON - marshalling the full file from JSON to a Ruby hash eats up a lot of memory. One tweak that makes lazy iteration over the file easier (while keeping temporary disk I/O reasonable) is to pipe it through zcat, then jq in stream mode to convert it to ndjson, then gzip again. The result is a temp file that Ruby's Zlib can wrap, giving a stream that is convenient to iterate lazily, one line at a time.
Generally marshalling a gig or more of JSON (non-lazily) takes a lot of resources in ruby.
Some do, some don't. JSON is a special case: a valid JSON file is a single top-level value (in practice one array or object literal), so event-driven (SAX-style) parsing has to be something of a hack (like jq's stream mode). In theory json_streamer or yajl should help, but I couldn't get any combination to return a proper lazy iterator.
With the file as ndjson it was easier, if a little sparsely documented (Zlib::GzipReader.new or .wrap?):
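require "json"
require "zlib"
# some_ndfile: an open IO on the gzipped ndjson temp file
gz = Zlib::GzipReader.wrap(some_ndfile)
# each_line without a block returns an Enumerator; .lazy keeps it from reading ahead
obs = gz.each_line.lazy.map { |line| JSON.parse(line) }.first(4)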
When we can get a line at a time, marshalling the whole line isn't an issue.
My issue is more that it is tricky to nest ruby IO objects and return a lazy iterator - especially nesting custom filters along the way - at least more tricky than it should be.
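A minimal sketch of what such nesting can look like with only the standard library, assuming gzipped ndjson as input. The helper name lazy_ndjson, the in-memory sample data, and the "even" filter are all hypothetical, chosen just to keep the example self-contained.

```ruby
require "json"
require "stringio"
require "zlib"

# Hypothetical helper: wraps any IO holding gzipped ndjson and returns
# a lazy enumerator of parsed records. Nothing is read until it is consumed.
def lazy_ndjson(io)
  Zlib::GzipReader.new(io).each_line.lazy
    .map(&:strip)
    .reject(&:empty?)
    .map { |line| JSON.parse(line) }
end

# Build a small gzipped ndjson sample in memory (stand-in for a temp file).
buffer = StringIO.new(String.new)
gz = Zlib::GzipWriter.new(buffer)
5.times { |i| gz.puts JSON.generate("id" => i, "even" => i.even?) }
gz.finish # writes the gzip trailer but leaves the StringIO open
buffer.rewind

# Custom filters chain onto the lazy enumerator like any other step.
records = lazy_ndjson(buffer)
first_even = records.select { |r| r["even"] }.first(2)

p first_even
```

Because every step stays on Enumerator::Lazy, first(2) stops reading as soon as two matching records have been parsed.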
Apparently there's a third-party framework that does seem promising:
For example:
https://ruby-doc.org/3.2.2/stdlibs/open3/Open3.html