| HN Mirror



	by n_e 81 days ago
	I process TB-size ndjson files. I want to use jq to do some simple transformations between stages of the processing pipeline (e.g. rename a field), but it is so slow that I write a single-use Node or Rust script instead.
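
For readers unfamiliar with the kind of "simple transformation" meant here, the following is a minimal sketch of a streaming field rename over NDJSON, written in Python for brevity. It is not the commenter's script: the function name `rename_field` and the field names `ts` and `timestamp` are hypothetical, and it only handles top-level keys of object records.

```python
import json
from typing import Iterable, Iterator


def rename_field(lines: Iterable[str], old: str, new: str) -> Iterator[str]:
    """Yield each NDJSON record with top-level key `old` renamed to `new`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if isinstance(record, dict) and old in record:
            # Rebuild the dict so the renamed key keeps its original position.
            record = {(new if k == old else k): v for k, v in record.items()}
        # Compact separators keep each record on one line without padding.
        yield json.dumps(record, separators=(",", ":"))


# Hypothetical sample records; a real pipeline would iterate over a file
# object or sys.stdin instead, so only one line is held in memory at a time.
sample = ['{"ts": 1, "msg": "a"}', '{"msg": "b"}']
renamed = list(rename_field(sample, "ts", "timestamp"))
print("\n".join(renamed))
```

Because the function is a generator over lines, memory use stays flat regardless of file size; the cost is dominated by parsing and re-serializing every record, which is the part a faster tool would have to beat.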

4 comments

messe 81 days ago

Now I'm really curious. What field are you in that ndjson files of that size are common?

I'm sure there are reasons against switching to something more efficient–we've all been there–I'm just surprised.


overfeed 81 days ago

> Now I'm really curious. What field are you in that ndjson files of that size are common?

I'm not OP, but structured JSON logs can easily result in humongous ndjson files, even with a modest fleet of servers over a not-very-long period of time.


messe 81 days ago

So what's the use case for keeping them in that format rather than something more easily indexed and queryable?

I'd probably just shove it all into Postgres, but even a multi terabyte SQLite database seems more reasonable.


carlmr 81 days ago

Replying here because the other comment is too deeply nested to reply.

Even if it's once off, some people handle a lot of once-offs, that's exactly where you need good CLI tooling to support it.

Sure, jq isn't exactly super slow, but I have also avoided it in pipelines where I just need faster throughput.

rg was insanely useful in a project I once got where they had about 5GB of source files, a lot of them auto-generated. And you needed to find stuff in there. People were using Notepad++ and waiting minutes for a query to find something in the haystack. rg returned results in seconds.


messe 81 days ago

You make some good points. I've worked in support before, so I shouldn't have discounted how frequent "once-offs" can be.


paavope 81 days ago

The use case could be exactly that: processing an old trove of logs into something more easily indexed and queryable, with jq as part of that processing pipeline.


messe 81 days ago

Fair, but for a once-off thing performance isn't usually a major factor.

The comment I was replying to implied this was something more regular.

EDIT: why is this being downvoted? I didn't think I was rude. The person I responded to made a good point, I was just clarifying that it wasn't quite the situation I was asking about.


adastra22 81 days ago

At scale, low performance can very easily mean "longer than the lifetime of the universe to execute." The question isn't how quickly something will get done, but whether it can be done at all.


bigDinosaur 81 days ago

Certain people/businesses deal with one-off things every day. Even for something truly one-off, if one tool is too slow it might still be the difference between being able to do it once or not at all.


eru 81 days ago

This reminds me of someone who wrote a regex tool that matches by compiling regexes (at runtime of the tool) via LLVM to native code.

You could probably do something similar for a faster jq.


loxias 80 days ago

I would love, _love_ to know more about your data formats, your tools, what the JSON looks like, basically as much as you're willing to share. :)

For about a month now I've been working on a suite of tools for dealing with JSON specifically written for the imagined audience of "for people who like CLIs or TUIs and have to deal with PILES AND PILES of JSON and care deeply about performance".

For me, I've been writing them just because it's an "itch". I like writing high performance/efficient software, and there are a few gaps whose existence bugged me and that I knew I could fill.

I'm having fun and will be happy when I finish, regardless, but it would be so cool if it happened to solve a problem for someone else.


landr0id 80 days ago

I maintain some tools for the videogame World of Warships. The developer has a file called GameParams.bin which is Python-pickled data (their scripting language is Python).

Working with this is pretty painful, so I convert the Pickled structure to other formats including JSON.

Prettified, the file has always been around 500MB, but recently it expands to about 3GB, I think because they’ve added extra regional parameters.

The file inflates to a large size because Pickle refcounts objects for deduping, whereas obviously that’s lost in JSON.

I care about speed and tools not choking on the large inputs, so I use jaq for querying and instruct LLMs operating on the data to do the same.
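
The size inflation described above can be reproduced in miniature. The sketch below relies on a general property of Python's `pickle`: it keeps a memo of objects already written and emits a short back-reference when the same object appears again, whereas JSON has no notion of shared references and writes the value out in full each time. The data here is a made-up stand-in, not the actual GameParams structure, and whether this fully explains the growth of that file is the commenter's inference.

```python
import json
import pickle

# Hypothetical stand-in for a parameter block shared by many entries.
shared = {"armor": list(range(200)), "name": "common-params"}
entries = [shared] * 1000  # 1000 references to the same dict object

pickled = pickle.dumps(entries)
as_json = json.dumps(entries)

# pickle stores `shared` once plus 1000 back-references;
# JSON serializes the full dict 1000 times.
print(len(pickled), len(as_json))

restored = pickle.loads(pickled)
decoded = json.loads(as_json)

# Object identity survives a pickle round trip but not a JSON round trip.
print(restored[0] is restored[1])  # True
print(decoded[0] is decoded[1])    # False
```

The exact byte counts depend on the pickle protocol and Python version, but the JSON output is larger by orders of magnitude in this setup.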


nchmy 81 days ago

This isn't for you then

> The query language is deliberately less expressive than jq's. jsongrep is a search tool, not a transformation tool-- it finds values but doesn't compute new ones. There are no filters, no arithmetic, no string interpolation.

Mind me asking what sorts of TB json files you work with? Seems excessively immense.


rennokki 81 days ago

> Uses jq for TB json files

> Hadoop: bro

> Spark: bro

> hive: bro

> data team: bro


eevmanu 81 days ago

made me remember this article

<https://adamdrake.com/command-line-tools-can-be-235x-faster-...>

  Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)

  Conclusion: Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.


rennokki 76 days ago

This article is good for new programmers to understand why certain solutions are better at scale: there is no silver bullet. Also, it is from 2014 and the dataset is < 4GB, so there was no reason to use Hadoop.

The discussion we had here was involving TB of data, so I'm curious how this is faster with CLIs rather than parallel processing...


f311a 81 days ago

jq is very convenient, even if your files are more than 100GB. I often need to extract one field from huge JSON Lines files, so I just pipe them into jq to get results. It's slower, but implementing proper data processing would take more time.


rennokki 76 days ago

More than 100GB can be 101GB, 500GB or 1TB+. I was speaking about 1TB+ files. I'm not sure you can get it faster unless you have a parallel processor.
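
One point relevant to the parallelism question: because NDJSON is line-delimited, a single file can be cut into independent byte ranges that each end on a newline, so parallel processing does not require a cluster. The following is a rough sketch under stated assumptions, not a benchmark or anyone's actual pipeline. The helper names `split_ranges` and `count_records` and the `level` field are hypothetical, and the ranges are processed sequentially here to keep the example small; each range could instead be handed to a separate worker process.

```python
import json
import os
import tempfile
from typing import List, Tuple


def split_ranges(path: str, parts: int) -> List[Tuple[int, int]]:
    """Split a file into up to `parts` byte ranges that each end on a newline."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, parts):
            target = size * i // parts
            if target <= bounds[-1]:
                continue
            f.seek(target)
            f.readline()  # advance past the end of the line containing `target`
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def count_records(path: str, start: int, end: int, field: str) -> int:
    """Count object records in bytes [start, end) that have top-level `field`."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if line.strip():
                record = json.loads(line)
                if isinstance(record, dict) and field in record:
                    count += 1
    return count


# Build a tiny hypothetical NDJSON file: odd ids carry a "level" field.
with tempfile.NamedTemporaryFile("w", suffix=".ndjson", delete=False) as tmp:
    for i in range(10):
        rec = {"id": i, "level": "error"} if i % 2 else {"id": i}
        tmp.write(json.dumps(rec) + "\n")
    path = tmp.name

ranges = split_ranges(path, 4)
total = sum(count_records(path, s, e, "level") for s, e in ranges)
os.remove(path)
print(ranges, total)
```

Whether this beats a distributed system at TB scale depends on disk throughput and per-core parser speed, which is the question the replies below are arguing about.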


anonymoushn 81 days ago

are those tools known for their fast json parsers?


rennokki 76 days ago

If we talk about TB or PB+ scales, then yes.


anonymoushn 76 days ago

Oh, can you post some benchmarks? I didn't know that parser throughput per core would change with the amount of data like that.
