HN Mirror


by nchmy 81 days ago

This isn't for you then

> The query language is deliberately less expressive than jq's. jsongrep is a search tool, not a transformation tool: it finds values but doesn't compute new ones. There are no filters, no arithmetic, no string interpolation.

Mind if I ask what sorts of TB-scale JSON files you work with? That seems excessively large.


rennokki 81 days ago

> Uses jq for TB json files

> Hadoop: bro

> Spark: bro

> hive: bro

> data team: bro


eevmanu 81 days ago

This reminded me of this article:

<https://adamdrake.com/command-line-tools-can-be-235x-faster-...>

  Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)

  Conclusion: Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.


rennokki 76 days ago

This article is good for new programmers who want to understand why certain solutions are better at scale: there is no silver bullet. Also, it is from 2014 and the dataset is < 4GB, so there was no reason to use Hadoop.

The discussion here involved terabytes of data, so I'm curious how CLI tools would be faster than parallel processing at that scale...


f311a 81 days ago

jq is very convenient, even if your files are more than 100GB. I often need to extract one field from huge JSON Lines files, so I just pipe them through jq to get the results. It's slower, but implementing proper data processing would take more time.
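For readers who have not done this kind of one-field extraction, here is a minimal Python sketch of the same idea: read a JSON Lines stream one line at a time and yield a single field, so memory use stays flat regardless of file size. The function name `extract_field` and the sample records are made up for illustration, and unlike a jq field filter this version skips records that lack the field instead of printing `null`.

```python
import io
import json
from typing import Iterable, Iterator


def extract_field(lines: Iterable[str], field: str) -> Iterator[object]:
    """Yield `field` from each JSON object in a JSON Lines stream.

    Only one line is held in memory at a time, so the input can be
    arbitrarily large (a file object, sys.stdin, a pipe, ...).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Skip non-object lines and objects that lack the field.
        if isinstance(record, dict) and field in record:
            yield record[field]


# Hypothetical sample input standing in for a huge file.
sample = io.StringIO('{"id": 1, "name": "a"}\n\n{"id": 2}\n{"name": "c"}\n')
ids = list(extract_field(sample, "id"))
```

In a real pipeline the same function could be fed `sys.stdin` or an open file handle instead of the in-memory sample.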


rennokki 76 days ago

More than 100GB can mean 101GB, 500GB, or 1TB+. I was talking about 1TB+ files. I'm not sure you can make that faster without parallel processing.
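As a rough illustration of what parallel processing of a single large JSON Lines file could look like, the sketch below splits a file into byte ranges that each start and end on a line boundary, so every range can be parsed independently. The helper names `split_on_newlines` and `extract_range` are hypothetical, and the ranges are processed sequentially here; handing each range to a separate worker process is left as an assumption. Whether this beats a single streaming pass depends on disk bandwidth as much as on parsing speed.

```python
import json
import os
import tempfile


def split_on_newlines(path: str, parts: int) -> list[tuple[int, int]]:
    """Return up to `parts` (start, end) byte ranges aligned to line boundaries."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, parts):
            target = size * i // parts
            if target <= boundaries[-1]:
                continue
            f.seek(target)
            f.readline()  # move to the end of the line containing `target`
            pos = f.tell()
            if boundaries[-1] < pos < size:
                boundaries.append(pos)
    boundaries.append(size)
    return list(zip(boundaries[:-1], boundaries[1:]))


def extract_range(path: str, start: int, end: int, field: str) -> list[object]:
    """Parse only the lines inside [start, end) and collect `field`."""
    values = []
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline().strip()
            if not line:
                continue
            record = json.loads(line)
            if isinstance(record, dict) and field in record:
                values.append(record[field])
    return values


# Small synthetic file standing in for a multi-terabyte input.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for i in range(100):
        tmp.write(json.dumps({"id": i}) + "\n")

ranges = split_on_newlines(tmp.name, 4)
ids = [v for start, end in ranges for v in extract_range(tmp.name, start, end, "id")]
os.remove(tmp.name)
```

Because the ranges never cut a record in half, concatenating the per-range results in order gives the same output as one sequential pass.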


anonymoushn 81 days ago

Are those tools known for their fast JSON parsers?


rennokki 76 days ago

If we talk about TB or PB+ scales, then yes.


anonymoushn 76 days ago

Oh, can you post some benchmarks? I didn't know that parser throughput per core would change with the amount of data like that.
