by zeitlupe 1184 days ago
Spark is my favorite tool for dealing with JSON. It can read as many JSON files as you want, in any layout and from any folder structure, however deeply nested; it parallelizes the work; and it is great for flattening structs. So far I've never run into memory issues (or at least never run out of workarounds).
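For anyone who wants to try this, here is a rough sketch of the read-and-flatten pattern in PySpark. The names `leaf_paths` and `read_and_flatten` are made up for illustration, and it assumes Spark 3.0 or later (for the `recursiveFileLookup` option) and field names that contain no dots. Arrays and maps are left as-is rather than exploded.

```python
from typing import Any, Dict, List


def leaf_paths(node: Dict[str, Any], prefix: str = "") -> List[str]:
    """Collect dotted paths of all non-struct fields in a schema dict.

    Expects the dict shape produced by StructType.jsonValue():
    {"type": "struct", "fields": [{"name": ..., "type": ...}, ...]}.
    Arrays and maps are treated as leaves (nothing is exploded here).
    """
    paths: List[str] = []
    for field in node["fields"]:
        path = f"{prefix}{field['name']}"
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            # Nested struct: recurse and extend the dotted prefix.
            paths.extend(leaf_paths(ftype, prefix=f"{path}."))
        else:
            paths.append(path)
    return paths


def read_and_flatten(spark, root_dir: str):
    """Read every JSON file under root_dir and flatten nested structs.

    `spark` is an existing SparkSession; `root_dir` is a hypothetical path.
    """
    # Imported lazily so leaf_paths can be used without Spark installed.
    from pyspark.sql.functions import col

    df = (
        spark.read
        .option("recursiveFileLookup", "true")  # descend into nested folders
        .json(root_dir)
    )
    # Turn "user.geo.lat" into a top-level column named "user_geo_lat".
    columns = [
        col(p).alias(p.replace(".", "_"))
        for p in leaf_paths(df.schema.jsonValue())
    ]
    return df.select(*columns)
```

The schema walk is kept separate from the Spark call so it can be checked on a plain dict first; whether this holds up on very wide or inconsistent schemas is something you would have to try on your own data.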
2 comments

Yeah, given that everything is now multi-core, it makes sense to use a natively parallel tool for anything compute-bound. And Spark will happily run locally and (unlike previous big data paradigms) doesn’t require excessive mental contortions.

Of course while you’re at it, you should probably just convert all your JSON into Parquet to speed up successive queries…
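A minimal sketch of what that could look like when running Spark locally. `local_master` and `json_to_parquet` are hypothetical helper names and the paths are placeholders; `local[*]` is Spark's master URL for "use all local cores".

```python
from typing import Optional


def local_master(cores: Optional[int] = None) -> str:
    """Build a Spark master URL for local mode.

    None means "use every available core" (local[*]).
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return f"local[{cores}]"


def json_to_parquet(src: str, dst: str, cores: Optional[int] = None) -> None:
    """Read JSON from src and rewrite it as Parquet at dst, on this machine."""
    # Imported lazily so local_master can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(local_master(cores))
        .appName("json-to-parquet")
        .getOrCreate()
    )
    try:
        # By default spark.read.json expects one JSON object per line;
        # multi-line documents need .option("multiLine", "true").
        df = spark.read.json(src)
        df.write.mode("overwrite").parquet(dst)
    finally:
        spark.stop()


# Example call (placeholder paths):
# json_to_parquet("data/raw_json", "data/parquet", cores=4)
```

Later queries can then go through `spark.read.parquet(...)` against the output, which skips JSON parsing and schema inference on every run.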

How much memory would a Spark worker need to process a single 25 GB JSON file?

To clarify, this is not a JSONL or NDJSON file. It is just a single JSON object.