	by zeitlupe 1231 days ago
	Spark is my favorite tool for dealing with JSON. It can read as many JSON files as you want, in any format and from any folder structure (even a nested one), it offers parallelization, and it is great for flattening structs. I've never run into memory issues so far (or at least never ran out of workarounds).
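For readers unfamiliar with the term, "flattening a struct" means turning nested fields into top-level columns. The following is a minimal sketch in plain Python, not Spark's API: the `flatten` helper, the sample record, and the dotted key naming are all assumptions made for illustration. In Spark itself, the usual route is selecting nested columns, for example `df.select("user.*")`.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent path along.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical sample record with two levels of nesting.
record = {"id": 1, "user": {"name": "ada", "geo": {"city": "Berlin"}}}
flat = flatten(record)
print(flat)
```

Lists are left untouched in this sketch; in Spark, arrays would typically need an explicit `explode` step before their elements can become rows.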

2 comments

pidge 1231 days ago

Yeah, given that everything is now multi-core, it makes sense to use a natively parallel tool for anything compute-bound. And Spark will happily run locally and (unlike previous big data paradigms) doesn’t require excessive mental contortions.

Of course while you’re at it, you should probably just convert all your JSON into Parquet to speed up successive queries…


iknownothow 1231 days ago

How much memory would a Spark worker need to process a single 25 GB JSON file?

To clarify, this is not a JSONL or NDJSON file. It is just a single JSON object.
