


	by ankitrohatgi 3164 days ago
	I am curious how the query performance compares to working with JSON files in Spark for ~100 GB of data.

3 comments

jakebol 3164 days ago

Jake from TileDB, Inc. here: Depending on the structure of the JSON files you are querying, you may be able to take advantage of columnar compression and massively reduce the dataset size (especially if the JSON files contain numeric data). Also, repeat queries will not have to re-parse the JSON files. This may speed up queries quite a lot, but it depends on the specifics of your problem.
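
To make the size argument above concrete, here is a rough sketch using only the Python standard library. It is not TileDB code and not a benchmark: the record layout (`id`, `temperature`, `count`) and the helper names are made up for illustration, and the data is synthetic. It stores the same numeric records once as row-oriented JSON text and once as per-column packed binary arrays compressed with zlib, then compares byte counts.

```python
import json
import struct
import zlib


def make_records(n):
    # Synthetic rows with a hypothetical schema; real data will compress differently.
    return [
        {"id": i, "temperature": 20.0 + (i % 10) * 0.5, "count": i % 3}
        for i in range(n)
    ]


def json_size(records):
    # Size in bytes of the row-oriented JSON text.
    return len(json.dumps(records).encode("utf-8"))


def columnar_compressed_size(records):
    # Pack each field into its own contiguous binary column, then compress
    # each column separately so similar values sit next to each other.
    n = len(records)
    ids = struct.pack(f"<{n}q", *(r["id"] for r in records))
    temps = struct.pack(f"<{n}d", *(r["temperature"] for r in records))
    counts = struct.pack(f"<{n}q", *(r["count"] for r in records))
    return sum(len(zlib.compress(col)) for col in (ids, temps, counts))


if __name__ == "__main__":
    records = make_records(1000)
    print("JSON bytes:               ", json_size(records))
    print("columnar compressed bytes:", columnar_compressed_size(records))
```

On this synthetic input the columnar form is smaller than the raw JSON text, but how much smaller depends entirely on the data, as the comment says. The second point (no re-parsing) follows from the same layout: once the columns are in binary form, later reads skip `json.loads` altogether.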


StanSeltser 3164 days ago

Stanislav Seltser, Petacube: you are comparing a structured workload (array-based TileDB) to an unstructured one (JSON + Spark). Once you convert your JSON to a sparse array structure (a one-time conversion), TileDB will beat Spark + JSON by several orders of magnitude. Caveat: this assumes your Spark + JSON workload involves heavy processing, not a lightweight one.
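
As a toy illustration of the one-time conversion idea, the sketch below parses JSON lines once into a sorted coordinate list and then answers range queries with binary search instead of re-scanning the text. This is plain Python, not TileDB's API or storage format; the field names `x`, `y`, `value`, the integer coordinates, and both function names are assumptions made for the example.

```python
import bisect
import json


def to_sparse(json_lines):
    # One-time conversion: parse each JSON line and keep only the
    # non-empty cells, sorted by their (x, y) coordinates.
    cells = [
        ((r["x"], r["y"]), r["value"])
        for r in map(json.loads, json_lines)
    ]
    cells.sort(key=lambda cell: cell[0])
    coords = [c for c, _ in cells]
    values = [v for _, v in cells]
    return coords, values


def query_rows(coords, values, x_lo, x_hi):
    # Return values whose integer x coordinate lies in [x_lo, x_hi].
    # The 1-tuple (x,) sorts before every (x, y), so bisect_left finds
    # the first cell with that x without touching the JSON again.
    lo = bisect.bisect_left(coords, (x_lo,))
    hi = bisect.bisect_left(coords, (x_hi + 1,))
    return values[lo:hi]


if __name__ == "__main__":
    lines = [
        '{"x": 5, "y": 1, "value": 2.5}',
        '{"x": 1, "y": 9, "value": 1.0}',
        '{"x": 3, "y": 2, "value": 7.0}',
    ]
    coords, values = to_sparse(lines)
    print(query_rows(coords, values, 2, 5))
```

The parsing cost is paid once in `to_sparse`; each later query is a pair of binary searches plus a slice. Whether that translates into orders-of-magnitude gains at scale depends on the workload, per the caveat above.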


maxpert 3164 days ago

They have some benchmarks in the paper: https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17...
