by ankitrohatgi 3164 days ago
I am curious how the query performance compares to working with JSON files in Spark for ~100 GB of data.
3 comments

Jake from TileDB, Inc. here: Depending on the structure of the JSON files you are querying, you may be able to take advantage of columnar compression and massively reduce the dataset size (especially if the JSON files contain numeric data). Also, repeat queries will not have to re-parse the JSON files. This may speed up queries quite a lot, but it depends on the specifics of your problem.
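To make the point in the comment above concrete, here is a rough sketch in plain Python (standard library only, not the TileDB or Spark API). It assumes a made-up dataset of mostly numeric records with hypothetical field names (`id`, `temperature`, `count`), stores the same data once as newline-delimited JSON text and once as one packed binary array per field, and then answers a single-column query from each. It only illustrates the layout difference; a real columnar store would additionally compress each column, and how much that helps depends on the data.

```python
import json
import random
import struct

random.seed(0)  # deterministic synthetic data; not taken from the thread

# Hypothetical records: mostly numeric fields, one JSON object per line.
records = [
    {
        "id": i,
        "temperature": round(random.uniform(15.0, 25.0), 2),
        "count": random.randint(0, 100),
    }
    for i in range(10_000)
]
n = len(records)

# Row-oriented layout: newline-delimited JSON text.
json_bytes = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Column-oriented layout: each field packed as its own contiguous array
# (little-endian 32-bit ints and 64-bit floats).
columns = {
    "id": struct.pack(f"<{n}i", *(r["id"] for r in records)),
    "temperature": struct.pack(f"<{n}d", *(r["temperature"] for r in records)),
    "count": struct.pack(f"<{n}i", *(r["count"] for r in records)),
}
columnar_raw = sum(len(c) for c in columns.values())

# Query one column from JSON: every line must be parsed in full, every time.
from_json = [json.loads(line)["temperature"] for line in json_bytes.splitlines()]

# Query the same column from the packed layout: no text parsing, and the
# other columns are never touched.
from_columns = list(struct.unpack(f"<{n}d", columns["temperature"]))

print("JSON text bytes:     ", len(json_bytes))
print("packed column bytes: ", columnar_raw)
print("same query result:   ", from_json == from_columns)
```

The conversion to the packed layout is paid once; each later query on `temperature` reads only that column's bytes, which is the "repeat queries will not have to re-parse" part of the argument.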
Stanislav Seltser, Petacube: You are comparing a structured workload (array-based TileDB) to an unstructured one (JSON + Spark). Once you convert your JSON to a sparse array structure (a one-time conversion), TileDB will beat Spark + JSON by several orders of magnitude. Caveat: this assumes your Spark + JSON workload is heavy processing, not a lightweight one.