by mskar 1383 days ago
There is some duplication, where different networks under the same carrier could benefit from normalization, but in general duplication isn't the primary issue.

The data is partitioned for some carriers at the network level, but unless that carrier has networks that are unique to a given state, it's difficult to partition by location.

The majority of the data is lumped into very large single JSON files (not newline-delimited), so an initial parsing step is required to break out substructures for parallel processing via warehousing technologies. I think Aetna has a 300 GB compressed (single) JSON file.
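
To illustrate that initial parsing step, here is a minimal sketch (not our actual pipeline) that streams the elements of one huge JSON array and rewrites them as newline-delimited JSON, using only the Python standard library. It assumes the input is a single top-level array; the names `iter_array_items` and `to_ndjson` are made up for this example, and real carrier files usually wrap the array in an outer object, so you would need to seek to the array first or use a streaming parser.

```python
import io
import json


def iter_array_items(fp, chunk_size=1 << 20):
    """Yield the elements of a top-level JSON array from text stream fp,
    one at a time, without loading the whole document into memory."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False

    def fill():
        # Pull the next chunk of text into the buffer.
        nonlocal buf, eof
        chunk = fp.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True

    # Find the opening bracket of the array.
    while True:
        buf = buf.lstrip()
        if buf:
            break
        if eof:
            return  # empty input, nothing to yield
        fill()
    if buf[0] != "[":
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]

    while True:
        # Lenient: skip whitespace and separators between elements.
        buf = buf.lstrip(" \t\r\n,")
        if not buf:
            if eof:
                raise ValueError("unexpected end of input")
            fill()
            continue
        if buf[0] == "]":
            return
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            if eof:
                raise
            fill()  # element is incomplete, read more and retry
            continue
        # Only accept the value once we can see the delimiter after it,
        # so a number cut off at a chunk boundary is not emitted early.
        rest = buf[end:].lstrip()
        if not rest or rest[0] not in ",]":
            if eof:
                raise ValueError("malformed or truncated array")
            fill()
            continue
        yield item
        buf = buf[end:]


def to_ndjson(src, dst, chunk_size=1 << 20):
    """Write one compact JSON document per line; return the item count."""
    count = 0
    for item in iter_array_items(src, chunk_size):
        dst.write(json.dumps(item, separators=(",", ":")) + "\n")
        count += 1
    return count
```

For a gzip-compressed file, `gzip.open(path, "rt")` gives a text stream that can be passed in as `src`. Retrying the decode on a growing buffer is wasteful for very large elements, so this is only meant to show the shape of the step.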

After breaking the JSON down to a single array entry per provider/network, parsing is still a bit tricky because there are some very "hot" keys. Some provider array entries may have only 1,000 code and cost entries; others may have 100k. We've seen array entries larger than 50 MB for a single provider/network/carrier.
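
One way to take the edge off that skew, sketched here as an assumption rather than a description of what we run, is to split each oversized entry into bounded parts before loading, so that no single row carries 100k code/cost entries. The field names `provider_id` and `rates`, the added `part` field, and the function `split_entry` are all hypothetical and not taken from any carrier's schema.

```python
def split_entry(entry, list_key="rates", max_items=10_000):
    """Yield copies of entry whose list_key holds at most max_items items.

    Every part repeats the entry's other fields and gets a 'part' index,
    so downstream workers see rows of bounded size.
    """
    # Fields shared by every part (everything except the large list).
    shared = {k: v for k, v in entry.items() if k != list_key}
    items = entry.get(list_key, [])
    if not items:
        yield {**shared, list_key: [], "part": 0}
        return
    for part, start in enumerate(range(0, len(items), max_items)):
        yield {**shared, list_key: items[start:start + max_items], "part": part}
```

The trade-off is that the shared fields are duplicated in every part, which is usually cheap next to the size of the list being split.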