Hacker News
by didgetmaster 1385 days ago
Is the data unique, or has it been duplicated across multiple formats? In other words, is there a CSV file right alongside a JSON file and an XML file that contain the exact same data, just in different formats?

Is the data partitioned at all (e.g. by state), so that you can download just the data for California without downloading all the data, loading it into a huge database table, and then querying it (e.g. SELECT * FROM <table> WHERE state = 'California')?

1 comment

There is some duplication, where different networks under the same carrier could benefit from normalization, but in general duplication isn't the primary issue.

The data is partitioned for some carriers at the network level, but unless a carrier has networks that are unique to a given state, it's difficult to partition by location.

The majority of the data is lumped into very large, single JSON files (not newline-delimited), so an initial parsing step is required to break out substructures for parallel processing via warehousing technologies. I think Aetna has a single compressed JSON file that is 300 GB.
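To make that initial parsing step concrete, here is a minimal sketch of turning one big JSON document into newline-delimited JSON without loading it all into memory. It is not our actual pipeline. It assumes, purely for illustration, that the file is a single top-level array (a real file may well nest the array under a key, which would call for an event-based streaming parser), and the names `iter_array_items` and `to_ndjson` are made up for this example. It uses only the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Iterator, TextIO

_WS = " \t\r\n"


def iter_array_items(stream: TextIO, chunk_size: int = 1 << 16) -> Iterator[object]:
    """Yield the elements of a top-level JSON array one at a time.

    Only the element currently being decoded is held in memory.
    """
    decoder = json.JSONDecoder()

    # Skip leading whitespace and consume the opening bracket.
    buf = ""
    while not buf:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("empty input")
        buf = chunk.lstrip(_WS)
    if buf[0] != "[":
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]

    while True:
        # Drop separators between elements (lenient about stray commas).
        buf = buf.lstrip(_WS + ",")
        if buf.startswith("]"):
            return

        complete = False
        if buf:
            try:
                item, end = decoder.raw_decode(buf)
                rest = buf[end:].lstrip(_WS)
                # Accept the value only if a delimiter follows it, so a
                # number cut off at a chunk boundary is never emitted early.
                complete = rest[:1] in (",", "]")
            except json.JSONDecodeError:
                complete = False

        if complete:
            yield item
            buf = rest
            continue

        # Element is incomplete: read at least as much as we already hold,
        # so the buffer doubles and re-parsing stays roughly linear overall.
        chunk = stream.read(max(chunk_size, len(buf)))
        if not chunk:
            raise ValueError("truncated or malformed JSON array")
        buf += chunk


def to_ndjson(src: TextIO, dst: TextIO) -> int:
    """Write each array element of src as one JSON line to dst; return the count."""
    count = 0
    for item in iter_array_items(src):
        dst.write(json.dumps(item, separators=(",", ":")) + "\n")
        count += 1
    return count
```

For a gzip-compressed input, `src` could be the file object returned by `gzip.open(path, "rt")`, assuming the compression is gzip. Once every element sits on its own line, the output can be split and loaded in parallel.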

After breaking the JSON into a single array entry per provider/network, parsing is still a bit tricky because there are some very "hot" keys. Some provider array entries may have only 1,000 code and cost entries; others may have 100k. We've seen array entries > 50 MB for a single provider/network/carrier.
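One common way to tame that kind of skew, offered as a sketch and not a description of what we actually run, is to split each oversized entry into bounded-size parts before loading, so that no single row dominates a worker. The field name `rates`, the `part`/`n_parts` keys, and the function `split_entry` are hypothetical and chosen only for this example.

```python
from typing import Iterator


def split_entry(entry: dict, list_key: str = "rates", max_items: int = 1000) -> Iterator[dict]:
    """Split one provider/network entry into parts holding at most max_items
    list elements each; the other fields are copied onto every part."""
    items = entry.get(list_key, [])
    header = {k: v for k, v in entry.items() if k != list_key}
    # Ceiling division; an entry with an empty list still yields one part.
    n_parts = max(1, -(-len(items) // max_items))
    for part in range(n_parts):
        start = part * max_items
        yield {
            **header,
            "part": part,
            "n_parts": n_parts,
            list_key: items[start:start + max_items],
        }
```

With `max_items=1000`, an entry holding 100k code and cost entries becomes 100 rows of equal size, while a 1,000-entry provider stays a single row. The trade-off is that the header fields are repeated on every part.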