Hacker News
by sajacy 1380 days ago
So - veteran data wrangler here. I skimmed some of Humana's files. There's lots of repetition that can easily be removed when converting from a raw input to an analytical dataset - basically the huge blocks of text in "BILL_TYPE_CODE: 130,139,..." in the ADDITIONAL_INFO field can be normalized away by building a quasi-Huffman-encoded lookup table.
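A minimal sketch of that normalization, assuming each row is a dict and the repeated text sits in ADDITIONAL_INFO. The function names and sample values are made up for illustration. IDs are handed out by descending frequency, which is the "quasi-Huffman" part: the most common blocks get the smallest numbers. It is not a real Huffman code.

```python
from collections import Counter

def build_lookup(values):
    """Map each distinct string to a small integer ID, most frequent first."""
    counts = Counter(values)
    # most_common() sorts by descending count; ties keep first-seen order.
    return {text: idx for idx, (text, _) in enumerate(counts.most_common())}

def normalize(rows, field="ADDITIONAL_INFO"):
    """Replace the repeated text in `field` with an ID into a lookup table.

    `rows` must be a list (it is iterated twice); the input is not mutated.
    """
    lookup = build_lookup(row[field] for row in rows)
    slim = [{**row, field: lookup[row[field]]} for row in rows]
    return slim, lookup

# Made-up sample rows; real files have many more columns.
rows = [
    {"code": "A", "ADDITIONAL_INFO": "BILL_TYPE_CODE: 130,139"},
    {"code": "B", "ADDITIONAL_INFO": "BILL_TYPE_CODE: 110"},
    {"code": "C", "ADDITIONAL_INFO": "BILL_TYPE_CODE: 130,139"},
]
slim, lookup = normalize(rows)
```

The lookup table is stored once and the fact rows only carry the integer, so the big text blocks stop being repeated on every line.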

Noteworthy(?): there seems to be a limit of ~100~ 140 sets of prices, as seen in the filenames:

2022-08-25_NNN_in-network-rates_0000000XXXXX.csv.gz

~Did I miss something? ... or is this some kind of technical limitation for Humana?~ Edit: I missed the alphabetical ordering. Still, only about 140 price sets.

Also, each plan member's JSON file has a small chunk of useful information, then a useless list of all 15k gz parts of a relevant NNN_in-network-rates file (you only need the first filename to figure out which NNN to reference).
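As a rough sketch, the price-set number can be pulled out of the first filename with a regex built from the filename pattern quoted above. The function name and the group names are hypothetical, and how the filenames are laid out inside the JSON is not assumed here; the function just takes a list of names.

```python
import re

# Pattern derived from names like
# 2022-08-25_NNN_in-network-rates_0000000XXXXX.csv.gz
PART_NAME = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<price_set>\d+)"
    r"_in-network-rates_(?P<part>\d+)\.csv\.gz$"
)

def price_set_of(filenames):
    """Return the price-set number from the first filename, as a string."""
    first = filenames[0].rsplit("/", 1)[-1]  # tolerate a URL or path prefix
    match = PART_NAME.match(first)
    if match is None:
        raise ValueError(f"unexpected filename: {first!r}")
    return match.group("price_set")
```

The value is kept as a string so any zero padding survives and can be used to rebuild the rates filenames.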

For these files, you can use Range requests to download only the first, say, 50KB, and pipe it to gunzip and jq. (https://github.com/stedolan/jq/issues/31#issuecomment-900184...)
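A minimal Python sketch of the same idea, using only the standard library. The function names are hypothetical, and it assumes the server honors HTTP Range requests and that the file is a plain gzip stream.

```python
import urllib.request
import zlib

def fetch_prefix(url, nbytes=50_000):
    """Download only the first `nbytes` bytes using an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    with urllib.request.urlopen(req) as resp:
        # A server that ignores Range replies 200 with the full body,
        # so cap the read either way.
        return resp.read(nbytes)

def gunzip_prefix(data):
    """Decompress as much as possible from a truncated gzip stream."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(wbits=31)
    return decoder.decompress(data)
```

Usage would look like `gunzip_prefix(fetch_prefix(url))`. The result is a truncated JSON document, so a regular parser such as `json.loads` will reject it; the useful fields at the top have to be pulled out with a streaming or tolerant approach.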

I would also be interested in helping throw such an analytical dataset into BigQuery. It'll be great for sharing an open dataset. No doubt this will still be a gigantic headache, but it is tractable.

1 comment

Love the insights! Making a note to dig deeper into this. Feel free to reach out to me via email as well if you want to discuss more: alec@dolthub.com