Hacker News
by ipsin 4013 days ago
Does analyzing this actually require joining in large data sets -- that is, larger than will fit on a single machine?

I'd always assumed that the records involved weren't very large, but I don't know much about the problem space, so I'm not sure whether other data gets joined in a way that benefits from cluster-based analysis.

2 comments

I forget the exact numbers, but a single year's worth of Medicare Part D claims data is on the order of 1 TB. That doesn't include the beneficiary and provider datasets (which link patients and doctors) that you'll need to join against. Also, when detecting fraud like this, you may want to include the other Medicare parts (A, B, C), which are often larger than Part D (D being the newest). That leaves you manipulating on the order of 10 TB for a single-year analysis. Finally, since Medicare bills can be corrected for up to 3 years, you may end up joining several years' worth of multi-terabyte datasets.
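To make the arithmetic above concrete, here is a rough back-of-envelope sketch. It uses only the order-of-magnitude figures quoted in this comment; multiplying the per-year size by the correction window assumes every year is about the same size, which is a simplification of mine and not a measured number. The names are purely illustrative.

```python
# Back-of-envelope sizing from the rough figures in the comment above.
# All values are orders of magnitude, in terabytes, not measurements.
part_d_per_year_tb = 1       # one year of Part D claims: ~1 TB
all_parts_per_year_tb = 10   # Parts A-D plus beneficiary/provider data: ~10 TB
correction_window_years = 3  # bills can be corrected for up to 3 years


def window_total_tb(per_year_tb, years):
    """Data touched if every year in the window is joined.

    Assumes (hypothetically) that each year is roughly the same size.
    """
    return per_year_tb * years


part_d_window = window_total_tb(part_d_per_year_tb, correction_window_years)
all_parts_window = window_total_tb(all_parts_per_year_tb, correction_window_years)

print(f"Part D only, {correction_window_years} years: ~{part_d_window} TB")
print(f"All parts, {correction_window_years} years: ~{all_parts_window} TB")
```

Even the Part D-only case lands in multi-terabyte territory once the correction window is included, which is the point being made about join sizes.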
Medicare is one of the largest health schemes in the world, in an industry known for massive amounts of paperwork. It's a humongous data set.