I forget the exact numbers, but a single year's worth of Medicare Part D claims data will be on the order of 1TB. That doesn't include the beneficiary and provider datasets (which link patients and doctors), which you'll need to join against. Also, when detecting fraud like this, you may want to include the other Medicare parts (A, B, C), which are oftentimes larger than Part D (D being the newest). So this leaves you manipulating on the order of 10TB for a single-year analysis. Finally, since Medicare bills can be corrected for up to 3 years, you may end up joining multi-terabyte datasets across several years.
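
To put rough numbers on that last point, here's a back-of-the-envelope sketch. It only uses the order-of-magnitude figures above (none of them exact), and it assumes the worst case where every year in the correction window has to be scanned in full. The function name is made up for illustration.

```python
# Order-of-magnitude figures from the comment above; none are exact.
PART_D_TB_PER_YEAR = 1       # a year of Part D claims, roughly
ALL_PARTS_TB_PER_YEAR = 10   # rough total once parts A, B, C and the lookup datasets are included
CORRECTION_WINDOW_YEARS = 3  # bills can be corrected for up to 3 years


def worst_case_tb(tb_per_year, years):
    """Upper bound, assuming every year in the window is scanned in full."""
    return tb_per_year * years


part_d_only = worst_case_tb(PART_D_TB_PER_YEAR, CORRECTION_WINDOW_YEARS)
all_parts = worst_case_tb(ALL_PARTS_TB_PER_YEAR, CORRECTION_WINDOW_YEARS)

print(f"Part D only: ~{part_d_only} TB; all parts: ~{all_parts} TB")
```

In practice you'd probably filter to the corrected claims rather than scan everything, so treat these as ceilings rather than estimates.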