I think the real-world resolution to this problem is straightforward though. You should look at the finest level of granularity available, and pick the best treatment in the relevant subpopulation for the patient.
Unfortunately our level of certainty generally falls off as we increase the granularity. For example, imagine the patient is a 77yo Polish-American man, and we're lucky enough to have one historical result for 77yo Polish-American men. That man got treatment A and did better than expected. But say if we go out to 70-79y white men we have 1,000 people, of which 500 got treatment A and generally did significantly worse than the 500 who got treatment B. While the more granular category gives us a little information, the sample size is so small that we would be foolish to discard the less granular information.
This is all true. I originally added a disclaimer to my post that said "assuming you have enough data to support the level of granularity" but I removed it for brevity because I thought it was implied -- small sample size isn't part of Simpson's paradox. My apologies for being unclear