Multi armed bandit methods work best with immediate success-fail metrics. This one has time delays.
An example of how machine learning goes wrong is if a treatment slows down the progression but increases the death rate. Given exponential ramp up in the incoming cases, it will look good until the final horrifying numbers are in. You need to slice and dice the numbers by cohort to detect/react to this.
I decided that some numbers on how things go wrong would help.
Suppose that the treatment increased deaths by 50% but delayed death by a week. And suppose the disease has a doubling time of 1 week.
Back of the envelope: the treated patients dying today were infected a week earlier than the untreated patients dying today, when there were only 0.5 times as many cases. So the treatment shows 1.5 x 0.5 = 0.75 of the deaths at any point in time. It looks like it saves 25% of lives when in fact it kills 50% more people. The raw numbers will look good until you look at a cohort over time.
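The arithmetic above can be checked with a small simulation. This is only a sketch under assumptions that are mine, not the thread's: equal numbers of patients enter each arm every week, the baseline death rate is a made-up 2%, and untreated deaths come 2 weeks after infection. The names `weekly_deaths` and `BASE_RATE` are hypothetical.

```python
def weekly_deaths(n_weeks, death_rate, delay_weeks,
                  doubling_weeks=1.0, initial_cases=1000.0):
    """Deaths observed in each week when every cohort dies delay_weeks after infection."""
    deaths = []
    for week in range(n_weeks):
        infected_week = week - delay_weeks
        if infected_week < 0:
            # Nobody has been infected long enough to die yet.
            deaths.append(0.0)
        else:
            # Cases double every doubling_weeks (exponential ramp-up).
            cases = initial_cases * 2 ** (infected_week / doubling_weeks)
            deaths.append(death_rate * cases)
    return deaths


BASE_RATE = 0.02  # hypothetical baseline death rate, for illustration only

control = weekly_deaths(10, BASE_RATE, delay_weeks=2)
treated = weekly_deaths(10, 1.5 * BASE_RATE, delay_weeks=3)  # 50% more deaths, 1 week later

# Naive view: compare deaths reported in the same calendar week.
apparent_ratio = treated[-1] / control[-1]
# Cohort view: compare death rates among people infected in the same week.
cohort_ratio = (1.5 * BASE_RATE) / BASE_RATE

print(f"same-week death ratio: {apparent_ratio:.2f}")
print(f"same-cohort death ratio: {cohort_ratio:.2f}")
```

The same-week ratio comes out at 0.75 while the same-cohort ratio is 1.5. Passing `doubling_weeks=3/7` for a three-day doubling time makes the treatment look even better in the raw numbers (a ratio of about 0.30) without changing the cohort result.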
Current doubling time for deaths has been about 3 days. My assumption of a week is therefore optimistic. Perhaps we get there with social distancing.
"Multi armed bandit methods work best with immediate success-fail metrics. This one has time delays."
Well, sure, but everything works best with immediate success-fail metrics. One of the most basic results from learning theory is that the longer the latency between stimulus and response, the slower the learning rate can be. I'm not sure how multi-armed bandit is special in this regard. All learning techniques are going to be susceptible to the problem you outline in your second paragraph.
This is one of those "there is no perfect solution" situations. It's really easy to say that out loud. It's quite difficult to internalize it.
(Also, just as a note to your other post, bear in mind that our hard-core "social distancing" efforts in the US are just about to reach approx. 1 incubation period. It is only just this week that we're going to start seeing the results of that, and it'll phase in as slowly as our efforts 1-2 weeks ago did. My state just went to full lockdown today, though we've been on a looser lockdown for a week before that.)
Everything works better with immediate success/fail metrics. However, the simplest approach is the easiest to analyze, and the easiest to re-analyze after the fact in as many ways as you want. The more complex the decision making, the less we should be willing to put it under the control of a computer program. (Unless that program has been well-studied for our exact problem so that we trust it more.)
Which medicine looks effective? Which medicine gets people out of the hospital faster? What underlying conditions interacted badly with given medicines? These questions do not have to be asked up front. But they can be answered afterwards. And knowing the answers matters.
Here is an example. Suppose that we find one medication that gets people out of bed faster but kills some. In areas with overwhelmed hospitals, cycling people through the bed may save net lives. If your hospital is not overwhelmed, you wouldn't want to give that medicine. Now I'm not saying that any of these medicines will turn out like that. But they could. And if one did, I definitely want human judgement to be applied about when to use it.
I don't think anyone is proposing actually removing all humans from the loop, so I think that's an argument against a strawman.
Even if they were proposing it, there's no realistic chance of it happening.
I don't want people blindly copying "standard" scientific procedures either, where we run high-statistical-power studies for months with double-blind scenarios, then carefully peer-review them and come up with some result somewhere in 2022.
So, hopefully there will be blinded researchers who analyse the data.
They'll probably use sequential stopping rules to take samples of incoming data.
If one of the treatments works much much better, then they'll almost certainly recommend that (but doctors will probably figure this out first, anyway).
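For readers who haven't met sequential stopping rules: one classic example is Wald's sequential probability ratio test (SPRT). The sketch below is my own illustration, not something proposed in this thread and not necessarily what trial designers would pick. It assumes each outcome is a simple success/fail and that the two candidate success rates `p0` and `p1` are fixed in advance. The function name `sprt_bernoulli` and the default error rates are hypothetical.

```python
import math


def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for H0: success rate = p0 versus H1: success rate = p1.

    alpha is the tolerated false-positive rate, beta the false-negative rate.
    Returns (decision, number_of_outcomes_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: stop and accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: stop and accept H0
    log_likelihood_ratio = 0.0
    n = 0
    for n, success in enumerate(outcomes, start=1):
        if success:
            log_likelihood_ratio += math.log(p1 / p0)
        else:
            log_likelihood_ratio += math.log((1 - p1) / (1 - p0))
        if log_likelihood_ratio >= upper:
            return "accept_h1", n
        if log_likelihood_ratio <= lower:
            return "accept_h0", n
    # Ran out of data before either boundary was crossed.
    return "continue", n


# Check the evidence after every patient instead of waiting for a fixed sample size.
decision, n_used = sprt_bernoulli([True, True, False, True, True, True], p0=0.5, p1=0.7)
```

The point of a rule like this is that the test can stop as soon as the evidence is strong enough either way, which is why it suits data that trickles in. It still assumes outcomes are known promptly, so the delay problem discussed above applies here too.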
in a world where you have many options and have to figure out which is best by repeated experimentation, but where experimentation itself has some cost, you have a multi-armed bandit problem. (the name is supposed to evoke a room full of slot machines -- you want to find the one with the highest payouts by repeatedly playing them, while losing as little money as possible before you find it.)
for example, if you have a few medications, you might start by trying them all equally at random and then as data comes in, use a bandit algorithm to gradually shift more and more new patients onto the ones that prove most effective, in a way that optimally trades off accurately estimating the effects with wasting time testing the less effective drugs.
interestingly, the first formulation of the problem is due to Dr. Thompson at the Yale Pathology Department in the 1930s; he came up with Thompson sampling. So these are techniques that were originally designed for medical trials.
I think that designers of medical trials probably do have a good grasp of this stuff (some statistical estimators that originated in the medical world have even been successfully imported into reinforcement learning/MAB research) so probably they would be using a bandit-like technique if they felt it made sense.
Every patient would be treated with a randomly chosen drug/treatment. As treatment results accumulate, a multi-armed bandit algorithm would adjust the probabilities so that the most effective treatment is used more often.
For example, in Thompson sampling the probability of choosing an option is equal to the probability of that option being the best one, given the evidence so far.
The aim is to maximize reward (successful treatments) while spending as little time as possible on exploration (testing less effective treatments).
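A minimal sketch of Thompson sampling for success/fail outcomes, to make the description above concrete. The assumptions are mine: each treatment has a fixed but unknown success rate, the result of each treatment is known before the next patient arrives (exactly the immediate feedback that real outcomes here lack), and each arm starts from a uniform Beta(1, 1) prior. The names `thompson_choose` and `simulate` and the success rates in the usage line are made up for illustration.

```python
import random


def thompson_choose(successes, failures, rng):
    """Draw one sample from each arm's Beta posterior and pick the largest draw.

    Doing this picks each arm with probability equal to the posterior
    probability that it is the best arm.
    """
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(true_rates, n_patients, seed=0):
    """Assign patients one at a time, updating counts after each outcome."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_patients):
        arm = thompson_choose(successes, failures, rng)
        # Simulated outcome; in reality this is the observed treatment result.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


successes, failures = simulate([0.3, 0.5, 0.7], n_patients=3000, seed=1)
patients_per_arm = [s + f for s, f in zip(successes, failures)]
print(patients_per_arm)
```

Early on the three arms get roughly equal traffic. As evidence accumulates, most new patients end up on the arm with the highest success rate, while the weaker arms still receive an occasional patient.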