How about the same problem, but with each Jar having:
a) a known fail percentage - 40% of the time the Jar fails and produces nothing. Maximize expected win.
b) an unknown fail percentage, evenly distributed between 0 and 100%. Find a strategy that maximizes expected win over many runs (each run has new fail probabilities), by perfectly balancing between exploration of jars and exploitation. If you can find an optimal (and practical) strategy for this one I applaud you!
Also, I solved your example with a simple Python brute-forcer with < 1s run time. I don't know if I care enough to write a parser of your file format just to mail it in ;).
Under 1 sec is amazing. When you say brute force, do you mean that for the first step (TS 1), you took all the three routes available, [0,0], [1,0], [2,0] for jar 1 and 2. Cause that just makes the number of branches too many to be solved in a reasonable amount of time. I would love to have a peek at you code.
It is real brute force, but I do not investigate some stupid nodes. If I choose to not use an available jar during a timestep, I disqualify that jar, until another jar has been used, since there is no point in waiting to use a jar unless you want to save your balls for another jar.
The code is ugly and undocumented. I think the same could be accomplished in less than 10 lines of Haskell :).
http://pastebin.com/MfXK9fwS
An implementation detail is that my branches are actually "jar1" "jar2" and "stepforward". To use a jar many times, you do jar1, jar1, stepforward.
As my first erlang program, I tried to solve this problem. The code is very ugly, and rather long for my taste, but it solves the thing in 917 steps. I also took the case where if no output has been gotten in a particular timestep, then do not consider any jars.