|
|
|
|
|
by ericd
1655 days ago
|
|
Hm not an expert in this, but would something with a world model help, rather than depending on stochastic random action choices? It seems like it should be possible to learn that a frame sequence where you've been next to a bomb for 6 ticks is rapidly decreasing your expected score, and that your score would be significantly better if you weren't in line with the bomb pretty soon. |
|
The classifier has on average 90% accuracy which seems good. I then use the likelihood predicted by this classifier to compute the weight with which I want to train each action and if I want to train it positively (by pulling its likelihood of being chosen up) or negatively (pushing the likelihood of that action down).
However, what this model cannot correctly represent is the fact that whether or not a given situation will turn out to be good or bad in the long term is highly dependent on how you play. So if I train this with replay data, I will score the situations in relation to how well those (outdated) AIs could take advantage of them.
Next up, I'll try to fix this issue by introducing a graph-like stochastic structure. The basic idea is that I encode "from this state S if I take action A, then I can reach state T with P percent likelihood" into yet another neural network. If I then identify a state which is really beneficial in the sense that I can reliably convert it into an advantage, then I can use this graph to back-propagate that knowledge so that I get "from this state S, action A takes me to state T, then action B takes me to state U, and U is great".
That should allow me to train with historical data to identify which transitions are possible, and then I can combine that with realtime data about the desirability of each state. So basically I'd do A* pathfinding over the graph of possible states to identify which actions are needed to bring me from my current situation into the closest "I will surely win" situation. Except that the graph is memorized by an AI because the real state-space is huge: 15x15 fields with 6 units + 5 environment states => roughly 11^(15*15) states