|
|
|
|
|
by martbakler
458 days ago
|
|
Currently it's a text-only modality environment but we are planning to support vision in the future. We did run a couple of tests and saw that including screenshots of the game state did not improve performance on the off-the-shelf models. As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions, entities etc or weren't capable of troubleshooting factories with apparent mistakes (i.e missing transport belt, wrongly rotated inserter). We think it's because the VLMs currently aren't good at spatial reasoning in high-detailed images, likely this would improve significantly with finetuning Good point with MCP as well given it has been blowing up lately, we'll look into that! |
|