| HN Mirror


	by martbakler 505 days ago
	Currently it's a text-only environment, but we are planning to support vision in the future. We did run a couple of tests and saw that including screenshots of the game state did not improve performance on the off-the-shelf models. As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions, entities, etc., or weren't capable of troubleshooting factories with apparent mistakes (e.g. a missing transport belt or a wrongly rotated inserter). We think it's because VLMs currently aren't good at spatial reasoning in highly detailed images; this would likely improve significantly with finetuning.

	Good point about MCP as well, given it has been blowing up lately. We'll look into that!

2 comments

vessenes 505 days ago

That makes sense and it’s really interesting. It is a challenging visual test for sure: thousands of entities, and either multi-tier visual representations (screen, map, overview map) or a GIANT high-res image. I hereby propose FLE-V, a subset benchmark for visual models where they just turn a Factorio image into a proper FLE description. And maybe the overview and map images as well.

kridsdale1 505 days ago

Such research could have hundreds of billions of dollars in downstream GDP implications when applied to real industrial settings.

dismalpedigree 505 days ago

Not to mention the increased productivity of everyone no longer wasting their time in Factorio (myself included) because the optimal solution is known.

lukan 504 days ago

Not wasted time, you were doing research it seems.

dismalpedigree 503 days ago

Good point. My wife will surely understand if I explain it as “research”.

vessenes 505 days ago

Well I better get training!

grayhatter 505 days ago

> As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions, entities etc or weren't capable of troubleshooting factories with apparent mistakes (i.e missing transport belt, wrongly rotated inserter). We think it's because [...]

I think you just described a research paper that would advance SOTA. Less describing why, more describing how. (Assuming it's not just "we finetuned the model and it worked perfectly".)

martbakler 505 days ago

Sounds almost like a visual "needle in a haystack" type of task. That could be quite interesting!

pyinstallwoes 505 days ago

A Where’s Waldo test for VLMs.
