by naed90
591 days ago
Hey, developer of Oasis here! You are very correct. Here are a few points:
1. We trained the model with context windows as long as 30 seconds. So what's the problem? It barely pays any attention to frames beyond the last few. That makes sense, since what the model learns to attend to comes down to the loss function used during training. We are now running many different training runs to experiment with a better loss function (and data mix) to solve this issue. You'll see newer versions soon!
2. In the long term, we believe the "ultimate" solution is 2 models: 1 model that maintains the game state + 1 model that turns that state into pixels. Think of the first model as something resembling an LLM that takes the current state + the user action and produces the new state, and the second model as a diffusion model that takes this state and maps it to pixels. This would give the best of both worlds.
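
To make the split in point 2 concrete, here is a toy sketch in Python. Nothing here is Oasis code: `GameState`, `state_model`, `render_model`, and `play` are hypothetical names, and both "models" are hand-written functions standing in for the learned LLM-like state model and the diffusion renderer. The only thing it tries to show is that the world persists because it lives in an explicit state, not in the last few frames.

```python
from dataclasses import dataclass, replace

# Hypothetical explicit game state (an assumption for illustration only).
@dataclass(frozen=True)
class GameState:
    x: int
    y: int
    blocks: frozenset  # (x, y) cells that contain a placed block

MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}

def state_model(state, action):
    """Stand-in for model 1: (current state, user action) -> new state."""
    if action in MOVES:
        dx, dy = MOVES[action]
        target = (state.x + dx, state.y + dy)
        if target in state.blocks:
            return state  # blocked: the player cannot walk into a block
        return replace(state, x=target[0], y=target[1])
    if action == "place":
        # Place a block in the cell to the right of the player.
        return replace(state, blocks=state.blocks | {(state.x + 1, state.y)})
    return state  # unknown actions leave the state unchanged

def render_model(state, width=5, height=3):
    """Stand-in for model 2: state -> 'pixels' (rows of characters here)."""
    rows = []
    for y in range(height):
        row = ""
        for x in range(width):
            if (x, y) == (state.x, state.y):
                row += "@"
            elif (x, y) in state.blocks:
                row += "#"
            else:
                row += "."
        rows.append(row)
    return rows

def play(state, actions):
    """Run the two-stage loop: update the state, then render a frame from it."""
    frames = []
    for action in actions:
        state = state_model(state, action)
        frames.append(render_model(state))
    return state, frames

start = GameState(x=0, y=0, blocks=frozenset())
final_state, frames = play(start, ["place", "right", "down", "right", "up"])
print("\n".join(frames[-1]))
```

The renderer never sees earlier frames, yet the block placed in the first step is still there in the last frame, because it is carried in the state. That is the property a pixels-only model with a short effective context struggles to keep.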