by encrux
138 days ago
Very much depends on what you want to do. The fact that a language model can "reason" (in the LLM-slang meaning of the term) about 3D space is an interesting property. If you give a text description of a scene and ask a robot to perform a peg-in-hole task, modern models are able to solve it fairly easily based on movement primitives. I implemented this on a UR robot arm back in 2023. The next logical step is, instead of having the model output text (code representing movement primitives), to have it output tokens in action space. This is what models like pi0 are doing.
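To make the "text representing movement primitives" idea concrete, here is a rough sketch of the receiving side: a small parser that takes the model's text output and turns it into validated commands before anything reaches a robot. This is not the setup from the UR arm project. The primitive names (`move_to`, `move_down`, `open_gripper`, `close_gripper`), their argument counts, and the coordinates are all made up for illustration, and no real robot API is called.

```python
import re
from dataclasses import dataclass

# Hypothetical primitive vocabulary: name -> number of numeric arguments.
# These names are assumptions for illustration, not a real robot API.
PRIMITIVES = {"move_to": 3, "move_down": 1, "open_gripper": 0, "close_gripper": 0}


@dataclass(frozen=True)
class Command:
    name: str
    args: tuple


# Matches a single call such as "move_to(0.4, 0.1, 0.2)".
_CALL = re.compile(r"^(\w+)\((.*)\)$")


def parse_plan(text):
    """Turn model output (one primitive call per line) into validated commands."""
    commands = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _CALL.match(line)
        if match is None:
            raise ValueError(f"not a primitive call: {line!r}")
        name, arg_text = match.groups()
        if name not in PRIMITIVES:
            raise ValueError(f"unknown primitive: {name}")
        args = tuple(float(a) for a in arg_text.split(",") if a.strip())
        if len(args) != PRIMITIVES[name]:
            raise ValueError(
                f"{name} expects {PRIMITIVES[name]} args, got {len(args)}"
            )
        commands.append(Command(name, args))
    return commands


# What a model might emit for a peg-in-hole task (made-up coordinates, metres).
model_output = """
move_to(0.40, 0.10, 0.20)
close_gripper()
move_to(0.55, -0.05, 0.20)
move_down(0.08)
open_gripper()
"""

plan = parse_plan(model_output)
for cmd in plan:
    print(cmd.name, cmd.args)
```

The point of the whitelist is that the model never gets to run arbitrary code: anything outside the known primitives is rejected before execution. A model that emits action tokens directly skips this text layer entirely.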
The latter part is interesting. I'm not sure how one of those would perform once they are working well, but my naive gut feeling is that splitting the language part and the driving part into two delegates is cleaner, safer, faster and more predictable.