For example:
1. 'grab that red ball'
2. 'turn the handle on the door 90 degrees, then pull it out'
Just like how people would learn it, each training example is a video, a written instruction, and a label indicating whether the task was a success or not. Then you show the model a different setting and a new instruction. If it has generalized and understood the semantics behind the instructions, it should carry out the new one successfully.
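To make the data format concrete, here is a rough sketch of how one such example and the generalization check might be represented. Everything named here is an assumption for illustration only: the `Demonstration` class, the video paths, the success labels, and the `attempt` callable (a stand-in for whatever model is being tested) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Demonstration:
    """One training example: a video, an instruction, and a success label."""
    video_path: str   # hypothetical path to the recorded video
    instruction: str  # the natural-language instruction
    success: bool     # whether the task in the video succeeded


# Hypothetical training examples built from the two instructions above.
train_set = [
    Demonstration("videos/demo_01.mp4", "grab that red ball", True),
    Demonstration(
        "videos/demo_02.mp4",
        "turn the handle on the door 90 degrees, then pull it out",
        False,
    ),
]


def success_rate(
    attempt: Callable[[str, str], bool],
    held_out: List[Tuple[str, str]],
) -> float:
    """Fraction of unseen (setting, instruction) pairs carried out successfully.

    `attempt` is a placeholder for the model: it takes a setting and an
    instruction and returns True if the task was completed.
    """
    if not held_out:
        return 0.0
    wins = sum(1 for setting, instruction in held_out if attempt(setting, instruction))
    return wins / len(held_out)
```

The held-out pairs should use settings and instructions that never appear in `train_set`, so a high success rate reflects generalization rather than memorization.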