Multimodal dataset should be multimedia: text, audio, images, video, and optionally more like sensor readings and robot actions.