
That's not the dataset used for training. From the paper:

>We train our T2V model on a dataset containing 30M videos along with their text caption. [...] We evaluate our model on a collection of 113 text prompts describing diverse objects and scenes. The prompt list consists of 18 prompts assembled by us and 95 prompts used by prior works (Singer et al., 2022; Ho et al., 2022a; Blattmann et al., 2023b) (see App. B). Additionally, we employ a zero-shot evaluation protocol on the UCF101 dataset