It means Millions of tokens. Token in this context means either text token (as in tokenization https://en.wikipedia.org/wiki/Large_language_model#Probabili...) or video token where the paper describes them as "each frame in the video is tokenized with VQGAN into 256 tokens." (p. 6)