Yeah! Currently you can mainly work with images and arbitrary vectors (that's what the bounding box examples we show use, for instance), and we plan to add native support for text, video, etc. over time.
I guess another question here is what the heuristics are for how many images are necessary for different levels of functionality. The demos look pretty impressive, but I'm not sure how much went into them.
We've been surprised by how little data folks have needed. If you look at the examples page, you'll see in the lower right-hand corner of each screenshot the number of examples they uploaded and trained on. For some examples, like the water tank, it's fine to some extent if the model overfits the training data, because the Nest cam will only ever be pointed at the water tank. It's worked in all situations and been robust for us with only ~500 examples. Other times folks are more interested in prototyping an idea to see if it's feasible at a wider scale, and a small dataset works well for that.
It goes the other way as well and supports generation.