Hacker News
by throwthrowuknow 756 days ago
We’ll have to wait and see how far multimodal training takes us. Text-only models are extremely limited by the kind of information we can encode as text and by the detail that gets lost along the way, e.g. the word “cat” vs. an image of a cat vs. video of a cat vs. direct physical interaction with a cat vs. being a mammal that shares a great deal of biology with a cat. You need a table and beans before you can invent a method for counting them.