by Oras 1066 days ago
According to the website, the model can then be fine-tuned for specific tasks such as image classification.

1. How does the multimodal approach help improve image classification accuracy when the training data combines text, images, and audio?

2. What about speed? I would imagine a model trained on text, audio, and image data would be larger than a text-only model.