by efavdb 13 days ago
The article says this misses important details, e.g. data that might be in the image.
1 comment

Very bad take. With most modern multimodal models you get way better performance than by going to text first.
It's a cost/latency trade-off in production, and very use-case dependent.