Exactly. Besides cost, domain-specific models (which can still be very large) encode our biases (i.e., our knowledge about the domain) in their architecture. Because of that, we have ways to calibrate their accuracy trade-offs over an in-domain sample.

For LLMs, there is a disconnect between the perceived domain (anything a human can think and verbalize) and the actual domain (word sequence prediction). We only know how to sample from the latter, not the former.

This "silver bullet" idea sounds a lot like a "free lunch". The bar for ML practice has been lowered to make way for this onslaught of prototype-level productivity. Teams that used to do their best to run uncorrelated evals now have their prompt engineers manually inspect ~100 model outputs before launch.
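
To put a rough number on how little ~100 hand-inspected outputs can tell you, here is a minimal sketch using a Wilson score interval for a pass rate. The function name and the 90-out-of-100 figure are made up for illustration, and the interval assumes the inspected outputs are independent draws from the inputs you actually care about, which is exactly the part we don't know how to guarantee for LLMs.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence. Assumes the n
    trials are independent and drawn from the target distribution.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical result: 90 of 100 inspected outputs judged acceptable.
    low, high = wilson_interval(90, 100)
    print(f"observed 0.90, plausible range {low:.3f} to {high:.3f}")
```

With these made-up numbers the plausible range spans more than ten percentage points, and that is before accounting for the inspected prompts being correlated with the ones used to tune the prompt in the first place.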

People were once shocked by how accurately n-gram models and user interaction data could "read their mind" and complete their searches. Now we're obviously all impressed by the emergent abilities of LLMs (integrated with lots of business logic by the LLM providers). Hopefully, in time, we'll get desensitized a bit and bring the right mental model to these tools.

P.S. I love your username :)