| At work we faced this dilemma. Our team is working on a model to detect fraud/scam messages; in production it needs to label ~500k messages a day at low cost. We wanted to train a basic GBT/BERT model to run locally, but we considered using GPT-4 as a label source instead of our usual human labelers. For us human labeling is surprisingly cheap, so the main advantage of GPT-4 would be speed: since scams are always changing, we could generate new labels regularly and continuously retrain our model. In the end we didn't go down that route because of several problems: - GPT-4's accuracy wasn't as good as that of human labelers. I believe this is because scam messages are intentionally tricky and require a much more general understanding of the world than the datasets used in this article, which feature simpler labeling problems. Also, I don't trust that there was no funny business going on in generating the results for this blog, since there is a clear conflict of interest with the business that owns it. - GPT-4 would be consistently fooled by certain types of scams, whereas human annotators work off a consensus procedure. This could probably be solved in the future, when there's a larger pool of other high-quality LLMs available and we can pool them for consensus. - Concern that some PII accidentally gets sent to OpenAI; of course nobody trusts that those guys will treat our customers' data with any level of appropriate ethics. |
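
For what it's worth, the consensus idea from the second point could look roughly like the sketch below. This is a plain majority vote, not our actual annotation procedure; the function name `consensus_label`, the `min_agreement` threshold, and the label strings are all made up for illustration. The votes could come from human annotators or from several different LLMs.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.5):
    """Return the most common label if its share of votes exceeds
    min_agreement, otherwise None (meaning: escalate for review).

    Hypothetical helper for illustration only.
    """
    if not votes:
        return None
    # most_common(1) returns a list with one (label, count) pair.
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Three labelers (human or LLM) vote on one message.
print(consensus_label(["scam", "scam", "not_scam"]))
```

A tie or an empty vote list returns `None`, so those messages can be routed to an extra reviewer instead of getting a coin-flip label.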