by z4y5f3 905 days ago
Unfortunately, GZIP won't beat LLMs for text classification. The research you cited is poorly done science that has been widely debunked. The original paper compared the top-2 accuracy of GZIP with the top-1 accuracy of BERT, and the dataset also contains a lot of train/test leakage. See this article for the rebuttal: https://kenschutte.com/gzip-knn-paper/ and this thread for a previous discussion on Hacker News: https://news.ycombinator.com/item?id=36758433.
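To make the top-1 vs. top-2 point concrete, here is a minimal sketch of gzip-based nearest-neighbor classification with both scoring rules. It is not the paper's code. The function names (`ncd`, `nearest_labels`, `top1_correct`, `top2_correct`) are my own, and the top-2 rule reflects my reading of the rebuttal: a prediction counts as correct if either of the two nearest neighbors has the true label.

```python
import gzip

def clen(s: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of s."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def nearest_labels(query: str, train: list[tuple[str, str]], k: int) -> list[str]:
    """Labels of the k training texts closest to query, nearest first."""
    ranked = sorted(train, key=lambda pair: ncd(query, pair[0]))
    return [label for _, label in ranked[:k]]

def top1_correct(neighbor_labels: list[str], truth: str) -> bool:
    # Strict scoring: only the single nearest neighbor counts.
    return neighbor_labels[0] == truth

def top2_correct(neighbor_labels: list[str], truth: str) -> bool:
    # Lenient scoring: a hit if EITHER of the two nearest neighbors matches.
    # This can only be >= the top-1 number, so comparing it against another
    # model's top-1 accuracy is not apples to apples.
    return truth in neighbor_labels[:2]
```

If the two nearest neighbors have labels `["sports", "politics"]` and the true label is `"politics"`, the example is wrong under top-1 and right under top-2. That gap is the inflation the rebuttal is about.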

Further, the evidence presented by NYT in the lawsuit could be hard to reproduce. I tried multiple prompts on multiple versions of the GPT-4 API but still could not get GPT-4 to reproduce NYT articles exactly. NYT might well have tried to get GPT-4 to reproduce 100,000 articles and found only a few cases where GPT-4 actually recited the whole article. In that case, OpenAI could argue that this is only a rare bug and avoid losing the lawsuit in a massive way.
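For anyone who wants to repeat this kind of check, here is a rough sketch of how one might measure verbatim overlap between a model's output and a source article. It is not what I or NYT actually ran. The function names are hypothetical, the model output is assumed to already be a plain string (no API call is shown), and the longest shared character run is just one possible metric.

```python
from difflib import SequenceMatcher

def longest_verbatim_run(output: str, article: str) -> str:
    """Longest contiguous substring that appears in both texts."""
    # autojunk=False stops difflib from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, output, article, autojunk=False)
    m = matcher.find_longest_match(0, len(output), 0, len(article))
    return output[m.a : m.a + m.size]

def verbatim_fraction(output: str, article: str) -> float:
    """Share of the article covered by the single longest shared run."""
    if not article:
        return 0.0
    return len(longest_verbatim_run(output, article)) / len(article)
```

A fraction near 1.0 would mean the article was recited almost in full. Small values mean only short phrases were shared, which is what I saw in my own attempts.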