
If this is true, then why do you think that, as the OP article states, the developers of GPT-3 chose to use non-ML-based techniques to deduplicate their dataset, when they would be the best equipped to use ML-based approaches?

Just the pure compute cost of needing to run an ML encoder over petabytes of data?

Or maybe because for their use case (eliminating redundancy to reduce total dataset size and therefore training time) a non-domain-specific vectorization with a high false-negative rate in cluster discovery was acceptable, because it just meant they'd "compress" the dataset slightly less well, and so spend slightly more time training? (At the expense of increased bias in training toward the saliency of the features that weren't dedup'ed out; but that was going to happen regardless, and they likely already had a fully general technique later in the pipeline for countering that.)
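
To make "non-domain-specific vectorization" concrete, here is a minimal sketch of one common non-ML approach: MinHash signatures over word shingles, with a greedy pass that drops near-duplicates. I'm assuming this family of technique for illustration only; it is not taken from the OP article. The function names, the shingle size `k`, `num_hashes`, and the `threshold` are all hypothetical choices, and a real pipeline would use LSH banding rather than this all-pairs comparison.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in text (k is an illustrative choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash64(shingle, seed):
    """Deterministic 64-bit hash of a shingle, salted by the seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=64, k=3):
    """Signature = the minimum hash value under each of num_hashes seeds."""
    sh = shingles(text, k)
    return [min(_hash64(s, seed) for s in sh) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs, threshold=0.5, num_hashes=64):
    """Greedy O(n^2) dedup: keep a doc only if it is not close to one already kept."""
    kept = []  # list of (doc, signature) pairs
    for doc in docs:
        sig = minhash_signature(doc, num_hashes)
        if all(estimated_jaccard(sig, s) < threshold for _, s in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]
```

Note the trade-off from the comment above: with few hashes or a high threshold, near-duplicates slip through (false negatives), which only costs some extra dataset size rather than correctness.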