| I am really interested in whether that actually matters. Package managers often come with a rating system. npmjs has weekly downloads, pull requests, and other popularity scores. I am a layman in AI, but why would anyone think that this would affect anything, like AI? Why would anyone train on a no-name package that no one uses? Spam packages can have higher-than-zero stats, but that also makes them vulnerable to a sweep removal of all potential spam packages, since they are connected, etc. Any credible company will not use a no-name spam package, and will verify its contents. That is at least what happened in all the companies I have worked for. |
…almost certainly for the same reason that any “train AI using only good data, reduce hallucinations!” suggestion is in the “daydream” rather than “great idea” category.
Creating high-quality filtered datasets is enormously more time-consuming and expensive than just dumping in everything you can get your hands on.
It seems obvious to ignore packages that are obviously unused spam, but TL;DR: no idiot is going to pour spam into npm unless there’s some kind of benefit from it: people accidentally using it, it getting mixed into the dependency tree of legit packages, etc.
It’s more likely that the successful folk doing this aren’t being caught, and the ones being caught are “me too” idiots. Or the spam is working and people are (for whatever incomprehensible reason) actually using at least some of the packages.
TL;DR: if dependency auditing and supply chain attacks were trivial to solve, they wouldn’t be a problem.
…but given that we keep seeing these issues, you can assume the problem is harder to solve than it appears at first glance.