by renegat0x0 680 days ago

I am really curious whether that matters.

Package managers often come with a rating system. npmjs has weekly downloads, pull requests, and other popularity scores.

I am a layman in AI, but why would anyone think that this would affect anything, like AI? Why would anyone train on a no-name package that no one uses?

Spam packages can have higher-than-zero stats, but that also makes them vulnerable to a sweeping removal of all potential spam packages, since they are connected to each other.

A credible company will not use a no-name spam package, and will verify the contents of what it does use. That is at least what happened in all the companies I have worked for.

4 comments

wokwokwok 680 days ago

> why would anyone think that this would affect anything, like AI? Why would anyone train on a no-name package that no one uses?

…almost certainly for the same reason that any “train AI using only good data, reduce hallucinations!” suggestion is in the “daydream” rather than “great idea” category.

Creating high-quality filtered datasets is enormously more time-consuming and expensive than just dumping in everything you can get your hands on.

It seems obvious to ignore packages that are obviously unused and spam, but TL;DR: no idiot is going to be pouring spam into npm unless there’s some kind of benefit from it: people accidentally using it, mixing it into the dependency tree of legit packages, etc.

It’s more likely that the successful folk doing this aren’t being caught, and the ones being caught are “me too” idiots. Or, the spam is working and people are (for whatever incomprehensible reason) actually using at least some of the packages.

TL;DR: if dependency auditing and supply chain attacks were trivial to solve, they wouldn’t be a problem.

…but based on the fact that we continue endlessly to see these issues, you can assume that it’s probably more difficult to solve than it appears.

andai 680 days ago

Daydream? It worked for Phi.

wokwokwok 680 days ago

This is such a low-effort, insincere comment I can barely be bothered to respond to it… but TL;DR: no, it didn’t.

If it was easy, people would have done it. It’s not easy. Phi is not a state-of-the-art model. It does not perform significantly better than, or even on par with, larger models.

Yes, I’ve read the tech reports and used it. No, I don’t believe it has any meaningful bearing on the problem in question here, which, again, I explicitly posit is basically unsolvable:

Given a large user-contributed repository of code (npm), it’s very hard to determine “good” from “bad” in terms of quality at scale, when you have malicious actors.

…I mean, it’s not impossible with enough time and effort I suppose, but if Microsoft, who own npm, have a good way of filtering out bad content on it for their language models, you’ve really got to ask why the duck they’re using it for their language models, and you know, not to unduck npm…

andai 679 days ago

I'm confused. Are you saying that removing low quality inputs from training data doesn't improve a model? (Or conversely, adding high quality inputs.) Or are you saying that we don't yet have the technology to reliably do this at scale?

wokwokwok 679 days ago

Again, I can’t comprehend how this can possibly be ambiguous from my comment, but the second one.

We don’t (by all accounts, no one does) have a way to create this kind of dataset at scale, in this kind of complex user-contributed content environment (specifically npm and other places like it).

andai 679 days ago

Microsoft's curation techniques for the Phi models remain proprietary. So we can't really criticize or praise their methods, because we don't know what they are. It might be GPT-4. It might be Artificial Artificial Intelligence (a warehouse in Pakistan). But the results speak for themselves.

The models are a bit janky in my testing (especially prone to leaking test materials, and highly specialized on a narrow domain), but fantastic for their size.

Intentional "under-generalization" seems like a fairly self-evident approach to making optimal (and economical, on the training side) use of smaller models.

As for whether it works for a general purpose model, my intuition says that it does (i.e. cutting off the "long tail of knowledge" in favour of a better handling of the mainstream, by the limited neurons available).

As for whether that tech exists, I reckon a simple tf-idf would get you 80% of those wins, but that might be ignorance/arrogance on my part.
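The comment does not say how tf-idf would be applied, so the following is only one possible reading: rank candidate documents by tf-idf cosine similarity to a small, trusted reference text. The function names, the smoothed idf formula, and the sample strings are all assumptions made for illustration, not anything Phi or npm is known to use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    # Build one sparse tf-idf vector (dict of term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: count each term once per doc
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed idf (an assumed variant): log((1 + n) / (1 + df)) + 1
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(reference, candidates):
    # Hypothetical helper: score each candidate against a trusted reference text.
    vectors = tfidf_vectors([reference] + candidates)
    ref_vec = vectors[0]
    scored = [(cosine(ref_vec, v), c) for v, c in zip(vectors[1:], candidates)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    reference = "parse json config file and validate schema"
    candidates = [
        "free coins airdrop claim now airdrop",
        "validate a json config file against a schema",
    ]
    for score, text in rank_by_similarity(reference, candidates):
        print(f"{score:.3f}  {text}")
```

A score like this only measures vocabulary overlap with the reference, so it says nothing about whether code is malicious; it is closer to a cheap first-pass filter than to the curation problem argued about upthread.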

Too 680 days ago

If you look at the purpose of this Tea protocol it is exactly to provide a chain of credibility. Though, by connecting ranking with monetization, tea has created perverse incentives, leading spammers to pump up their tea ranking, by linking and starring packages in circles. Their goal is to make it look like it’s a highly used package.

Luckily, nobody thinks that tea ranking matters, except for the spammers themselves.

They are no doubt attempting to poke at other, more established metrics as well. This could eventually fool an AI or even humans.
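As a rough illustration of the "linking and starring packages in circles" pattern described above, one heuristic is to look for strongly connected components in the package link graph: groups where every package can reach every other one. This is a sketch under assumptions. The graph shape (package name mapped to the packages it depends on or stars), the `min_size` threshold, and the package names are hypothetical, and nothing here reflects how tea or npm actually detect spam.

```python
def strongly_connected_components(graph):
    # Tarjan's algorithm (recursive, so only suitable for small graphs).
    index = {}
    lowlink = {}
    on_stack = set()
    stack = []
    components = []
    counter = [0]

    def visit(node):
        index[node] = lowlink[node] = counter[0]
        counter[0] += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt not in index:
                visit(nxt)
                lowlink[node] = min(lowlink[node], lowlink[nxt])
            elif nxt in on_stack:
                lowlink[node] = min(lowlink[node], index[nxt])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return components


def suspicious_rings(graph, min_size=3):
    # Hypothetical heuristic: flag groups of packages that all link to each other in a loop.
    return [
        sorted(component)
        for component in strongly_connected_components(graph)
        if len(component) >= min_size
    ]


if __name__ == "__main__":
    links = {
        "spam-a": ["spam-b"],
        "spam-b": ["spam-c"],
        "spam-c": ["spam-a"],
        "real-lib": ["spam-a"],
        "standalone": [],
    }
    print(suspicious_rings(links))
```

Legitimate packages can also form cycles, so a ring like this would at most be a signal for closer review, not proof of spam.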

mrweasel 680 days ago

> Why would anyone train on a no-name package that no one uses?

Not that I disagree, but in the same line of thinking: Why would anyone train an LLM on some random blog written in broken English? Why would you train an LLM on the absolute dumpster fire that is Reddit comments? Or why are my GitHub repos with half-finished projects and horribly insecure coding practices being used as input to Copilot? Yet here we are, LLMs writing broken, insecure code (just like a real person) and telling people to eat rocks.

yas_hmaheshwari 680 days ago

Agree! Not only in companies: I have never seen anyone download a package without looking at its GitHub stars.

The real fun would happen if the next incentive is to publish a package and get GitHub stars for that repo :-)
