Hacker News
by tantalic 1209 days ago
At what point do we ask if training from “datasets crawled from the internet” is itself the greater poison?
3 comments

The internet is the representation of the human "meta-mind".

Organisations are seen as a slow form of AI. Their decision making is different to what each individual would make so it represents a different form of "mind".

All humanity (to some definition of all) is also a "mind" - it's currently trying to decide on problems like "climate change".

The workings of that mind, a brain scan if you like, is the internet. It's a map of the state of each neuron (my twitter history?) and the interconnections between those are how the brain thinks. And we can see into the workings of that mind, and indeed alter it.

AI trained from that "brain scan" is simply a model of the human meta-mind that we can play with faster.

Any problems with ChatGPT are therefore problems with humanity.

Maybe

It's a representation, not the representation.

By looking at the internet, especially web 2 content, you're getting what the engagement algorithms have decided is good for advertisers.

There's plenty of stuff that humanity does that the internet does not incentivize and thus has no representation for.

Yea, this.

The same point or a similar critique can be made a few ways, I’d say.

Running with the brain/neuron analogy, there’s a measurement problem (as there is in real neuroscience!). The synaptic activity of the “meta-mind” has been recorded with keyboards, smartphones and plain text. These aren’t the native ways of human communication though, the synapses if you will. Those are more like spoken conversation and physical interaction, which are all richer phenomena.

To the extent that “textual” communication is now native/normal to humanity, it’s still partial in coverage of all human interaction, new, and shifting with tech developments like video/streaming.

So the internet is a lossy representation, apart from whatever other biases it might have, as suggested above.

Do the datasets follow the algorithmic weightings? I thought they included all content for their domains without weighting by popularity / engagement algorithm.
About 67% of the world's population uses the internet, which leaves about 2-3bn people not represented. And among those who are online, only about 0.001% actually post and contribute content that is available on the open web, and I'm willing to bet that population of contributors does not represent the world demographic.
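
For anyone who wants to sanity-check those figures, here is a rough back-of-the-envelope sketch. The percentages are the ones quoted in the comment; the world population of about 8 billion is my own assumption and is not stated anywhere in the thread, and the function name `estimate` is purely illustrative.

```python
# Back-of-the-envelope check of the figures quoted in the comment above.
# ASSUMPTION: world population of roughly 8 billion (not stated in the thread).
WORLD_POPULATION = 8_000_000_000

ONLINE_SHARE = 0.67          # "about 67%" from the comment
CONTRIBUTOR_SHARE = 0.00001  # "about 0.001%" from the comment, as a fraction


def estimate(population: int = WORLD_POPULATION) -> dict[str, int]:
    """Return rounded head counts implied by the quoted percentages."""
    online = round(population * ONLINE_SHARE)
    offline = population - online
    # Contributors are taken as a share of people online, not of everyone.
    contributors = round(online * CONTRIBUTOR_SHARE)
    return {"online": online, "offline": offline, "contributors": contributors}


if __name__ == "__main__":
    for name, count in estimate().items():
        print(f"{name:>12}: {count:,}")
```

Under the assumed population, the offline group comes out at roughly 2.6bn, which sits inside the "2-3bn" range given in the comment. Changing `WORLD_POPULATION` or either share shows how sensitive the head counts are to those inputs.
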
Remember, the map is not the territory. And we have many types of maps for many kinds of specialized purposes.

https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation

The other day I read that models like Stable Diffusion can be windows into the human collective subconscious. Not sure if I agree, but it's an interesting theory.
The collective unconscious is defined as the shared mental concepts, or archetypes, of humanity/the noosphere. I'd say this perspective is less a theory and more a rephrasing.

https://en.wikipedia.org/wiki/Collective_unconscious

I wonder the same. Also, scraping for training data feels like something that should be opt-in. I really have a problem with the stance that just because a piece of data is technically accessible, it’s fair game. It also undermines the lineage and trustworthiness of the final model: how does one verify that a model’s predictive outcomes are in line with expectations?
Conversely, an opt-in dataset would surely consist of 99.99% spam.
I think that's easily avoidable - one wouldn't reach out to "spammy" sources in the first place.
Legally, it seems to me that this is as poisonous as "take a shot of everything in your kitchen cupboard and mix it up". It relies on handwaving away both copyright and GDPR concerns.