Hacker News
by jcahill 2163 days ago
GPT-3 is a neat party trick. But the things that'll be done with web archives* in the next 20y will make it look like the PDP-8. ~love, a web archivist

* GPT-3 is trained on one

4 comments

The transformer model as presented in GPT-3 may be a few tweaks away from human-acceptable reasoning, at which point we may realize that the human brain is just a neat party trick as well. This may be difficult for some people to internalize, especially those who understand the technology in depth. Because it means that the medium of our reality is the consciousness.
Was this comment generated by GPT-3?
I doubted that as well, but I don't think it is--at least it's not a simple copy paste. There's an emphasis on _is_ in the last sentence which I don't think the algorithm could have generated.

However, that makes one wonder if it can also learn to generate emphasis, and if so, how it would format it. With voice generation it can simply change its tonality, but with text generation it has to demarcate it in some way--does the human say "format the output for HTML", for instance?

You are confusing pattern matching with reasoning. If your brain were replaced by a GPT-3 model and you were cast away on a distant island, I highly doubt you would be able to perceive, plan and prosper during your survival against all the calamity nature would throw at you.
To be honest, most city-raised humans wouldn't be able to survive on a distant island either.
The transformer model in GPT-3 has a short context window and no recurrence. Without some significant architecture changes that is a fundamental limit on the problems GPT-3 can solve.
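To make the context-window point concrete, here is a minimal sketch, not GPT-3's actual code. It assumes the published 2048-token context length, uses plain integers as stand-in tokens, and `visible_context` is a hypothetical helper name. With no recurrence, nothing outside the window carries over to the next prediction.

```python
# Toy illustration of a fixed context window (assumption: 2048 tokens,
# the context length reported for GPT-3). Tokens here are just integers.
CONTEXT_WINDOW = 2048


def visible_context(tokens, window=CONTEXT_WINDOW):
    """Return the most recent `window` tokens; earlier ones are dropped."""
    if window <= 0:
        raise ValueError("window must be positive")
    return tokens[-window:]


history = list(range(5000))        # a 5000-token "conversation"
visible = visible_context(history)  # what the model can condition on
forgotten = len(history) - len(visible)
```

In this toy run, `forgotten` is the number of early tokens the model has no way to refer back to.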
> Because it means that the medium of our reality is the consciousness.

I agree. The environment, as the source of learning and forming concepts, is the key ingredient of consciousness, not the brain.

I don't fully understand what you're getting at here...

Basically the brain and "consciousness" aren't as fancy as we think?

Exactly.
No pressure: feel free to ignore me, please. Would you mind elaborating? I'm interested in what you have to say (and, of course, feel free to say it privately if you prefer). I would even like to hear your dreams, wild speculations, or gut feelings about the matter.
Sure, what do you want to know?

I currently work on synbio × web archival.

Some of us are cooking up futuretech aimed at storing all of IA (archive.org) in a shoebox. Others are working on putting archival tools in more normal web users' hands, and making those tools do things that people tend to value more in the short-term, like help them understand what they're researching, rather than merely stash pages.

My ambitions for web archives are outsized compared to other archivists, but I'm fine with that. I'm looking beyond web archives as we currently understand them toward web archives as something else that doesn't quite exist yet: everyday artefacts, colocated and integrated with other web technology to an extent that they serve in essential sensemaking, workflow, and maybe security roles.

Right now, some obvious, pressing priorities are (a) preserving vastly more content and (b) doing more with the archives themselves.

A: The overwhelming majority of born-digital content is lost within a far narrower time-slice than would admit preservation at current rates, and data growth is accelerating beyond the reach of conventional storage media. So, for me, the world's current largest x is never the true object of my desire. I'm after a way to hold the world that is and the world to come.

Ideally, that world to come is one where lifelong data stewardship of everything from your own genome to your digital footprint is ubiquitously available and loss of information has been largely rendered optional.

This, of course, requires magic storage density that simply defies fundamental limitations of conventional storage media. I'm strongly confident that we're getting early glimpses of the first real Magic contenders. All lie outside, or on the far periphery of, the evolutionary tree that got us the storage media we have today. For instance, I'm running an art exhibition that involves encoding all the works on DNA.

B: Distributed archival that comes almost as naturally as browsing is well within reach, and with that comes some very new potential for distributed computation on archives. One hand washes the other.

One important thing to realize here is that, in many cases, you can name a very small handful of individuals as the reason why current archival resources exist. GPT-3 is cracking the surface by training on data produced by one guy named Sebastian, for instance.

…I'm sorta tired and have to respond to something about every Twitter snapshot since June being broken, though, so I'll pick this back up later.

This is an interesting thought. GPT-3 used 45TB of raw CommonCrawl data (which was filtered down to 570GB prior to training). The Internet Archive has 48PB of raw data.
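For a sense of scale, a rough back-of-the-envelope sketch using only the figures quoted above. It assumes decimal units (1 TB = 10^12 bytes), and the variable names are just illustrative.

```python
# Back-of-the-envelope ratios from the sizes quoted in the comment.
# Assumption: decimal (SI) units; binary units would shift results slightly.
GB = 10**9
TB = 10**12
PB = 10**15

commoncrawl_raw = 45 * TB        # raw CommonCrawl data
commoncrawl_filtered = 570 * GB  # after filtering, prior to training
internet_archive = 48 * PB       # raw Internet Archive data

kept_fraction = commoncrawl_filtered / commoncrawl_raw  # share surviving the filter
ia_vs_raw = internet_archive / commoncrawl_raw          # IA relative to the raw crawl
ia_vs_filtered = internet_archive / commoncrawl_filtered

print(f"kept after filtering: {kept_fraction:.2%}")
print(f"IA vs raw crawl: ~{ia_vs_raw:,.0f}x")
print(f"IA vs filtered set: ~{ia_vs_filtered:,.0f}x")
```

So on these numbers, roughly 1.3% of the raw crawl survived filtering, and the Internet Archive is about a thousand times larger than the raw crawl.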
That 48PB is mostly just old video game ROMs and ISOs, though.
Hopefully in a way that secures some funding for those making archives of the web.
I'm running the Coronavirus Archive. Largest thematic archive on the pandemic, since January. I'm also teaching community biolab techniques to people in parts of the world without ready access to commercial COVID-19 test kits, on all but zero resources at this point.

I could use… what's the word? I think it's more funding.