Hacker News
The one database/file/zip to save humanity, what is it?
2 points by eliasgriffin 984 days ago
Hypothetical:

The world is immersed in a global war that has gone digital. Competing Nation-State AIs are waging war against one another in order to deprive perceived enemies of all information.

You have a laptop, able to connect to different networks anywhere in the world, and you have room for one ~100GB database file, or the like, that you intend to share peer-to-peer, in one shot, with as many people as possible.

Which file is it?

9 comments

Related:

Major cyber attack could cost the world $3.5 trillion

https://www.reuters.com/business/major-cyber-attack-could-co...

"The global interconnectedness of cyber means it is too substantial a risk for one sector to face alone and therefore we must continue to share knowledge, expertise and innovative ideas across government, industry and the insurance market to ensure we build society’s resilience against the potential scale of this risk," Lloyd's chairman Bruce Carnegie-Brown said.

- wikipedia

- arxiv

- stackoverflow (with the other specialized subdomains)

- the-eye

But the eye alone is already around 10 PB so I guess we need more storage.

If you want the source code of such an AI, that's a different story. Violence begets only violence, even the most simple neural net realizes that. So an AI, no matter if it's based on reinforcement learning or evolution, won't deprive itself of its own lifeblood. It would rather decide that some parts of knowledge have to be censored or kept secret, and very likely this will be all knowledge about cyber defense, pentesting and exploit development.

So I guess it would make the most sense to preserve the CVE database, the golang ecosystem, PoC or GTFO, and the source code of vx-underground and tutorials of ransomware gangs for educational purposes.

Thank you for this awareness! I had to look up so many things you mentioned. I had some mind-explosions analyzing the-eye project. First of all, they were archiving 20k subreddits daily right up until Reddit changed its API. Maybe the-eye was the real reason!

Arxiv for sure. That was the first place I went; however, I found no way to download the entire database, and scraping it would be a project of its own. It raises the question: how can we as a civilization "own" our own data and not "rent" it through a website or paywall?

Which brings me to LexisNexis. That is a knowledge store we should all be able to access, crucial information about how the world works! It's only available to Governments and Law Firms...

In my hypothetical the Stackoverflow suggestion is on-point and would also be invaluable, right in line with your AI stratagem. The challenge, I would think, would be finding some way to categorize and order all that information to make it applicable. Further, using your game theory of AI, it would also need to be ingestible by defense computer systems, maybe in real time, so that it would not be available/discoverable/decipherable/hackable up until the moment the data needed to be used. Some kind of OCR oracle, a paper-to-digital "process" for AI defense.

I'm still following your breadcrumbs, I could say more!

The point behind a cyber defense system that is incorruptible is that it needs to find ways to represent learning mechanisms in an unsupervised manner. The only thing that is really unsupervised (as of today's state of the art) is an evolutionary concept, which is why I stuck with ES/HyperNEAT in the past. NEAT as a concept allows implementing flexible adapters for all kinds of things, while also allowing a time-aware strategy-planning neural network (the CPPN) if it's e.g. based on unfolding LSTM layer(s). The reason why NEAT fits so nicely for this use case is the predictability of results for given tasks, and the possibility to change the neural layer structures based on the tasks and agents, too. So in our case we built e.g. LSTM layers for data structures that need time awareness, while e.g. a Bayesian layer makes sense to represent an index map that refers to linked knowledge tree branches.

The previous implementation also featured a custom Kademlia DHT which was implemented in a shrinkable and fast-forwardable manner, so that the generation of "already outdated solutions to identical problems" could be skipped entirely by jumping ahead directly to the leaves of the knowledge tree, saving redundant computation time.
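For readers who haven't met Kademlia: below is a minimal sketch of the XOR distance metric that a standard Kademlia DHT is built on. It is not the custom "shrinkable and fast-forwardable" variant described above, whose details aren't given here. The helper names (`node_id`, `bucket_index`, `closest`) are hypothetical, and the 160-bit SHA-1 sized IDs are an assumption taken from classic Kademlia.

```python
import hashlib

ID_BITS = 160  # assumption: classic Kademlia uses 160-bit node and key IDs


def node_id(name: str) -> int:
    """Derive a 160-bit ID from an arbitrary string (hypothetical helper)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia's notion of distance: the bitwise XOR of two IDs."""
    return a ^ b


def bucket_index(self_id: int, other_id: int) -> int:
    """Index of the k-bucket that other_id falls into, as seen from self_id.

    This is the position of the highest bit in which the two IDs differ.
    """
    distance = xor_distance(self_id, other_id)
    if distance == 0:
        raise ValueError("a node does not keep itself in a bucket")
    return distance.bit_length() - 1


def closest(target: int, peers: list[int], k: int = 3) -> list[int]:
    """Return the k known peers closest to target under the XOR metric."""
    return sorted(peers, key=lambda peer: xor_distance(peer, target))[:k]
```

A lookup repeatedly asks the closest known peers for peers that are closer still, e.g. `closest(node_id("knowledge-leaf"), known_peers)`, where `known_peers` is whatever list of IDs the node currently holds.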

And the only swarm intelligence that is well researched enough to represent a delegated learning mechanism based on Bayesian/statistical truths is that of bee swarms [1] :)

In case you are interested in a project like this, that's basically what we are building @ tholian.network

[1] Check out Honeybee Democracy by Thomas Seeley

> the-eye

Cannot find anything useful by googling it. Can you please provide the link?

Don't want them to get bot traffic, but here you go:

the-eye dot eu slash public

C'mon, there has got to be an information store out there better suited to the preservation of human knowledge than Wikipedia. If I were an Alien I would be stunned. Such technological advancement but no one can name "the database"?

There used to be an open source database file, around 74-76GB, that DB devs/admins trained on. It was encyclopedic and well known in that crowd. Anyone know what I mean? I know that's not much to go on, but I cannot for the life of me remember the name or keywords, or find it on the internet.

Represent HN!

Some thoughts.

I should give this topic some time to brew an answer, but at some point we ought to ask ourselves, why don't we have such a thing?

If we were to construct this database, should it not be on the level of the Human Genome Project, but open sourced? Should it not be more fundamental to have this easily transmitted database, maybe even Unified Design, than Defense Budgets and the Billions that must've gone into the Arctic GitHub vault? Ideally, the platform for world-wide human ownership of this project should have been initialized and set up for us to complete by, like, I dunno, the U.N., or funded by all Governments and Corporations as a good faith initiative?

Tangentially, let's also suppose that our not being alone in the Cosmos is probable enough to merit the effort of moving to the next phase, beyond simply sending a Clarke-like DNA/Planet position plaque. Should we not prove our evolution and broadcast this supposed database into space in a loop? The Ascension Beacon?

The excellent comment by @speedghost got me thinking about the human interface with AI from a different perspective. Why is the public focus in AI on language models, but not knowledge models, on how to think about thinking? Is that not "the trick"? Could Humanity not use this itself to better learn to learn? How to categorize, or order, knowledge about knowledge (metadata) is "thetadata" (thinking data)?

Thoughts about a potential Human Knowledge Database. Wikipedia is a mess and is not fit for this purpose in many ways, but the main one is that it contains sparse info on how to do anything. If the goal is to make the database as efficient (small in size) as it can be, and Unified Design, for all Humanity, we might need some kind of new "motion illustration" format that consists of pictures that are flat files but can be post-sequenced to create a "movie" with the lowest possible file size, using vector graphics only? Brainstorming.

That makes me think the database should be some sort of flat file database that is also indexed, and that index can be easily Human parsed or DB software parsed. Quite a feat for 100GB. That is not my lane, maybe it already exists.
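A minimal sketch of what a flat file plus a human-readable index could look like, purely as an illustration and not an existing format: records are stored back to back in one data file, and a tab-separated text index maps each key to a byte offset and length. The function names and the key/offset/length layout are assumptions, and keys are assumed to contain no tabs or newlines.

```python
import io


def build(records: dict[str, str]) -> tuple[bytes, str]:
    """Pack records into one flat data blob plus a plain-text index.

    Each index line is "key<TAB>offset<TAB>length", with offset and length
    measured in bytes, so both a person and a program can read it.
    """
    data = io.BytesIO()
    index_lines = []
    for key in sorted(records):
        body = records[key].encode("utf-8")
        offset = data.tell()
        data.write(body)
        data.write(b"\n")  # separator so the data file stays readable too
        index_lines.append(f"{key}\t{offset}\t{len(body)}")
    return data.getvalue(), "\n".join(index_lines) + "\n"


def parse_index(index_text: str) -> dict[str, tuple[int, int]]:
    """Parse the text index back into {key: (offset, length)}."""
    index = {}
    for line in index_text.splitlines():
        key, offset, length = line.split("\t")
        index[key] = (int(offset), int(length))
    return index


def lookup(data: bytes, index: dict[str, tuple[int, int]], key: str) -> str:
    """Fetch one record by slicing the flat data at the indexed position."""
    offset, length = index[key]
    return data[offset:offset + length].decode("utf-8")
```

On disk the same idea would use `seek` and `read` instead of slicing, so a reader only touches the bytes for the record it wants, which matters when the whole file is around 100GB.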

I'm holding out hoping there is literally one person out there who already has the seed for this, maybe even already has a project going.

I would have wikipedia and a dump of some of the most important research papers (from sci-hub?).

If size isn't a limit, a copy of the latest common crawl dataset.

If size is really restricted and it's only one file, then I would seriously consider LLaMa2 70B.

It hallucinates, but in terms of knowledge in about 100GB I don't think you can find anything better.

My literal first thought was that at the very least, someone had already collated SciHub. I would consider that maybe the most essential piece of the database.

Your LLaMa2 suggestion was very thought provoking and meritorious, there might be some path forward with something like that, even if for some neutral knowledge steward AI to be the Interface of the Database.

AIDBI?

Wikipedia is overrated, I would stockpile encyclopedia PDFs instead. There is much richer data to be found in them.

Is the goal in this scenario preservation of human knowledge for rebuilding society, or trying to end the conflict?

Several possible scopes for this database:

1. Foundational - Able to bedrock any other knowledge

2. Comprehensive Civilization - Cover continuation of current society, rebuild info.

3. Comprehensive - Cover continuation of current civilization, rebuilding of civ, Cybersecurity Offensive and Defensive.

4. Extinction Insurance - A savings account for Human Knowledge, a survival toolkit, safe haven, Re-Human Seed File.

It would probably be determined by the process taken, as something like this has never been attempted to my knowledge. As a Software Engineer I would make 4 my Plan A and have a dedicated team for 1 that finishes early, then carry that team over and use its knowledge to aim for 3, in which case 2 might be the result.

In a global war, databases won't be worth anything; save some seeds and gold instead.

Information is the most valuable thing in existence, because existence is information! Databases will be even more valuable in a War.

As valuable as Gold is, there is a precious metal that is not only inherently valuable as the most conductive element in existence, it is also antimicrobial - Silver.

I will never understand the idea of gold as a supposedly crisis-proof currency.

How is hoarding pieces of a shiny but for the most part inherently worthless metal going to help you during a global war?

If you're alluding to an all-out economic and societal collapse: In that case, gold will be just as useless as legal tender. Rather, under such extreme circumstances, we'd probably see a reversion to a barter economy.

> I will never understand the idea of gold as a supposedly crisis-proof currency.

Doesn't gold have some unique material properties that will likely always be in demand? It is malleable and ductile, inert/doesn't oxidize, I believe it is resistant to fatigue, is a good conductor, and lots of others. And as a bonus, it is shiny and pretty. I'm not sure how many of those properties really matter in an active-war Mad Max scenario, but gold has been valued by societies throughout history for good reasons.

If the shit really hits the fan, legal tender will only have some use as maybe toilet paper and kindling.

Those are physical properties rather than economic ones. Gold not oxidising (and metal elements in general being durable) makes it a natural medium for exchanging value. However, that value still is imaginary.

Without being backed by a country / an economic system, and the mere belief that it is valuable, gold has very little value, especially when considering what essentially is a doomsday scenario, in which people probably have more pressing needs than refining and processing gold.

Thank you for your response. Based on your comment, you seem to be operating close to the idea that economic value comes from government dictate. If this is the case, I suppose there's no reconciling our respective positions. (And no problem with that, you're entitled to your opinion)

My belief is that all value is subjective in nature. Economic value is not a property of nature: it's just subjective. A thing has value in a trade not because some authority says it has value, but because everybody values what they receive in a trade more than what they give up.

> economic value comes from government dictate

Quite to the contrary. Economic value originates from the benefit a product or service provides, nothing else.

Currency (particularly if of the fiat money variety) on the other hand, has no inherent value: You can't eat it, you can't use it to build something. Its value depends entirely on the trust people have in the issuing authority.

> inherently worthless

I think you are misusing this phrase. Gold has the second best electrical conductance, is virtually indestructible, highly biodegradable, a superb heat conductor, and almost impervious to the effects of water, air, and oxygen.

This is why it is heavily used in the aerospace industry and is vital to the exploration of Space.

I'll emphasize a prior point for clarity. Gold Can’t Be Destroyed, only Dissolved.

Those properties are not what people commonly think of when they attribute value to gold, though.

In particular, the idea that gold is a crisis-proof currency has nothing to do with these properties, but rather the notion that because its supply is limited (in contrast to fiat money) and it can't be reproduced or counterfeited, it will be accepted as a medium of exchange even in the most dire economic scenarios such as a complete societal collapse.

Perhaps paradoxically, though, in such a scenario gold has particularly little - if any - remaining economic value.

Gold is not "highly biodegradable". Are you confusing that with biocompatible?

No, I did not mean that, although it is both.

References:

(Direct quote) How Gold Is Used In Aerospace: Gold Plating In Satellites https://www.valencesurfacetech.com/the-news/gold-plating-in-...

Biodegradable Gold Nanoclusters with Improved Excretion Due to pH-Triggered Hydrophobic-to-Hydrophilic Transition https://pubs.acs.org/doi/10.1021/jacs.9b13813

Gold nanoparticles: Synthesis properties and applications https://www.sciencedirect.com/science/article/pii/S101836472...

The innovation in the second link is an innovation precisely because gold is not biodegradable.

compressed copy of wikipedia

Of course the first answer, Wikipedia! I have the link!

(Offline Webpage Friendly Kiwix format) 102.62 GB uncompressed ZIM file format https://download.kiwix.org/zim/wikipedia_en_all_maxi.zim

Two problems with Wikipedia: 1. It is absolutely full of unnecessary knowledge. 2. It contains little knowledge of how to do anything.

We'd be in big trouble! With all the celebrity bios and corporate history, we could probably halve that to get down to the essentials. That's 50GB left!

Wikipedia is full of falsehoods