by Thorentis 1574 days ago
The Internet Archive has a huge noise-to-signal ratio, very much in favour of noise. I admire the effort and regularly make use of the quality archives. However, I wonder if, much like Bitcoin, tremendous energy and amounts of resources are being put towards very little of value.
10 comments

I disagree, the unfiltered high noise is what makes it valuable. Curation is a bias.

If someone wants to dive into any topic in the archive 30 years from now they will have access to everything, not access to what some of us deem 'worthy' of curating.

I agree that it makes it harder to find things but I also see the value of IA as a time capsule.

Yes, curation is very valuable, but it needs to be a layer on top of an uncurated source.

I enjoy using Open Library to re-read obscure middle-grade books from the 1950s-1990s, and there are some obscure DOS games I want to revisit. It's hard to find what I want sometimes, but only having access to curated lists would change it from "hard" to "impossible" in many cases.

Tools to separate signal from noise will also get better in the future. You can imagine that in 100 years' time, a super-duper AI search engine will perform far better than whatever category some human assigned to stuff today.
Storing data is cheap and gets cheaper all the time. This isn't a great comparison, but the Internet Archive's 2019 revenue is listed as $36.7 million on Wikipedia (https://en.m.wikipedia.org/wiki/Internet_Archive).

Hard to compare Bitcoin directly, but its market cap was around $1 billion in 2013 and cleared $1 trillion for the first time a little over a year ago.

I get that this article is about people using their personal computers to help archive things, but I don't think the Internet Archive is ever going to use resources even remotely as aggressively as cryptocurrencies do, unless they somehow turn all their archiving into cryptocurrency.

Value is really hard to predict, but as someone who does a lot of research in archives, there is no such thing as too much information, especially if you want the views of several parties or organizations. In anthropology and history research, this work (archiving) can be of tremendous value.

Usually it’s hard to say if it’s valuable now; only time can tell.

Just to point this out: on a technical level, the Internet Archive has very (!) little overhead.

Crawled data is de-duplicated at the request level, and response payloads can be individually gzipped in addition to per-archive-file compression. [1]

[1] https://www.iso.org/standard/68004.html
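A rough sketch of those two ideas, to make them concrete. This is not the Archive's crawler code and not a real WARC writer: the record layout and the `archive_responses` helper are made up for illustration. It assumes de-duplication works by comparing a digest of each response payload, and it gzips every record on its own so that a single record can be read back without unpacking the whole file.

```python
import gzip
import hashlib


def archive_responses(responses):
    """Hypothetical helper: turn (url, payload) pairs into gzipped records.

    A payload that was already stored is replaced by a short "revisit"
    stub that points at the first capture with the same digest.
    """
    first_seen = {}  # payload digest -> URL of the first capture
    records = []
    for url, payload in responses:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in first_seen:
            # Duplicate payload: keep only a reference, not the bytes.
            body = (
                f"revisit {url} refers-to {first_seen[digest]} "
                f"sha1:{digest}\n"
            ).encode()
        else:
            first_seen[digest] = url
            body = f"response {url} sha1:{digest}\n".encode() + payload
        # Each record is compressed individually.
        records.append(gzip.compress(body))
    return records


if __name__ == "__main__":
    page = b"<html>" + b"hello " * 200 + b"</html>"
    recs = archive_responses(
        [("http://a.example/", page), ("http://b.example/", page)]
    )
    for rec in recs:
        print(len(gzip.decompress(rec)), "bytes uncompressed")
```

The second capture costs only the size of the stub, however large the duplicated page is. The example URLs are placeholders.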

> tremendous energy and amounts of resources are being put towards very little of value.

I doubt it takes tremendous energy or resources. What percentage of the overall internet's energy and resources is used by IA? An insignificant, minuscule amount.

The problem with IA is that it is constantly pressured by institutions, corporations, etc. to remove content.

I think the real problem is a bit deeper: Unorganized raw data itself is of very low value, but it becomes much more valuable when humans process, categorize, and interpret it via a higher-level system of reason. We're doing a lot of the former but not the latter: we have so much data but have no idea what they all mean as a whole.

Libraries aren't just "a bunch of books piled up on shelves"; they're a historical invention built and perfected over centuries, in which books are extensively coded and catalogued via a complex hierarchical system. We are dealing with far more data than in the past (not just books but posts and comments from all over the world, as well as new kinds of media such as images and videos), and we also have conceptual and technological inventions that previous librarians didn't have access to (hyperlinks, databases, graph theory, machine learning, etc.), so the current state of data management begs for a major overhaul. (For example, the best we currently have for querying and searching massive data is Google, and it is incredibly primitive! And even then we lament that its quality has decreased in favor of SEO-maximizing content.) So much raw data is created every day, and we seem to fail to understand and interpret almost all of it. I see this as one of the major historical crises we face today. Instead of just storing data, we must find radical new methodologies and tools to search, filter, and explore it, and this poses both a philosophical problem (of semiotics, linguistics, and hermeneutics) and a technological one.

Can you provide some details on this? I'm curious how noise and signal are defined and measured in this case.
I disagree with the OP. This is historical data and includes all kinds of interesting content. Even if it is severely uninteresting today, it may still be really valuable 40 years from now as part of research into colloquial language, design, trends, the influence of events, etc.

It's the same reason why notes taken by random people 250 years ago are really valuable to historians today, even if it's just a to-do list.

I would argue that archive.org and saving the legacy of the internet are a far more important use of energy than making up imaginary digital currency pyramid schemes.
I've been having fun with this post all day, but now I kind of need to know: Can you give examples of noise on the Archive?
Unlike the Archive, the "value" of Bitcoin can be measured: Today's market cap of BTC is $839.5B
>Today's market cap of BTC is $839.5B

Or zero... depends on who wants to exchange it for real stuff.

Well, if you happen to have some bitcoins that you are willing to sell to me for less than their "market value" today, then please, get in touch with me...

The same goes for any other money or not-money out there... If anyone has gold, silver, or diamonds that they want to get rid of for less than the market value, then again, please get in touch with me...

If you think you can compare the trust people have in gold with the trust they have in bitcoin, you're in a massive bubble.