by rightbyte 2612 days ago
Is text non-spam growing exponentially? I have a hard time believing so.
2 comments

This of course depends on what you mean by 'information'. Let's say we have the data points

ABCDEFGHIJKLMNOP

But depending on the URL you follow to get there, you can get a page containing only some of the elements:

index.html?ACD

or

index.html?AP

or

index.html?GI

Each combination returns a page that could be weighted differently by an algorithm and that represents valid informational return data. To a person looking for the information set DE in one place, this is a valid web page. Moreover, you can abstract the URL query variable away to www.webpage.com/DE. You can quickly run into a combinatorial explosion where even attempting to figure out whether a small portion of the returns differ would consume most of the energy in the visible universe.
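To put a number on the explosion, here is a minimal sketch that counts the pages for the 16 data points above. It assumes order does not matter and that every non-empty subset is a distinct page; the helper names are made up for illustration.

```python
from itertools import combinations

DATA_POINTS = "ABCDEFGHIJKLMNOP"  # the 16 data points from the example


def count_distinct_pages(n: int) -> int:
    """Count the non-empty subsets of n data points (order ignored)."""
    return 2 ** n - 1


def query_urls(points: str, size: int):
    """Yield hypothetical URLs like index.html?AB for every subset of one size."""
    for subset in combinations(points, size):
        yield "index.html?" + "".join(subset)


if __name__ == "__main__":
    n = len(DATA_POINTS)
    print(count_distinct_pages(n))           # 65535
    print(next(query_urls(DATA_POINTS, 2)))  # index.html?AB
    # Every extra data point doubles the count, so a few hundred
    # data points already give a number with dozens of digits.
    print(count_distinct_pages(300))
```

Sixteen data points is still crawlable, but the count doubles with each added data point, and allowing different orderings makes it grow faster still.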

True. A crawler needs to differentiate generated content from "real" content somehow.

E.g. a service like www.thenumberinsanskrit.com/?q=1, which returns the queried number in Sanskrit, should not be indexed (except for the entry page), while www.news.com/?article=major-jones-in-scandal-20190103 needs to be indexed.

Usually interesting pages are indexed on the site or linked somewhere on it, though.
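A rough sketch of that heuristic, assuming the crawler indexes only pages reachable by following links from an entry page. The link graph below is a hypothetical in-memory stand-in for real fetching and parsing, not how any actual search engine works.

```python
from collections import deque

# Hypothetical link graph: URL -> URLs linked from that page.
SITE_LINKS = {
    "www.news.com/": ["www.news.com/?article=major-jones-in-scandal-20190103"],
    "www.news.com/?article=major-jones-in-scandal-20190103": ["www.news.com/"],
    "www.thenumberinsanskrit.com/": [],  # entry page links to no query results
}


def reachable_pages(entry: str, links: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk over links, returning every page reached from entry."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


if __name__ == "__main__":
    print(sorted(reachable_pages("www.news.com/", SITE_LINKS)))
    print(sorted(reachable_pages("www.thenumberinsanskrit.com/", SITE_LINKS)))
```

With this graph the news article gets indexed because the front page links to it, while www.thenumberinsanskrit.com/?q=1 never shows up because nothing links to it. The heuristic only works as long as the links themselves are not generated.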

> A crawler needs to differentiate generated content from "real" content somehow.

"Somehow", aka using computing power and storing results, but that still turns into an explosion of computing time and data storage. I mean, what is the difference between the example I listed and Facebook's front page? They are both 'real' content in a generated format.

And a converse argument for your Sanskrit example: what if I have the Sanskrit number and don't know what it is? I put it into Google and the site tells me it is the number one.

> linked somewhere on it

And those links can all be generated by algorithms.

Anyway, back to your original statement. There is no 'real' content. Only data exists. Most content systems used on the internet allow this data to be combined and displayed in a multitude of different ways depending on the call method and the attributes of the viewer. Many times these combinations of data can present novel value to the user. And with the future only bringing us more automated data collection and presentation methods, search engines have lost this battle.

> Is text non-spam growing exponentially? I have a hard time believing so.

Yes. The number of content-creating humans on this planet with access to the internet is still growing exponentially. Eventually it will level off, but for now, the Internet is growing faster than Google can index.