Hacker News
by motoko 6124 days ago
What is the tell that distinguishes a spam site from an authority?
3 comments

In general terms:

Spam: thin site, lacks original content, few pages indexed, crawled infrequently, ranks for a small selection of keywords, weak backlink profile

Authority: many pages indexed, thick site, loads of original content, crawled frequently, ranks for a variety of keywords, diversified backlink profile

PageRank used to be a strong indicator of this, but currently a better indicator is how often your site is crawled. That's why one authority link can do worlds more for you than hundreds (or thousands) of spammy/low quality links.
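
To make the checklist above concrete, here is a rough sketch that counts how many of the listed signals look authority-like. The field names and every threshold are made up for illustration; nothing here reflects how any search engine actually scores sites.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    # All fields are hypothetical stand-ins for the signals listed above.
    pages_indexed: int             # "few" vs. "many pages indexed"
    original_content_ratio: float  # 0..1, share of content that is original
    crawls_per_month: float        # how often the site is crawled
    ranking_keywords: int          # number of keywords the site ranks for
    referring_domains: int         # rough proxy for backlink diversity


def authority_score(s: SiteSignals) -> int:
    """Return how many of the five signals look authority-like (0 to 5).

    Every cutoff below is an arbitrary assumption, not a known value.
    """
    checks = [
        s.pages_indexed >= 100,
        s.original_content_ratio >= 0.5,
        s.crawls_per_month >= 30,
        s.ranking_keywords >= 50,
        s.referring_domains >= 20,
    ]
    return sum(checks)


thin_site = SiteSignals(5, 0.1, 1, 3, 2)
big_site = SiteSignals(5000, 0.9, 300, 800, 400)
print(authority_score(thin_site), authority_score(big_site))
```

A simple count like this treats every signal as equally important, which is almost certainly wrong in practice; it is only meant to show the shape of the heuristic.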

I guess it's all about how you define "thin," "few," "small," and "weak." A page on how to rid your home of carpenter ants might match all of these descriptions, yet still be the most "authoritative" page on the web for that small, niche subject.

If it is authoritative, then you would expect it to (eventually) get linked to, no?

Perhaps, but not necessarily that much. How many people would put a link to a page on how to build a dog house on their web page? Anybody who would probably also links to other dog house things (buy a book on Amazon, keep it smelling fresh with this goo from WalMart, etc.), diminishing your algorithmically determined authority.

Not everybody blogs about everything going on in his/her life.

One definition that errs on the side of marking spam sites as authorities: a spam site is one that exists solely to sell links to other sites.

I would expect that these true spam sites exist as a link network somewhat separate from the rest of the web. You could start manually defining filters based on which sites are actively selling links on the Digital Point forums.
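
A minimal sketch of that idea, assuming you already have a link graph and a hand-collected seed list of known link sellers. The rule used here (flag a site once at least half of its outbound links point into the flagged set) and the 0.5 threshold are my own assumptions, not an established method.

```python
def flag_link_network(links, seeds, threshold=0.5):
    """Grow a set of suspected spam sites outward from known link sellers.

    links:     dict mapping each site to the list of sites it links to
    seeds:     sites manually identified as selling links
    threshold: hypothetical cutoff; a site is flagged when at least this
               fraction of its outbound links point at flagged sites
    """
    flagged = set(seeds)
    changed = True
    while changed:  # repeat until no new site gets flagged
        changed = False
        for site, outlinks in links.items():
            if site in flagged or not outlinks:
                continue
            bad = sum(1 for target in outlinks if target in flagged)
            if bad / len(outlinks) >= threshold:
                flagged.add(site)
                changed = True
    return flagged


# Toy graph with made-up site names.
links = {
    "seller1": ["spamA", "spamB"],
    "spamA": ["seller1", "spamB"],
    "spamB": ["seller1"],
    "blog": ["news", "spamA", "wiki"],
    "news": ["wiki"],
    "wiki": [],
}
print(sorted(flag_link_network(links, {"seller1"})))
```

If the spam network really is mostly separate from the rest of the web, a site like "blog" that links into it only occasionally stays unflagged, while the tightly interlinked sites get pulled in.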

We may have actually had this conversation in the real world, come to think of it.

That's easily a million-dollar question: figure out how Google could flick a switch and clean their index without noticeable false positives, and I think they would hand you a check :)

* I would assume that the problem parallels spam email. I think web pages offer more clues to what the page is about than emails do, though.

I would argue that it is orders of magnitude more complex.

Email spam passes a single toll gate on the way to each user. There you have an address book with contacts and previous conversations and a whole pile of data about previous 'known good' emails in the inbox.

Spam filtering has gotten pretty good: a lot better than Google is at filtering out 'bad' search results.