by mtbcoder 4172 days ago
Regarding the spam sites: your RSS feed publishes your full articles. More than likely, the scraper sites are pulling directly from that feed, republishing quickly, and getting Googlebot to see the content on their site before it sees it on yours (thus receiving attribution). I would suggest:

1) Summaries only in RSS feeds.
2) Throttle the RSS feed back by several hours so that your latest article is not listed immediately.
3) Upon publishing, immediately link to the article via all of your social media outlets.
4) When internally linking within articles, use full URL paths and not relative. (If the spam sites are directly pulling your content and not cleaning up, you may be able to get a link back to your site from the scraped content.)
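
For point 4, here is a rough sketch of how relative links could be rewritten to full URLs before an article goes out. It is only an illustration: the function name `absolutize_links` and the example.com URLs are made up, and a regex is a shortcut that assumes reasonably tidy, quoted `href`/`src` attributes rather than arbitrary HTML.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." with single or double quotes.
# Assumption: attributes are quoted; unquoted values are left untouched.
_LINK_ATTR = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def absolutize_links(html: str, page_url: str) -> str:
    """Resolve every href/src in `html` against `page_url`.

    URLs that are already absolute (or use a scheme such as mailto:)
    come back unchanged from urljoin.
    """
    def fix(match: re.Match) -> str:
        quote = match.group("q")
        full = urljoin(page_url, match.group("url"))
        return f'{match.group("attr")}={quote}{full}{quote}'

    return _LINK_ATTR.sub(fix, html)


if __name__ == "__main__":
    # Hypothetical article URL and body, for illustration only.
    body = '<p>See <a href="/about/">about</a> and <a href="../older/">older</a>.</p>'
    print(absolutize_links(body, "https://example.com/blog/post/"))
```

If a scraper copies the markup verbatim, the links in its copy still point at the original site instead of breaking or resolving against the scraper's domain.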

When publishing, timing is everything. Just my $0.02 based on my own experiences dealing with spam sites.

On a side note, even though we are in the age of HTML5, I would still suggest sticking with one H1 tag per page, if possible.


This sucks. I'm not saying it's not the answer, but the fact that you have to castrate your feed because spam sites can actually get "SEO credit" for your content just sucks. I always loved RSS feeds that published the full text, because I could read whole articles without having to click through.

Semantic web could fix this a little by making it easier to scrape with the <article> tag, but publishing content is exactly what RSS was meant to do.

I wish Google would (if it's even possible) find a better way to fix this. It's the same kind of problem as the argument against single-page apps: "they can't be indexed" or "SEO, man." Discoverability shouldn't be holding back progress (in an ideal world, I know). Rather, indexing should adapt to new technology so that we can make a better web that's still discoverable by users.

I agree, but with sites that do not have much authority (aka PageRank), it's difficult to determine attribution when scraped content is coming online just as quickly as an original post. Googlebot will generally hit a site several times a day, but if it's hitting the spammer's site first or if the spammer site has more authority, it's a long uphill climb to get things turned around.

I should also point out that this is just one thing to consider amongst the other points already made by others.

I must admit, I'm not desperate enough for new readers to hamstring my own publishing to my existing readers - so summary RSS feed is out for me.

I'd consider delaying RSS publication, but that's actually very awkward as it's an Octopress website (i.e. created and pushed as a set of static pages, including things like the RSS feed).
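
One possible workaround for a static site, sketched under several assumptions: post-process the generated Atom feed after the build and drop any entry newer than some number of hours. The function name `delay_feed` is made up, and the sketch assumes each entry carries an `<updated>` timestamp with a UTC offset (or a trailing `Z`), which is what Atom requires. This is not something Octopress provides out of the box.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Keep Atom as the default namespace when serializing the feed again.
ET.register_namespace("", ATOM_NS)


def delay_feed(atom_xml: str, hours: float, now: datetime) -> str:
    """Return the feed with entries newer than `hours` removed.

    `now` must be timezone-aware so it can be compared with the
    offset-carrying timestamps in the feed.
    """
    root = ET.fromstring(atom_xml)
    cutoff = now - timedelta(hours=hours)
    # findall returns a list, so removing entries while looping is safe.
    for entry in root.findall(f"{{{ATOM_NS}}}entry"):
        stamp = entry.findtext(f"{{{ATOM_NS}}}updated")
        if stamp is None:
            continue  # no timestamp: leave the entry alone
        # Older Python versions reject a trailing "Z" in fromisoformat.
        when = datetime.fromisoformat(stamp.strip().replace("Z", "+00:00"))
        if when > cutoff:
            root.remove(entry)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Tiny hand-written feed, for illustration only.
    feed = (
        '<feed xmlns="http://www.w3.org/2005/Atom">'
        "<entry><title>old-post</title><updated>2013-01-01T00:00:00Z</updated></entry>"
        "<entry><title>fresh-post</title><updated>2013-01-01T11:00:00+00:00</updated></entry>"
        "</feed>"
    )
    print(delay_feed(feed, 6, datetime(2013, 1, 1, 12, 0, tzinfo=timezone.utc)))
```

The awkward part remains: a held-back entry only shows up once the script runs again against the full feed, so this would need a scheduled job (cron or similar) that re-filters and re-uploads the feed file.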