by marginalia_nu 1611 days ago
That would be a massive loss, though. A lot of content isn't in HTML5, and a lot of that pre-HTML5 content is precious and valuable.

Google has sadly already tossed a lot of that by the wayside, since it often isn't served with HTTPS. I think something like 80% of the sites my crawler is aware of serve pages over plain HTTP.

In general, attempts at shaping the web through search engine indexing requirements seem to mostly serve to filter out content made by humans and select for search engine marketing.

2 comments

Not so sure older content (like the stuff I wrote in the late 90s to mid 00s) would be negatively impacted, so long as search providers pay careful attention to the <!DOCTYPE> declaration (or lack thereof). I wouldn't characterize holding people to at least a bare minimum of standards (e.g., title in the head and nowhere else, which has been the rule since at least HTML 2.0 in 1994) as "punishment", any more than dinging them for unclosed parens and other typos. Language is how we communicate understanding, and markup is how we frame presentations on the web (mostly). People need to be prepared for the consequences of making it up as they go along rather than educating themselves on the standard (whether spelling, grammar or markup language).
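For what it's worth, here is a rough sketch of the kind of check described above: look at the doctype (or its absence) to guess a page's era, and flag any title element outside the head. It uses only Python's standard `html.parser`. The names `MarkupProbe` and `classify` and the three era labels are made up for illustration, and nothing here reflects how any real search engine scores pages.

```python
from html.parser import HTMLParser


class MarkupProbe(HTMLParser):
    """Records the doctype declaration (if any) and where <title> tags appear."""

    def __init__(self):
        super().__init__()
        self.doctype = None        # raw declaration text, e.g. "DOCTYPE html"
        self.in_head = False
        self.titles_in_head = 0
        self.titles_elsewhere = 0

    def handle_decl(self, decl):
        # Called for <!...> declarations; keep only the doctype.
        if decl.lower().startswith("doctype"):
            self.doctype = decl

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased.
        if tag == "head":
            self.in_head = True
        elif tag == "title":
            if self.in_head:
                self.titles_in_head += 1
            else:
                self.titles_elsewhere += 1

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def classify(html):
    """Return (era, title_ok) for a page; both are crude heuristics."""
    probe = MarkupProbe()
    probe.feed(html)
    probe.close()
    if probe.doctype is None:
        era = "no-doctype"          # typical of very old or hand-rolled pages
    elif probe.doctype.strip().lower() == "doctype html":
        era = "html5"
    else:
        era = "legacy-doctype"      # e.g. HTML 4.01 or XHTML 1.0 declarations
    return era, probe.titles_elsewhere == 0
```

A page with `<!DOCTYPE html>` and its title in the head comes back as `("html5", True)`; a page with no doctype and a title in the body comes back as `("no-doctype", False)`. One caveat: an inline SVG `<title>` in the body would also be counted as misplaced, so a real check would need to be smarter than this.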
That really doesn't seem to be what I'm seeing, having built a search engine specialized in this type of content and finding almost nothing but gems in the refuse.

If anything, it seems like the single best predictor of whether a website is a content mill is strict adherence to modern web standards and other "google rules".

I think it'd be a pretty good idea to let in historical stuff on grace - and just start penalizing new content. Google absolutely has the tools to do this the right way and the Internet Archive could allow most other folks to accomplish the same thing.

Enabling HTTPS is easy on most platforms. Folks that have rolled their own platform or got unlucky and are using a CMS that fell out of favor do tend to get screwed over by this - but I think it's fair to de-prioritize content that fails to adhere to good practices. The HTTP vs HTTPS debate in particular can be a real security concern - with tags it's more about paying down the tech debt in our browser technology.