Hacker News
by marginalia_nu 1611 days ago
As a search engine developer I totally get why. HTML in the wild is not well behaved in the slightest. People use title and heading tags in all manner of weird ways. I've seen <title>-tags in the <body>-tag used as headings. I've seen documents where every line was an <h1>-tag.

You kinda need to make the most of what you're given.
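
To make those two cases concrete, here is a rough sketch of how a crawler might flag them using Python's standard html.parser. The class name HeadingAudit and the function audit are made up for illustration, not taken from any real search engine, and a real crawler would need far more tolerance for broken input.

```python
from html.parser import HTMLParser


class HeadingAudit(HTMLParser):
    """Hypothetical helper: counts <h1> tags and <title> tags seen after <body> opens."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.titles_in_body = 0
        self.h1_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "body":
            self.in_body = True
        elif tag == "title" and self.in_body:
            self.titles_in_body += 1
        elif tag == "h1":
            self.h1_count += 1


def audit(html):
    """Return simple counts a ranking heuristic could look at."""
    parser = HeadingAudit()
    parser.feed(html)
    parser.close()
    return {"titles_in_body": parser.titles_in_body, "h1_count": parser.h1_count}
```

What to do with the counts is the hard part: a page with dozens of <h1> tags can still be worth indexing, so they are probably better used to discount the heading signal than to discard the page.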

HTML5 has been around for long enough that we should be able to punish sites that use completely bonkers markup at this point, right? Since Google effectively has historical archives of the internet, they could pretty trivially grandfather in legitimately old content (things they tracked before some date) and just start down-ranking sites that continue to misbehave with markup but skate by with browsers running in compatibility mode. Something like abusing <h1> tags is legal, if obnoxious, HTML and so it shouldn't really fall under this... but it's been long enough that we can start punishing completely incorrect syntax, right?
That would be a massive loss, though. A lot of content isn't in HTML5, and a lot of that pre-HTML5 content is precious and valuable.

Google has sadly already tossed a lot of that by the wayside, since it often isn't served with HTTPS. I think something like 80% of the sites my crawler is aware of serve pages over plain HTTP.

In general, attempts at shaping the web through search engine indexing requirements seem to mostly serve to filter out content made by humans and select for search engine marketing.

Not so sure older content (like the stuff I wrote in the late 90s to mid 00s) would be negatively impacted, so long as search providers pay careful attention to the <!DOCTYPE> tag (or lack thereof). I wouldn't characterize holding people to at least a bare minimum of standards (e.g., title in the head and nowhere else, which has been the rule since at least HTML 2.0 in 1994) as "punishment", any more than dinging them for unclosed parens and other typos. Language is how we communicate understanding, and markup is how we frame presentations on the web (mostly). People need to be prepared for the consequences of making it up as they go along rather than educating themselves on the standard (whether spelling, grammar or markup language).
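
As a rough illustration of what paying attention to the doctype could look like, here is a minimal sketch using Python's standard html.parser. The names DoctypeSniffer and doctype_kind are hypothetical, and the three-way split (html5, legacy, none) is a simplification: it treats only the bare "<!DOCTYPE html>" form as HTML5 and lumps every other declaration together.

```python
from html.parser import HTMLParser


class DoctypeSniffer(HTMLParser):
    """Hypothetical helper: records which kind of doctype declaration a page has."""

    def __init__(self):
        super().__init__()
        self.kind = "none"

    def handle_decl(self, decl):
        # For "<!DOCTYPE html>" the parser passes the text "DOCTYPE html".
        text = " ".join(decl.split()).lower()
        if text == "doctype html":
            self.kind = "html5"
        elif text.startswith("doctype"):
            self.kind = "legacy"


def doctype_kind(html):
    """Classify a page as 'html5', 'legacy', or 'none' based on its doctype."""
    sniffer = DoctypeSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sniffer.kind
```

A ranker could then apply stricter markup checks only to pages that declare themselves as HTML5, and leave legacy or undeclared pages alone.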
That really doesn't seem to be what I'm seeing, having built a search engine specialized in this type of content and finding almost nothing but gems in the refuse.

If anything, it seems like the single best predictor of whether a website is a content mill is strict adherence to modern web standards and other "google rules".

I think it'd be pretty good to let in historical stuff on grace - and just start penalizing new content. Google absolutely has the tools to do this the right way, and the Internet Archive could allow most other folks to accomplish the same thing.

Enabling HTTPS is easy on most platforms. Folks that have rolled their own platform or got unlucky and are using a CMS that fell out of favor do tend to get screwed over by this - but I think it's fair to de-prioritize content that fails to adhere to good practices. The HTTP vs HTTPS debate in particular can be a real security concern - with tags it's more about paying down the tech debt in our browser technology.

I really wish browsers would stop shrugging their shoulders at bad markup and display blank pages with errors in the consoles or even visible in the rendered page. It would force devs to clean up their act. But as long as 1 browser vendor doesn't do it, the end users will all just assume the strict browser is broken since there is another browser that does "work".