Hacker News
by sugarfactory 3632 days ago
Extracting machine-understandable meaning from web pages is closely analogous to extracting text from images.

Fortunately, we usually don't need fancy yet hardly accurate algorithms to extract machine-readable text from web pages. Why? Because we agreed to use character codes to represent letters, and most of the time text is encoded using some character code, which makes it unnecessary to OCR pictures of handwritten letters in order to programmatically process text from web pages.

These kinds of programs wouldn't be needed if the same thing had happened for page structure, that is, if HTTP included page semantics.

4 comments

The issue with creating tech for such semantics is whether authors will put in the effort to provide metadata. For example, rel=next/previous has been around forever, but most web pages don't use it because it isn't exposed in browsers or other clients. Other data mentioned in the examples, like title and Open Graph tags, are provided for search engines, Facebook previews, and such.
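To make the kinds of metadata mentioned above concrete, here is a minimal sketch that pulls the title, rel=next/prev links, and Open Graph tags out of a page using only Python's standard library. The names MetadataParser, extract_metadata, and SAMPLE_HTML are made up for illustration, and this is not how Fathom itself works.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects <title>, pagination <link> tags, and Open Graph <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = {}        # rel value -> href
        self.open_graph = {}   # "og:..." property -> content
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link" and attrs.get("href"):
            # rel may hold several space-separated values
            for rel in (attrs.get("rel") or "").lower().split():
                if rel in ("next", "prev", "previous"):
                    self.links[rel] = attrs["href"]
        elif tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.open_graph[prop] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return the metadata found in an HTML string as a plain dict."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "links": parser.links,
        "open_graph": parser.open_graph,
    }


# Hypothetical page used only to exercise the sketch.
SAMPLE_HTML = """<html><head>
<title>Part 2 of 3</title>
<link rel="prev" href="/story?page=1">
<link rel="next" href="/story?page=3">
<meta property="og:title" content="A Story">
</head><body><p>...</p></body></html>"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE_HTML))
```

When authors do provide these tags, extraction is a few dozen lines; the hard part is the pages where they are missing.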
But is Fathom supposed to be for annotating your own web site, or analyzing others'? If the former, then I'm truly bewildered, but it's not clear which it is.

As for rel=next/previous, I don't use them because Google makes it clear that it will then treat the whole sequence as one paginated document in the index, contrary to their original semantics. I'd love it if someone could correct me on this.

[0] http://webmasters.stackexchange.com/questions/61573/is-it-ju...

[1] https://www.w3.org/TR/html401/struct/links.html#h-12.3

[2] https://webmasters.googleblog.com/2011/09/pagination-with-re...

I suspect it's against the website's interests. If you provide semantic marks, it makes it easier to crawl your website, extract the actual content, and leave the ads behind.
The <article> tag already makes it pretty easy.
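As a rough illustration of that point, the sketch below keeps only the text inside <article> and drops everything else on the page. It assumes the page actually uses <article> for its main content; ArticleTextParser, extract_article_text, and SAMPLE_PAGE are hypothetical names for this example.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collects text found inside <article>, ignoring script and style."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._article_depth = 0  # > 0 while inside an <article>
        self._skip_depth = 0     # > 0 while inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._article_depth += 1
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self._article_depth:
            self._article_depth -= 1
        elif tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._article_depth and not self._skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_article_text(html):
    """Return the article text, one text node per line."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page: an ad, the article, and a footer.
SAMPLE_PAGE = """<body>
<div class="ad">Buy now!</div>
<article><h1>Headline</h1><p>First paragraph.</p>
<script>track()</script><p>Second paragraph.</p></article>
<footer>Footer links</footer>
</body>"""

if __name__ == "__main__":
    print(extract_article_text(SAMPLE_PAGE))
```

The ad and the footer never make it into the output, which is exactly the incentive problem described above.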
Well, yes, only then everyone realized it takes another good 2-3x of the work, over and above writing the text, to put it into a form that is "machine-understandable" with a whole bunch of metadata, and it requires the people writing the text to be familiar with all of that and to have a good idea of what is "machine-understandable" and what is not, for...

zero gain. There is just no application. Google works perfectly fine. Give it up already.

And just to take this on another tangent... Word won the office space. There are no normal people writing LaTeX for their party invitations. And hell, even the people writing LaTeX don't want to bother with the "machine-understandable" thing, so they went ahead and made it into a proper Turing-complete programming language.

I think the solution is smarter document editors. Auto correct / suggest with context awareness, database integrations, machine learning / basic AI, and so on.

Even simple stuff would be helpful, like asking the writer to clarify which piece of text a written date applies to, or to mark which parts of a long sentence with many commas belong together (which subclauses are interjections, for example?).

HTTP does/can include a lot in the way of semantics. It's entirely down to how the developer decides to write their markup.
You mean HTML, not HTTP, no?

Otherwise I'd be curious to hear how or why including web page layout semantics in HTTP would be useful.

I meant HTTP, but I have to admit it was not clear. What I had in mind when I wrote that was that there needs to be something that forces people to provide the semantics of web pages. By that I mean HTTP is too liberal: it allows any kind of document to be transferred, including documents without semantic annotations. So I thought people would provide page semantics if HTTP required documents to be annotated with semantics, just like HTTP requires Content-Type to be given.
Ah, I see, that makes sense. The protocol level does seem like a good place to enforce it. Though I do have to wonder whether, had that enforcement been put in place, people would have moved away from HTTP and toward a different, looser protocol on top of TCP. Or maybe that wouldn't have been practical. It's interesting to think how early, seemingly low-level decisions about protocol design can have a profound effect on how things develop down the road.