Here's a slightly more detailed description: https://www.quora.com/What-is-the-algorithm-used-by-Diffbot-...

All identification and extraction in our APIs is based on our ML models, which have been trained on hundreds of thousands of annotated examples from web pages. Basically: our back end has reviewed millions of web pages to learn what the various components of a page are, and even what "type" of page it is, and uses that to make judgments about pages submitted via the API.
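
To make the "learn from annotated pages, then judge new ones" idea concrete, here is a toy sketch of page-type classification. It is not Diffbot's actual model or feature set: the naive Bayes technique, the `features` function, the `PageTypeClassifier` class, and the four hand-written training pages are all assumptions chosen only to illustrate supervised learning on labeled pages.

```python
import math
import re
from collections import Counter, defaultdict


def features(html):
    """Crude, hypothetical features: opening tag names plus visible words."""
    tags = ["<%s>" % t.lower() for t in re.findall(r"<\s*([a-zA-Z0-9]+)", html)]
    text = re.sub(r"<[^>]*>", " ", html)
    words = re.findall(r"[a-z]+", text.lower())
    return tags + words


class PageTypeClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        # examples: iterable of (html, label) pairs, i.e. annotated pages
        for html, label in examples:
            self.label_counts[label] += 1
            for f in features(html):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def predict(self, html):
        total = sum(self.label_counts.values())
        feats = [f for f in features(html) if f in self.vocab]  # skip unseen features
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in feats:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up annotated pages standing in for a real training set.
ANNOTATED = [
    ("<article><h1>City council votes</h1><p>By Jane Doe. The council voted on the budget today.</p></article>", "article"),
    ("<article><h1>Storm hits coast</h1><p>By John Roe. Reporters said the storm moved north.</p></article>", "article"),
    ("<div><h1>Blue running shoes</h1><span>price $59</span><button>add to cart</button></div>", "product"),
    ("<div><h1>Steel water bottle</h1><span>price $19</span><button>add to cart</button></div>", "product"),
]

if __name__ == "__main__":
    clf = PageTypeClassifier()
    clf.train(ANNOTATED)
    new_page = "<div><h1>Red jacket</h1><span>price $80</span><button>add to cart</button></div>"
    print(clf.predict(new_page))
```

A production system would presumably use far richer signals (layout, rendering, visual position) and far more data; the point here is only the shape of the workflow: annotated examples in, a judgment about an unseen page out.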