


	by johndavi 3478 days ago
	Here's a slightly more detailed description: https://www.quora.com/What-is-the-algorithm-used-by-Diffbot-...

	All identification and extraction in our APIs is based on our ML models, which have been trained on hundreds of thousands of data-point examples from annotated web pages. Basically: our back end has reviewed millions of web pages to learn what the various components of a page are -- and even what "type" of page it is -- and uses that to make judgments on pages submitted via the API.
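
	To make the "learn from annotated pages, then judge new ones" idea concrete, here is a toy sketch of page-type classification. It is not Diffbot's actual method: the naive Bayes model, the token features, the labels, and the names `TRAINING`, `train`, and `classify` are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated examples: (tokens observed on a page, page-type label).
# Real systems would use far richer features; these are invented for the sketch.
TRAINING = [
    (["h1", "byline", "p", "p", "p", "time"], "article"),
    (["h1", "p", "p", "byline", "p"], "article"),
    (["h1", "price", "img", "add-to-cart", "rating"], "product"),
    (["img", "price", "add-to-cart", "h1", "sku"], "product"),
]


def train(examples):
    """Count labels and per-label tokens from the annotated examples."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab


def classify(tokens, model):
    """Return the most likely label using naive Bayes with add-one smoothing."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior for this page type
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore tokens never seen in training
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify(["h1", "img", "price", "rating"], model))
```

	The point is only the shape of the pipeline: labeled pages go in, counts come out, and a new page gets the label whose training examples it most resembles.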