	by johndavi 3472 days ago
	We rely exclusively on ML for our core product at Diffbot: automatic data extraction from web pages (articles, products, images, discussion threads, with more in the pipeline), cross-site data normalization, etc. It's interesting and challenging work, and it's a definite point of pride for us to be a profitable AI-powered entity.


infinite8s 3472 days ago

Are you guys familiar with the DeepDive work from Christopher Ré's group at Stanford?


LolWolf 3472 days ago

Or his company Lattice for that matter.


johndavi 3471 days ago

Yes to both!


suanmeiguo 3472 days ago

Oh, interesting. I've used Diffbot and never realized it relies on AI. Could you elaborate? I thought it was a simple crawling and parsing task, but I might be naive about this.


johndavi 3471 days ago

Here's a slightly more detailed description: https://www.quora.com/What-is-the-algorithm-used-by-Diffbot-...

All identification and extraction in our APIs is based on our ML models, which have been fed hundreds of thousands of data-point examples from annotated web pages. Basically: our back end has reviewed millions of web pages to learn what various components of a page are -- and even what "type" of page a page is -- and uses that to make judgments on ones submitted via API.
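
To make the "learn from annotated pages, then judge new ones" idea concrete, here is a toy sketch of page-type classification. It is not Diffbot's actual method: the tag-count features, the `FEATURE_TAGS` list, the nearest-centroid rule, and the four hand-written `ANNOTATED` examples are all assumptions chosen only for illustration. A real system would use far richer features and far more data.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
import math


class TagCounter(HTMLParser):
    """Counts opening tags; a crude stand-in for real page features."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


# Hypothetical feature set: how often each of these tags appears.
FEATURE_TAGS = ("p", "img", "button", "li")


def features(html):
    """Turn an HTML string into a fixed-length vector of tag counts."""
    parser = TagCounter()
    parser.feed(html)
    return [parser.counts[tag] for tag in FEATURE_TAGS]


def train(examples):
    """Map each label to the mean feature vector of its annotated pages."""
    grouped = defaultdict(list)
    for html, label in examples:
        grouped[label].append(features(html))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(centroids, html):
    """Pick the label whose centroid is closest to the page's features."""
    vector = features(html)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Made-up annotated examples (html, page type), for illustration only.
ANNOTATED = [
    ("<p>a</p><p>b</p><p>c</p><img>", "article"),
    ("<p>a</p><p>b</p><p>c</p><p>d</p>", "article"),
    ("<img><img><button>Buy</button><li>spec</li>", "product"),
    ("<img><button>Add to cart</button><li>x</li><li>y</li>", "product"),
]

centroids = train(ANNOTATED)
print(classify(centroids, "<p>x</p><p>y</p><p>z</p>"))
print(classify(centroids, "<img><button>Buy now</button>"))
```

The point of the sketch is only the shape of the workflow: labeled pages go in, a model summarizes what each page type tends to look like, and a page submitted later is assigned the closest type.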
