by johndavi 3472 days ago
We rely exclusively on ML for our core product at Diffbot: automatic data extraction from web pages (articles, products, images, discussion threads, more in the pipeline), cross-site data normalization, etc. It's interesting and challenging work, and it's a definite point of pride for us to be a profitable AI-powered entity.
2 comments

Are you guys familiar with the DeepDive work from Christopher Re's group at Stanford?
Or his company Lattice for that matter.
Yes to both!
Oh, interesting. I've used Diffbot and never thought it relied on AI. Could you elaborate? I thought it was a simple crawling and parsing task, but I might be naive about this.
Here's a slightly more detailed description: https://www.quora.com/What-is-the-algorithm-used-by-Diffbot-...

All identification and extraction in our APIs is based on our ML models, which have been fed hundreds of thousands of data-point examples from annotated web pages. Basically: our back end has reviewed millions of web pages to learn what various components of a page are -- and even what "type" of page a page is -- and uses that to make judgments on ones submitted via API.
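
To make the "learn what type of page a page is" idea concrete, here is a toy sketch of classifying page type from annotated examples. It is not Diffbot's actual model or feature set: the naive Bayes classifier is a stand-in technique, and the feature names (has_price, byline, and so on), the labels, and the four training pages are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count labels and per-label features from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in examples:
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocab.update(features)
    return label_counts, feature_counts, vocab

def predict(model, features):
    """Return the most likely label under naive Bayes with add-one smoothing."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(feature_counts[label].values()) + len(vocab)
        for f in features:
            score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical annotated pages, each reduced to a few hand-picked features.
ANNOTATED = [
    (["has_price", "add_to_cart", "img_gallery"], "product"),
    (["has_price", "add_to_cart", "reviews"], "product"),
    (["byline", "publish_date", "long_text"], "article"),
    (["byline", "long_text", "img_gallery"], "article"),
]

model = train(ANNOTATED)
print(predict(model, ["has_price", "reviews"]))    # product
print(predict(model, ["byline", "publish_date"]))  # article
```

A real system would presumably learn from far richer signals (layout, rendering, text) and far more examples, but the shape is the same: annotated pages in, a judgment about unseen pages out.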