by _xnmw 2471 days ago
A productized scraping service - useful! Entire companies are built around scraping certain popular sites - this is a disruptive idea indeed. A growing catalog of up-to-date scrapers for popular websites would put a lot of freelancers out of work. I would invest in this.

However, the ML claim is highly suspect. There is no way that a machine could reliably understand the semantic content of a website - that would require Artificial General Intelligence. If anyone could do that, it would've been Google. But even Google relies on human-edited structured metadata to define the content of sites (e.g. Rich Snippets and the like).

3 comments

It doesn't require Artificial General Intelligence. With enough training data (crowdsourced data and human-edited metadata like JSON-LD or RDF), we can automatically classify the attributes on the page (product name, movie title, creation date, author), structure them, and recognise the type of entity.
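
To make the training-data point concrete, here is a rough sketch of how human-edited JSON-LD could be turned into labeled examples. This is only an illustration under assumptions, not our actual pipeline: the names `JsonLdExtractor` and `labeled_examples` are made up for this comment, it uses only the Python standard library, and it only looks at top-level string fields.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Hypothetical helper: collects parsed <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script tag opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed metadata rather than fail the whole page


def labeled_examples(html):
    """Return (entity_type, attribute, value) triples from top-level string fields."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    triples = []
    for item in parser.items:
        if not isinstance(item, dict):
            continue  # assumption: ignore JSON-LD arrays and graphs in this sketch
        entity_type = item.get("@type")
        for key, value in item.items():
            # Keys starting with "@" are JSON-LD keywords, not page attributes.
            if not key.startswith("@") and isinstance(value, str):
                triples.append((entity_type, key, value))
    return triples
```

Each triple pairs an attribute label (e.g. `name`) with the value that appears on the page, tagged with the entity type (e.g. `Movie`). Pairs like these, collected across many sites, are the kind of supervision a classifier could learn from to label pages that ship no metadata at all.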

Feel free to contact us if you want to invest (hello@dashblock.com), we are currently raising funds ;)

But, what's the value compared to using open-source products like Portia https://portia.readthedocs.io/en/latest/getting-started.html ? Functionally it looks very similar.
I'm sure this comment will go down in history like the Dropbox comment
Fine, I'd rather lose my comment than my invested money.
> A growing catalog of up-to-date scrapers for popular websites would put a lot of freelancers out of work. I would invest in this.

Check out the Apify Store (https://apify.com/store). It's built exactly for that purpose.

(Disclaimer: I'm a co-founder of Apify)

Duplex for web [1] would certainly benefit from this kind of understanding, so I wouldn't be surprised if Google is working on this.

1. https://www.theverge.com/2019/5/7/18531195/google-duplex-web...