HN Mirror

	by _xnmw 2471 days ago
	A productivized scraping service - useful! Entire companies are built around scraping certain popular sites - this is a disruptive idea indeed. A growing catalog of up-to-date scrapers for popular websites would put a lot of freelancers out of work. I would invest in this. However, the ML claim is highly suspect. There is no way that a machine could reliably understand the semantic content of a website - that would require Artificial General Intelligence. If anyone could do that, it would've been Google. But even Google relies on human-edited structured metadata to define the content of sites (e.g. Rich Snippets and the like).

3 comments

HPouillot 2471 days ago

It doesn't require Artificial General Intelligence. With enough training data (crowdsourced data and human-edited metadata like JSON-LD or RDF), we can automatically classify the attributes on the page (product name, movie title, creation date, author), structure them and recognise the type of entity.
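As a rough illustration of how human-edited metadata could serve as training labels, here is a minimal sketch that pulls JSON-LD blocks out of a page and turns them into (entity type, attributes) pairs. It is not Dashblock's actual pipeline; the names `JsonLdExtractor`, `labeled_attributes` and `SAMPLE_PAGE` are hypothetical, the sample page is made up, and only the Python standard library is assumed.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed metadata instead of failing the whole page


def labeled_attributes(html):
    """Return a list of (entity_type, {attribute: value}) pairs from the page's JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    examples = []
    for block in parser.blocks:
        # A JSON-LD script may hold a single object or a list of objects.
        items = block if isinstance(block, list) else [block]
        for item in items:
            if isinstance(item, dict) and "@type" in item:
                # Drop JSON-LD keywords such as @context and @type.
                attrs = {k: v for k, v in item.items() if not k.startswith("@")}
                examples.append((item["@type"], attrs))
    return examples


# Made-up page used only for illustration.
SAMPLE_PAGE = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Movie",
 "name": "Example Film", "dateCreated": "2019-01-01"}
</script></head><body><h1>Example Film</h1></body></html>"""

if __name__ == "__main__":
    print(labeled_attributes(SAMPLE_PAGE))
```

Each pair could then be matched against the visible text of the page (here, the `<h1>`) to get labeled examples for an attribute classifier; that matching step is left out of this sketch.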

Feel free to contact us if you want to invest (hello@dashblock.com), we are currently raising funds ;)

rvnx 2471 days ago

But what's the value compared to using open-source products like Portia (https://portia.readthedocs.io/en/latest/getting-started.html)? Functionally it looks very similar.

r0rshrk 2471 days ago

I'm sure this comment will go down in history like the Dropbox comment

rvnx 2469 days ago

Fine, I'd rather lose my comment than my invested money.

jakubbalada 2471 days ago

> A growing catalog of up-to-date scrapers for popular websites would put a lot of freelancers out of work. I would invest in this.

Check out Apify store (https://apify.com/store). It's built exactly for that purpose.

(Disclaimer: I'm a co-founder of Apify)

enos_feedler 2470 days ago

Duplex for web [1] would certainly benefit from this kind of understanding, so I wouldn't be surprised if Google is working on this.

1. https://www.theverge.com/2019/5/7/18531195/google-duplex-web...