| HN Mirror

	by imgabe 1961 days ago
	Another tip I've found extremely helpful for webscraping: check the <head> for <meta> tags or a <script type="application/ld+json"> tag that might already have the information you want collected neatly in one place. You may be able to save yourself a lot of time and grief.
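A minimal sketch of that tip, using only the Python standard library. The class name `HeadMetadataParser`, the helper `extract_head_metadata`, and the sample page are all made up for illustration; real pages may carry several JSON-LD blocks or malformed JSON, which this version simply skips.

```python
import json
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect <meta> name/property pairs and JSON-LD blocks from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}        # e.g. {"og:title": "..."}
        self.json_ld = []     # one parsed object per JSON-LD <script>
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph uses "property", classic meta tags use "name".
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]
        elif tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD rather than abort the scrape


def extract_head_metadata(html):
    """Return (meta_dict, json_ld_list) for an HTML string."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.meta, parser.json_ld


# Hypothetical page used only to demonstrate the parser.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example Kettlebell">
<meta name="description" content="A 16 kg kettlebell.">
<script type="application/ld+json">
{"@type": "Product", "name": "Example Kettlebell",
 "offers": {"@type": "Offer", "price": "49.99"}}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    meta, json_ld = extract_head_metadata(SAMPLE_HTML)
    print(meta)
    print(json_ld)
```

If the fields you need show up in either result, you can often skip writing CSS or XPath selectors for the page body entirely.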

4 comments

manojlds 1960 days ago

We built a library for extracting this data: https://github.com/indix/web-auto-extractor

JimDabell 1960 days ago

Also, if the site is based on WordPress, the API is often open for read-only access, so you can fetch richer information and you won’t have to parse the full HTML document to get the content in question.
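As a rough sketch of what that can look like: WordPress core exposes posts at `/wp-json/wp/v2/posts`, and each post object carries `title.rendered`, `link`, and `date` fields. The helper names, the `example-scraper/0.1` user agent, and the sample data below are assumptions for illustration, and a given site may have the API disabled or restricted.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen


def posts_endpoint(site_url, per_page=5, page=1):
    """Build the posts URL of the WordPress REST API for a site."""
    base = site_url.rstrip("/") + "/"
    query = urlencode({"per_page": per_page, "page": page})
    return urljoin(base, "wp-json/wp/v2/posts") + "?" + query


def summarize_posts(posts):
    """Keep only a few fields from the decoded API response."""
    return [
        {"title": p["title"]["rendered"], "link": p["link"], "date": p["date"]}
        for p in posts
    ]


def fetch_posts(site_url, per_page=5, timeout=10):
    """Fetch recent posts; raises urllib.error.URLError/HTTPError on failure."""
    request = Request(
        posts_endpoint(site_url, per_page),
        headers={"User-Agent": "example-scraper/0.1"},  # hypothetical UA string
    )
    with urlopen(request, timeout=timeout) as response:
        return summarize_posts(json.load(response))


# Hand-written sample in the shape the API returns, so the parsing
# step can be tried without touching the network.
SAMPLE_RESPONSE = [
    {
        "date": "2020-04-01T10:00:00",
        "link": "https://example.com/hello-world/",
        "title": {"rendered": "Hello world"},
        "content": {"rendered": "<p>Welcome.</p>"},
    }
]

if __name__ == "__main__":
    print(posts_endpoint("https://example.com"))
    print(summarize_posts(SAMPLE_RESPONSE))
```

The `content.rendered` field holds the post body as HTML, so you still get markup, but only for the post itself rather than the whole themed page.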

liquorice 1960 days ago

Unfortunately this data is often inconsistent with the visual representation. For example, webshops often list their product as 'InStock' regardless of the actual stock status. Since products are in stock most of the time, you will not find out about this and thus likely extract wrong data in the future.

This was especially apparent when I tried to get my hands on some weights early in the coronavirus pandemic. All webshops were out of stock, but in about 80% of them the schema markup indicated otherwise.
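One way to guard against this is to cross-check the markup against the text a visitor actually sees. The sketch below is a heuristic only: the phrase list and function names are assumptions, it expects an already-parsed schema.org `Product` dict and the page's visible text as inputs, and real shops will need their own phrases (and languages).

```python
# Phrases that suggest the visible page says "not available".
# Purely illustrative; extend per shop and language.
OUT_OF_STOCK_PHRASES = ("out of stock", "sold out", "currently unavailable")


def schema_says_in_stock(product):
    """True if any offer in a schema.org Product dict claims InStock."""
    offers = product.get("offers", [])
    if isinstance(offers, dict):  # a single offer is often not wrapped in a list
        offers = [offers]
    # Values appear as "InStock", "http://schema.org/InStock", etc.
    return any(
        str(offer.get("availability", "")).endswith("InStock") for offer in offers
    )


def page_says_out_of_stock(visible_text):
    """True if the visible page text contains an out-of-stock phrase."""
    text = visible_text.lower()
    return any(phrase in text for phrase in OUT_OF_STOCK_PHRASES)


def availability_conflict(product, visible_text):
    """Flag pages whose markup says InStock while the page text disagrees."""
    return schema_says_in_stock(product) and page_says_out_of_stock(visible_text)


if __name__ == "__main__":
    product = {
        "@type": "Product",
        "name": "Example Kettlebell",
        "offers": {"availability": "https://schema.org/InStock"},
    }
    print(availability_conflict(product, "Sorry, this item is SOLD OUT."))
    print(availability_conflict(product, "Add to cart"))
```

Flagged pages can then be routed to a selector-based fallback or a manual check instead of being trusted blindly.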

tschiller 1960 days ago

That's definitely the easiest when it's there. In some cases the microdata will instead be embedded in the HTML tags in the body: https://schema.org/docs/gs.html
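For that body-embedded case, here is a deliberately simplified sketch that collects `itemprop` values with the standard-library HTML parser. It flattens everything into one dict, takes the value from `content`, `href`, or `src` when present and from the element text otherwise, and does not handle nested elements of the same tag inside a property. The class name and sample markup are invented for illustration; a dedicated microdata library is a better fit for production use.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collect itemprop values from microdata into one flat dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._open_prop = None  # property whose text is being collected
        self._open_tag = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if not prop or "itemscope" in attrs:
            # No property here, or the property is itself a nested item
            # whose own itemprops are collected individually.
            return
        for attr in ("content", "href", "src"):
            if attrs.get(attr) is not None:
                self.properties[prop] = attrs[attr]
                return
        # Otherwise the value is the element's text content.
        self._open_prop, self._open_tag, self._text = prop, tag, []

    def handle_data(self, data):
        if self._open_prop:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open_prop and tag == self._open_tag:
            self.properties[self._open_prop] = "".join(self._text).strip()
            self._open_prop = None


def extract_microdata(html):
    """Return a flat {itemprop: value} dict for an HTML string."""
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.properties


# Hypothetical product markup used only to demonstrate the parser.
SAMPLE_BODY = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Example Kettlebell</span>
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <meta itemprop="price" content="49.99">
    <link itemprop="availability" href="https://schema.org/InStock">
  </div>
</div>
"""

if __name__ == "__main__":
    print(extract_microdata(SAMPLE_BODY))
```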
