by imgabe 1961 days ago
Another tip I've found extremely helpful for webscraping: check the <head> for <meta> tags or a <script type="application/ld+json"> tag that might already have the information you want collected neatly in one place. You may be able to save yourself a lot of time and grief.
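For anyone who wants to try this, here is a minimal sketch using only the Python standard library. The class name, the function name, and the sample page are all made up for illustration; real pages often have several JSON-LD blocks or malformed JSON, so treat this as a starting point rather than a robust extractor.

```python
import json
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collects <meta> name/property pairs and JSON-LD blocks from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}        # e.g. {"og:title": "..."}
        self.json_ld = []     # one parsed object per JSON-LD <script>
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]
        elif tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_head_metadata(html):
    """Return (meta_dict, list_of_json_ld_objects) for an HTML string."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.meta, parser.json_ld


# Hypothetical sample page, not taken from any real shop.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Kettlebell 16 kg">
<script type="application/ld+json">
{"@type": "Product", "name": "Kettlebell 16 kg",
 "offers": {"availability": "https://schema.org/InStock"}}
</script>
</head><body></body></html>
"""

meta, json_ld = extract_head_metadata(SAMPLE_HTML)
```

With the sample above, `meta` holds the Open Graph title and `json_ld[0]` is the product object, so no body parsing is needed at all.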
4 comments

We built a library for extracting these data - https://github.com/indix/web-auto-extractor
Also, if the site is based on WordPress, the API is often open for read-only access, so you can fetch richer information and you won’t have to parse the full HTML document to get the content in question.
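A rough sketch of what that can look like, assuming the site exposes the standard WordPress REST route `/wp-json/wp/v2/posts` (some sites disable it or use a different permalink setup). The helper names are hypothetical and `fetch_posts` needs network access, so it is defined but not called here.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def wp_posts_url(site, per_page=5):
    """Build the posts endpoint URL, asking only for the fields we need."""
    query = urlencode({"per_page": per_page, "_fields": "id,link,title,content"})
    return urljoin(site.rstrip("/") + "/", "wp-json/wp/v2/posts") + "?" + query


def summarize_posts(payload):
    """Turn the JSON response body into a list of small dicts."""
    posts = json.loads(payload)
    return [
        {"id": p["id"], "link": p["link"], "title": p["title"]["rendered"]}
        for p in posts
    ]


def fetch_posts(site, per_page=5):
    """Fetch and summarize recent posts. Requires network access."""
    with urlopen(wp_posts_url(site, per_page), timeout=10) as response:
        return summarize_posts(response.read().decode("utf-8"))


# Offline demonstration with a made-up response body.
SAMPLE_RESPONSE = (
    '[{"id": 1, "link": "https://example.com/hello", '
    '"title": {"rendered": "Hello"}, "content": {"rendered": "<p>Hi</p>"}}]'
)
posts = summarize_posts(SAMPLE_RESPONSE)
```

The `content.rendered` field in the response carries the post body as HTML, which is usually much easier to clean up than the full themed page.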
Unfortunately this data is often inconsistent with what the page visibly shows. For example, webshops often mark their products as 'InStock' in the markup regardless of the actual stock status. Since products are in stock most of the time, you are unlikely to notice the mismatch while testing, and will likely extract wrong data later.

This was especially apparent when I tried to get my hands on some weights early in the corona pandemic. All webshops were out of stock, but in about 80% of them the schema markup indicated otherwise.
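One cheap sanity check is to compare the markup's availability against the text a visitor actually sees. This is only a sketch: the function name is made up, and the phrase list is an assumption that would need tuning per shop and per language.

```python
# Assumed phrases; real shops word this differently and in other languages.
OUT_OF_STOCK_PHRASES = ("out of stock", "sold out", "currently unavailable")


def availability_conflicts(schema_availability, visible_text):
    """Return True when the markup claims InStock but the page text says otherwise."""
    # Accept both "InStock" and "https://schema.org/InStock".
    value = schema_availability.rstrip("/").rsplit("/", 1)[-1].lower()
    claims_in_stock = value == "instock"
    text = visible_text.lower()
    looks_sold_out = any(phrase in text for phrase in OUT_OF_STOCK_PHRASES)
    return claims_in_stock and looks_sold_out


conflict = availability_conflicts(
    "https://schema.org/InStock", "Sorry, this item is Sold Out"
)
```

When this returns True, it is probably safer to trust the visible text, or at least to flag the record for review.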

That's definitely the easiest when it's there. In some cases the microdata will instead be embedded in the HTML tags in the body: https://schema.org/docs/gs.html
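A minimal sketch of pulling `itemprop` values out of the body with the standard library. It flattens nested items into one dict and assumes reasonably well-formed markup (every non-void tag is closed); the class name and sample snippet are invented for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class MicrodataParser(HTMLParser):
    """Collects itemprop values into a flat dict (nested items are flattened)."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._open = []  # stack of [prop_name, nesting_depth, text_chunks]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in VOID_TAGS:
            for entry in self._open:
                entry[1] += 1  # a child element opened inside a captured element
        prop = attrs.get("itemprop")
        if not prop or "itemscope" in attrs:
            return  # no property, or the value is a nested item rather than text
        if attrs.get("content") is not None:
            self.props[prop] = attrs["content"]
        elif tag in ("a", "link") and attrs.get("href"):
            self.props[prop] = attrs["href"]
        elif tag not in VOID_TAGS:
            self._open.append([prop, 0, []])  # capture text until the end tag

    def handle_data(self, data):
        for entry in self._open:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        if self._open[-1][1] == 0:
            # This end tag closes the innermost captured element.
            prop, _, chunks = self._open.pop()
            self.props[prop] = " ".join("".join(chunks).split())
        for entry in self._open:
            entry[1] -= 1


# Hypothetical product snippet.
SAMPLE_BODY = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Kettlebell <b>16 kg</b></span>
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" content="39.90">39.90 EUR</span>
    <link itemprop="availability" href="https://schema.org/InStock">
  </div>
</div>
"""

microdata = MicrodataParser()
microdata.feed(SAMPLE_BODY)
microdata.close()
props = microdata.props
```

For anything beyond a quick script, a dedicated microdata library is likely a better choice, since the full spec has more value rules (`img src`, `time datetime`, and so on) than this handles.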