| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by simonw 1415 days ago

My shot-scraper CLI tool was designed to support this kind of workflow but with a real headless browser inserted into the mix. Means you can do things like this:

    shot-scraper javascript \
      "https://news.ycombinator.com/from?site=simonwillison.net" "
    Array.from(document.querySelectorAll('.itemlist .athing')).map(el => {
      const title = el.querySelector('a.titlelink').innerText;
      const points = parseInt(el.nextSibling.querySelector('.score').innerText);
      const url = el.querySelector('a.titlelink').href;
      const dt = el.nextSibling.querySelector('.age').title;
      const submitter = el.nextSibling.querySelector('.hnuser').innerText;
      const commentsUrl = el.nextSibling.querySelector('.subtext a:last-child').href;
      const id = commentsUrl.split('?id=')[1];
      const numComments = parseInt(
        Array.from(
          el.nextSibling.querySelectorAll('.subtext a[href^=item]')
        ).slice(-1)[0].innerText.split()[0]
      ) || 0;
      return {id, title, url, dt, points, submitter, commentsUrl, numComments};
    })
    " | jq '. | map(.numComments) | add'

That example scrapes a page on Hacker News by running JavaScript inside headless Chromium, outputs the results as JSON to stdout, then pipes them into jq to add them up. It outputs "1274".

https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...

(Fun side note: I figured out the jq recipe I'm using in this example using GPT-3: https://til.simonwillison.net/gpt3/jq )

1 comments

jamescampbell 1414 days ago

Badass as always Simon. I prefer my custom cloudflare killer chrome headless python code. But this is cool for quick things.

link