by simonw
1415 days ago
My shot-scraper CLI tool was designed to support this kind of workflow but with a real headless browser inserted into the mix. That means you can do things like this:

shot-scraper javascript \
"https://news.ycombinator.com/from?site=simonwillison.net" "
Array.from(document.querySelectorAll('.itemlist .athing')).map(el => {
  const title = el.querySelector('a.titlelink').innerText;
  const points = parseInt(el.nextSibling.querySelector('.score').innerText);
  const url = el.querySelector('a.titlelink').href;
  const dt = el.nextSibling.querySelector('.age').title;
  const submitter = el.nextSibling.querySelector('.hnuser').innerText;
  const commentsUrl = el.nextSibling.querySelector('.subtext a:last-child').href;
  const id = commentsUrl.split('?id=')[1];
  const numComments = parseInt(
    Array.from(
      el.nextSibling.querySelectorAll('.subtext a[href^=item]')
    ).slice(-1)[0].innerText.split()[0]
  ) || 0;
  return {id, title, url, dt, points, submitter, commentsUrl, numComments};
})
" | jq '. | map(.numComments) | add'
That example scrapes a page on Hacker News by running JavaScript inside headless Chromium, outputs the results as JSON to stdout, then pipes them into jq to sum the numComments values. It outputs "1274".

https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...

(Fun side note: I figured out the jq recipe I'm using in this example using GPT-3: https://til.simonwillison.net/gpt3/jq )
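
For anyone who wants to see the comment-count parsing and the jq summing step in isolation, here is a minimal sketch in plain JavaScript (runnable in Node). The sample rows and the helper names parseCommentCount and totalComments are hypothetical, made up for illustration; they are not part of shot-scraper or jq.

```javascript
// Hypothetical sample of the JSON the scraping snippet returns.
// The values are invented for illustration only.
const sampleRows = [
  { id: '1', title: 'First', numComments: 12 },
  { id: '2', title: 'Second', numComments: 0 },
  { id: '3', title: 'Third', numComments: 30 },
];

// Mirrors the `parseInt(...) || 0` idiom in the scraper: parseInt reads the
// leading digits of the link text and yields NaN when there are none
// (for example a "discuss" link), in which case we fall back to 0.
function parseCommentCount(linkText) {
  return parseInt(linkText, 10) || 0;
}

// Rough equivalent of jq's `map(.numComments) | add`.
// Note: for an empty array this returns 0, whereas jq's `add` returns null.
function totalComments(rows) {
  return rows.map(row => row.numComments).reduce((sum, n) => sum + n, 0);
}

console.log(parseCommentCount('12 comments')); // 12
console.log(parseCommentCount('discuss'));     // 0
console.log(totalComments(sampleRows));        // 42
```

Doing the sum in jq rather than inside the page keeps the scraped JSON reusable: the same output can be piped into a different jq filter without touching the JavaScript.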