by shubhamjain 1409 days ago
Unpopular opinion, but Bash/Shell Scripting. Seriously, it's probably the fastest way to get things done. For fetching, use cURL. Want to extract particular markup? Use pup[1]. Want to process CSV? Use csvkit[2]. Or JSON? Use jq[3]. Want to use a DB? Use psql. Once you get the hang of shell scripting, you can create simple scrapers by wiring up these utilities in a matter of minutes.
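
A rough sketch of that kind of wiring, assuming curl, pup and jq are installed. The function names `extract_links` and `lines_to_json` and the `a attr{href}` selector are illustrative choices, not anything from the tools' docs for this particular site:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fetch a page and print every link target, one per line.
# Assumes curl and pup are installed; the selector is illustrative.
extract_links() {
  local url=$1
  curl --silent --fail --location "$url" | pup 'a attr{href}'
}

# Turn newline-separated text on stdin into a JSON array of strings.
# Assumes jq is installed.
lines_to_json() {
  jq --raw-input --slurp 'split("\n") | map(select(length > 0))'
}

# Usage (needs network access, so it is not run here):
#   extract_links 'https://news.ycombinator.com/' | sort -u | lines_to_json
```

Each stage reads stdin and writes stdout, so any step can be swapped (csvkit instead of jq, psql at the end) without touching the others.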

The only thing I wish was present was better support for RegExes. Bash and most unix tools don't support PCRE, which can be severely limiting. Plus, sometimes you want to process text as a whole vs line-by-line.

I would also recommend Python's sh[4] module if Shell scripting isn't your cup of tea. You get the best of both worlds: faster dev work with Bash utils, and a saner syntax.

[1]: https://github.com/ericchiang/pup

[2]: https://csvkit.readthedocs.io/en/latest/

[3]: https://stedolan.github.io/jq/

[4]: https://pypi.org/project/sh/

5 comments

My main qualms with bash as a scripting language are that its syntax is not only kind of bonkers (no judgement, I know it's an old tool) but also just crazily unsafe. I link to a few high-profile things whenever people ask me why my mantra is "the time to switch your script from bash to python is when you want to delete things".

>rm -rf /usr /lib/nvidia-current/xorg/xorg

https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commi...

>rm -rf "$STEAMROOT/"*

https://github.com/valvesoftware/steam-for-linux/issues/3671

It's just too easy to shoot yourself in the foot.
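
A small sketch of both failure modes, assuming bash. `count_args` is a hypothetical helper that stands in for `rm`, so nothing here deletes anything, and `STEAMROOT` is deliberately left unset to mimic the second bug:

```shell
#!/usr/bin/env bash
# Sketch of both failure modes; nothing here deletes anything.

# Print how many separate arguments a command would receive.
count_args() { echo "$#"; }

# 1. The stray space: /usr becomes its own argument.
count_args /usr /lib/nvidia-current/xorg/xorg    # prints 2
count_args /usr/lib/nvidia-current/xorg/xorg     # prints 1

# 2. The unset variable: ${VAR:?} aborts instead of expanding to "".
unset STEAMROOT
if ( : "${STEAMROOT:?STEAMROOT is not set}" ) 2>/dev/null; then
  guard="passed"
else
  guard="blocked"
fi
echo "guard: $guard"
```

In a real script the same guard would sit inside the command itself, e.g. `rm -rf "${STEAMROOT:?}/"*`, so an empty or unset variable stops the script before `rm` runs.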

There are a couple of flags you can use to mitigate the safety risks. `set -u`, for instance, will throw an error if an unbound variable is used. I always start my scripts with

> set -euo pipefail

Here's a detailed explanation of all the switches: https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e82....
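
To make the effect concrete, here is a minimal sketch, assuming bash. It runs the same snippets in child shells with and without the flags; `echo` stands in for `rm -rf`, and `STEAMROOT` is left unset on purpose:

```shell
#!/usr/bin/env bash
# Sketch: what -u and pipefail change. `echo` stands in for `rm -rf`.

unset STEAMROOT

# Without -u, the unset variable expands to "" and the path becomes "/*".
without_u=$(bash -c 'echo "would delete: $STEAMROOT/*"')

# With -u, bash aborts before the command runs and exits non-zero.
if bash -u -c 'echo "would delete: $STEAMROOT/*"' 2>/dev/null; then
  with_u="ran"
else
  with_u="aborted"
fi

# Without pipefail, a pipeline reports only the last command's status.
plain_status=$(bash -c 'false | true; echo $?')
# With pipefail, a failure anywhere in the pipeline is reported.
pipefail_status=$(bash -o pipefail -c 'false | true; echo $?')

echo "$without_u"
echo "with -u: $with_u"
echo "false | true -> $plain_status (plain), $pipefail_status (pipefail)"
```

The `-e` flag then makes the script exit on those non-zero statuses instead of carrying on to the next line.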

I do agree though, it's not the best tool. But combining CLI utilities tends to be fast.

For things like regular expressions, it's useful to know that Python has a "-c" option which can be passed a multi-line string as part of a CLI pipeline. You can do something like this:

    curl 'https://news.ycombinator.com/' | python -c '
    import sys, re, json
    html = sys.stdin.read()
    # [^"]* stops at the closing quote, so each match is a single href value
    r = re.compile("<a href=\"([^\"]*)\"")
    print(json.dumps(r.findall(html), indent=2))
    '
This outputs JSON which you can then pipe to other tools.
This is great. Perl also has one-liners [1] one can use, but I gave up dealing with Perl's obscure syntax. This is much better.

[1]: http://novosial.org/perl/one-liner/

That's great! I didn't know -c supported multiline - I always just crammed it into one line with semicolons.
Yeah I was the same - I only figured out the multi line trick a few days ago https://til.simonwillison.net/aws/boto-command-line
My shot-scraper CLI tool was designed to support this kind of workflow but with a real headless browser inserted into the mix. Means you can do things like this:

    shot-scraper javascript \
      "https://news.ycombinator.com/from?site=simonwillison.net" "
    Array.from(document.querySelectorAll('.itemlist .athing')).map(el => {
      const title = el.querySelector('a.titlelink').innerText;
      const points = parseInt(el.nextSibling.querySelector('.score').innerText);
      const url = el.querySelector('a.titlelink').href;
      const dt = el.nextSibling.querySelector('.age').title;
      const submitter = el.nextSibling.querySelector('.hnuser').innerText;
      const commentsUrl = el.nextSibling.querySelector('.subtext a:last-child').href;
      const id = commentsUrl.split('?id=')[1];
      const numComments = parseInt(
        Array.from(
          el.nextSibling.querySelectorAll('.subtext a[href^=item]')
        ).slice(-1)[0].innerText.split()[0]
      ) || 0;
      return {id, title, url, dt, points, submitter, commentsUrl, numComments};
    })
    " | jq '. | map(.numComments) | add'
That example scrapes a page on Hacker News by running JavaScript inside headless Chromium, outputs the results as JSON to stdout, then pipes them into jq to add them up. It outputs "1274".

https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...

(Fun side note: I figured out the jq recipe I'm using in this example using GPT-3: https://til.simonwillison.net/gpt3/jq )

Badass as always Simon. I prefer my custom cloudflare killer chrome headless python code. But this is cool for quick things.
This comment has the same energy as the comment saying Dropbox could just be replaced by FTP: https://news.ycombinator.com/item?id=9224
Every time that comment is brought up, I can't help but feel that people are deliberately fishing for the comparison.
Also a fan. Usually I generate indexes/urls, and then just wget and scrape the content offline once all is downloaded.
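
A sketch of that download-first workflow, assuming bash and wget. The URL pattern, `gen_urls` and `download_all` are made up for illustration; the real index pages depend on the site:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical URL pattern; replace with the real index pages of the site.
BASE_URL="https://example.com/archive?page="

# Print one URL per line for pages first..last.
gen_urls() {
  local first=$1 last=$2 page
  for ((page = first; page <= last; page++)); do
    printf '%s%d\n' "$BASE_URL" "$page"
  done
}

# Download every URL in a list file into a directory, one second apart,
# skipping files that already exist so the run can be resumed.
download_all() {
  local list=$1 dest=$2
  wget --input-file="$list" --directory-prefix="$dest" --wait=1 --no-clobber
}

# Usage (needs network access, so it is not run here):
#   gen_urls 1 50 > urls.txt
#   download_all urls.txt pages/
#   grep -l 'some pattern' pages/*    # scrape offline, as often as needed
```

Keeping the download and the parsing separate means a broken selector only costs a re-run over local files, not another round of requests.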