HN Mirror

by shubhamjain 1409 days ago

Unpopular opinion, but Bash/Shell Scripting. Seriously, it's probably the fastest way to get things done. For fetching, use cURL. Want to extract particular markup? Use pup[1]. Want to process CSV? Use csvkit[2]. Or JSON? Use jq[3]. Want to use a DB? Use psql. Once you get the hang of shell scripting, you can create simple scrapers by wiring up these utilities in a matter of minutes.

The only thing I wish was present was better support for RegExes. Bash and most unix tools don't support PCRE, which can be severely limiting. Plus, sometimes you want to process text as a whole vs line-by-line.
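To illustrate the "text as a whole" point, here is a minimal Python sketch, not anything from the tools above. It reads the whole input as one string and uses a lazy quantifier (`.*?`) with `re.DOTALL`, neither of which POSIX basic or extended regexes offer. The sample markup and the name `extract_titles` are made up for the example.

```python
import re

# Hypothetical sample input: each record spans several lines.
SAMPLE = """<div class="item">
  <h2>First title</h2>
</div>
<div class="item">
  <h2>Second title</h2>
</div>"""


def extract_titles(text: str) -> list[str]:
    """Return the <h2> text of every item block in `text`."""
    # re.DOTALL lets "." match newlines, so one pattern can span lines.
    # ".*?" is lazy: it stops at the first "</h2>" instead of the last.
    pattern = re.compile(r'<div class="item">\s*<h2>(.*?)</h2>', re.DOTALL)
    return pattern.findall(text)


if __name__ == "__main__":
    for title in extract_titles(SAMPLE):
        print(title)
```

In a pipeline, the same function could be fed from `sys.stdin.read()` so the whole document is matched at once rather than line by line.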

I would also recommend Python's sh[4] module if Shell scripting isn't your cup of tea. You get the best of both worlds: faster dev work with Bash utils, and a saner syntax.

[1]: https://github.com/ericchiang/pup

[2]: https://csvkit.readthedocs.io/en/latest/

[3]: https://stedolan.github.io/jq/

[4]: https://pypi.org/project/sh/

5 comments

plainnoodles 1409 days ago

My main qualms with bash as a scripting language are that its syntax is not only kind of bonkers (no judgement, I know it's an old tool) but also just crazily unsafe. I link to a few high-profile things whenever people ask me why my mantra is "the time to switch your script from bash to python is when you want to delete things".

>rm -rf /usr /lib/nvidia-current/xorg/xorg

https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commi...

>rm -rf "$STEAMROOT/"*

https://github.com/valvesoftware/steam-for-linux/issues/3671

It's just too easy to shoot your foot.
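As a rough sketch of what "switch to Python when you want to delete things" could look like, the function below refuses to delete anything unless the target is non-empty and strictly inside an expected base directory. The names `remove_children`, `root`, and `allowed_base` are hypothetical; this is one possible guard, not a fix taken from either linked project.

```python
import shutil
import tempfile
from pathlib import Path


def remove_children(root: str, allowed_base: str) -> int:
    """Delete everything inside `root`; return how many entries were removed.

    Raises ValueError if `root` is empty, is not a directory, or does not
    sit strictly inside `allowed_base`.
    """
    # An empty string would otherwise resolve to the current directory.
    if not root:
        raise ValueError("root is empty; refusing to delete anything")
    root_path = Path(root).resolve()
    base_path = Path(allowed_base).resolve()
    # `parents` excludes the path itself, so root == base is rejected too.
    if base_path not in root_path.parents:
        raise ValueError(f"{root_path} is not inside {base_path}")
    if not root_path.is_dir():
        raise ValueError(f"{root_path} is not a directory")

    removed = 0
    for child in root_path.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()  # files and symlinks
        removed += 1
    return removed


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as base:
        demo_root = Path(base) / "steam"
        (demo_root / "cache").mkdir(parents=True)
        (demo_root / "config.txt").write_text("x")
        print(remove_children(str(demo_root), base))
```

The point of the design is that an unset or empty variable becomes an exception rather than a path that silently expands to something much larger.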

shubhamjain 1409 days ago

There are a couple of flags you can use to mitigate the safety risks. `set -u`, for instance, will throw an error if an unbound variable is used. I always start my scripts with

> set -euo pipefail

Here's a detailed explanation of all the switches: https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e82....

I do agree though, it's not the best tool. But combining CLI utilities tends to be fast.
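For comparison, Python has rough analogues of two of those switches. This sketch assumes two hypothetical helpers: `require_env` fails on a missing variable (similar in spirit to `set -u`, though stricter, since it also rejects empty values), and `run_checked` raises when a command exits non-zero (similar to `set -e`). There is no single-flag equivalent of `pipefail` here.

```python
import os
import subprocess
import sys


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise KeyError.

    Stricter than `set -u`: an empty value is rejected as well.
    """
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"environment variable {name} is unset or empty")
    return value


def run_checked(argv: list[str]) -> str:
    """Run a command and return its stdout.

    check=True makes subprocess raise CalledProcessError on a non-zero
    exit status instead of carrying on silently.
    """
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # sys.executable is used only so the demo runs on any platform.
    print(run_checked([sys.executable, "-c", "print('ok')"]), end="")
```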

simonw 1409 days ago

For things like regular expressions, it's useful to know that Python has a "-c" option which can be passed a multi-line string as part of a CLI pipeline. You can do something like this:

    curl 'https://news.ycombinator.com/' | python -c '
    import sys, re, json
    html = sys.stdin.read()
    # [^\"]* stops at the first closing quote; a greedy .* can run past it
    r = re.compile("<a href=\"([^\"]*)\"")
    print(json.dumps(r.findall(html), indent=2))
    '

This outputs JSON which you can then pipe to other tools.
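Regexes over HTML can miss links whose `href` is not the first attribute. As an alternative sketch (a swapped-in technique, not what the snippet above does), the standard library's `html.parser` can collect `href` values from `<a>` tags. `LinkCollector` and `extract_links` are names invented for this example.

```python
import json
import sys
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html: str) -> list[str]:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # Reads HTML from stdin and prints a JSON array, e.g. after `curl ... |`.
    print(json.dumps(extract_links(sys.stdin.read()), indent=2))
```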

shubhamjain 1409 days ago

This is great. Perl also has one-liners [1] one can use, but I gave up dealing with Perl's obscure syntax. This is much better.

[1]: http://novosial.org/perl/one-liner/

infinite8s 1409 days ago

That's great! I didn't know -c supported multiline - I always just crammed it into one line with semicolons.

simonw 1409 days ago

Yeah, I was the same - I only figured out the multi-line trick a few days ago: https://til.simonwillison.net/aws/boto-command-line

simonw 1409 days ago

My shot-scraper CLI tool was designed to support this kind of workflow but with a real headless browser inserted into the mix. Means you can do things like this:

    shot-scraper javascript \
      "https://news.ycombinator.com/from?site=simonwillison.net" "
    Array.from(document.querySelectorAll('.itemlist .athing')).map(el => {
      const title = el.querySelector('a.titlelink').innerText;
      const points = parseInt(el.nextSibling.querySelector('.score').innerText);
      const url = el.querySelector('a.titlelink').href;
      const dt = el.nextSibling.querySelector('.age').title;
      const submitter = el.nextSibling.querySelector('.hnuser').innerText;
      const commentsUrl = el.nextSibling.querySelector('.subtext a:last-child').href;
      const id = commentsUrl.split('?id=')[1];
      const numComments = parseInt(
        Array.from(
          el.nextSibling.querySelectorAll('.subtext a[href^=item]')
        ).slice(-1)[0].innerText.split()[0]
      ) || 0;
      return {id, title, url, dt, points, submitter, commentsUrl, numComments};
    })
    " | jq '. | map(.numComments) | add'

That example scrapes a page on Hacker News by running JavaScript inside headless Chromium, outputs the results as JSON to stdout, then pipes them into jq to add them up. It outputs "1274".

https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...

(Fun side note: I figured out the jq recipe I'm using in this example using GPT-3: https://til.simonwillison.net/gpt3/jq )
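For readers less familiar with jq, the final `map(.numComments) | add` step can be sketched in Python as follows. It assumes the input is a JSON array of objects shaped like the ones returned above; `total_comments` is a hypothetical name. One difference: jq's `add` yields `null` for an empty array, while this returns 0.

```python
import json
import sys


def total_comments(json_text: str) -> int:
    """Sum the numComments field across a JSON array of story objects."""
    items = json.loads(json_text)
    # Treat a missing numComments as 0 rather than failing.
    return sum(item.get("numComments", 0) for item in items)


if __name__ == "__main__":
    # Intended to sit at the end of a pipeline in place of the jq call.
    print(total_comments(sys.stdin.read()))
```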

jamescampbell 1408 days ago

Badass as always, Simon. I prefer my custom Cloudflare-killer headless Chrome Python code. But this is cool for quick things.

Mrdarknezz 1409 days ago

This comment has the same energy as the comment that Dropbox could just be replaced by FTP: https://news.ycombinator.com/item?id=9224

valarauko 1409 days ago

Every time that comment is brought up, I can't help but feel that people are deliberately fishing for the comparison.

hhthrowaway1230 1409 days ago

Also a fan. Usually I generate indexes/urls, and then just wget and scrape the content offline once all is downloaded.
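A minimal sketch of that generate-then-scrape-offline workflow, under stated assumptions: the URL template, the `{page}` placeholder, and the helper names are hypothetical, and the download step is left to something like `wget -i urls.txt` between the two functions. The offline step here only pulls each saved page's `<title>` as a stand-in for real extraction.

```python
import re
import tempfile
from pathlib import Path

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def build_urls(template: str, first: int, last: int) -> list[str]:
    """Expand a template such as '...?page={page}' for pages first..last."""
    return [template.format(page=n) for n in range(first, last + 1)]


def titles_from_dir(directory: str) -> dict[str, str]:
    """Map each saved *.html file name to its <title> text, if it has one."""
    titles = {}
    for path in sorted(Path(directory).glob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        match = TITLE_RE.search(text)
        if match:
            titles[path.name] = match.group(1).strip()
    return titles


if __name__ == "__main__":
    # Step 1: write the URL list that a downloader would consume.
    urls = build_urls("https://example.com/list?page={page}", 1, 3)
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "urls.txt").write_text("\n".join(urls) + "\n")
        # Step 2 (not run here): download the pages into workdir.
        # Step 3: scrape whatever has been saved locally.
        print(titles_from_dir(workdir))
```

Splitting download from parsing means the extraction logic can be rerun and tweaked without hitting the site again.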
