by simonw 694 days ago

I built a CLI tool (and Python library) for this a while ago called strip-tags: https://github.com/simonw/strip-tags

By default it will strip all HTML tags and return just the text:

    curl 'https://simonwillison.net/' | strip-tags
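
To make the idea concrete, here is a minimal standard-library sketch of "drop every tag, keep the text". It is not how strip-tags is implemented; the names `TextOnly` and `text_only` are made up for illustration, and skipping `script`/`style` contents is an assumption of this sketch.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and drop every tag (illustrative only)."""

    # Assumption: script/style contents are not useful text, so skip them.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def text_only(html: str) -> str:
    """Return the text content of an HTML string with all tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

Calling `text_only("<p>Hello <b>world</b></p>")` gives back just the words, which is roughly the shape of what ends up in the LLM prompt.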

You can also tell it to return only the parts of a page matched by one or more CSS selectors:

    curl 'https://simonwillison.net/' | strip-tags .quote
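
A rough sketch of the selection step, again using only the standard library. It handles just a single class name (like `.quote`), not full CSS selectors, and it assumes every non-void element is explicitly closed. `ByClass` and `select_class` are hypothetical names, not part of strip-tags.

```python
from html.parser import HTMLParser


class ByClass(HTMLParser):
    """Collect the text inside elements that carry a given class."""

    # Void elements never get an end tag, so they must not affect depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, class_name):
        super().__init__(convert_charrefs=True)
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.matches = []   # one list of text chunks per matching element

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.matches.append([])

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1].append(data)


def select_class(html: str, class_name: str) -> list:
    """Return the text of each outermost element with the given class."""
    parser = ByClass(class_name)
    parser.feed(html)
    parser.close()
    return ["".join(chunks) for chunks in parser.matches]
```

Real selector support needs a proper HTML parser and selector engine; this only shows why narrowing to one region of the page cuts the input down so much.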

Or you can ask it to keep specific tags if you think those might help provide extra context to the LLM:

    curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote
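
As an illustration of what "keep specific tags" means, this sketch re-emits only the tags named in a keep list and drops all others. It is not the strip-tags implementation: `KeepSomeTags` and `keep_tags` are hypothetical names, and keeping only the `class` and `id` attributes is a choice made for this sketch.

```python
from html import escape
from html.parser import HTMLParser


class KeepSomeTags(HTMLParser):
    """Drop all tags except the ones listed in `keep`."""

    def __init__(self, keep):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            # Sketch-level choice: preserve only class and id attributes.
            kept = "".join(
                f' {name}="{escape(value or "")}"'
                for name, value in attrs
                if name in ("class", "id")
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text was unescaped by the parser; re-escape it for the output.
        self.out.append(escape(data, quote=False))


def keep_tags(html: str, keep) -> str:
    """Return `html` with every tag removed except those named in `keep`."""
    parser = KeepSomeTags(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The surviving `div` and `blockquote` markers give the model a cheap hint about where each quote starts and ends.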

Add "-m" to minify the output (basically stripping most whitespace).

Running this command:

    curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote -m

Gives me back output that starts like this:

    <div class="quote segment"> <blockquote>history | tail -n
    2000 | llm -s "Write aliases for my zshrc based on my
    terminal history. Only do this for most common features.
    Don't use any specific files or directories."</blockquote> —
    anjor  #
    3:01 pm
    / ai, generative-ai, llms, llm  </div>
    <div class="quote segment"> <blockquote>Art is notoriously
    hard to define, and so are the differences between good art
    and bad art. But let me offer a generalization: art is
    something that results from making a lot of choices. […] to
    oversimplify, we can imagine that a ten-thousand-word short
    story requires something on the order of ten thousand
    choices. When you give a generative-A.I. program a prompt,
    you are making very few choices; if you supply a hundred-word
    prompt, you have made on the order of a hundred choices. If
    an A.I. generates a ten-thousand-word story based on your
    prompt, it has to fill in for all of the choices that you are
    not making.</blockquote> — Ted Chiang  #
    10:09 pm
    / art, new-yorker, ai, generative-ai, ted-chiang  </div>
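
The minification step can be approximated in a few lines. This is a guess at the general idea (collapse runs of whitespace), not the exact rules "-m" applies; `minify` here is a hypothetical helper.

```python
import re


def minify(text: str) -> str:
    """Collapse redundant whitespace to save tokens (approximation)."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse any whitespace run containing 2+ newlines into one blank line.
    text = re.sub(r"\s*\n\s*\n\s*", "\n\n", text)
    return text.strip()
```

Whitespace is a surprisingly large share of scraped HTML, so even something this crude trims the prompt noticeably.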

I also often use the https://r.jina.ai/ proxy: append a URL to it and it extracts the key content (using Puppeteer) and returns it converted to Markdown, e.g. https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato...
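
From Python, the proxy trick is just string concatenation plus an HTTP GET. A small sketch, assuming the service returns UTF-8 Markdown in the response body; the function names and the User-Agent string are made up for illustration.

```python
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the proxy URL by prefixing the target page URL."""
    return READER_PREFIX + page_url


def fetch_markdown(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the proxy's rendering of a page (requires network access)."""
    request = Request(
        reader_url(page_url),
        headers={"User-Agent": "reader-sketch/0.1"},  # hypothetical UA string
    )
    with urlopen(request, timeout=timeout) as response:
        # Assumption: the body is UTF-8 text.
        return response.read().decode("utf-8", errors="replace")
```

`reader_url("https://simonwillison.net/")` produces `https://r.jina.ai/https://simonwillison.net/`, the same shape as the example link above.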