by simonw
647 days ago
I built a CLI tool (and Python library) for this a while ago called strip-tags: https://github.com/simonw/strip-tags

By default it will strip all HTML tags and return just the text:

  curl 'https://simonwillison.net/' | strip-tags

But you can also tell it you just want to get back the area of a page identified by one or more CSS selectors:

  curl 'https://simonwillison.net/' | strip-tags .quote

Or you can ask it to keep specific tags if you think those might help provide extra context to the LLM:

  curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote

Add "-m" to minify the output (basically stripping most whitespace).

Running this command:

  curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote -m
Gives me back output that starts like this:

<div class="quote segment"> <blockquote>history | tail -n
2000 | llm -s "Write aliases for my zshrc based on my
terminal history. Only do this for most common features.
Don't use any specific files or directories."</blockquote> —
anjor #
3:01 pm
/ ai, generative-ai, llms, llm </div>
<div class="quote segment"> <blockquote>Art is notoriously
hard to define, and so are the differences between good art
and bad art. But let me offer a generalization: art is
something that results from making a lot of choices. […] to
oversimplify, we can imagine that a ten-thousand-word short
story requires something on the order of ten thousand
choices. When you give a generative-A.I. program a prompt,
you are making very few choices; if you supply a hundred-word
prompt, you have made on the order of a hundred choices. If
an A.I. generates a ten-thousand-word story based on your
prompt, it has to fill in for all of the choices that you are
not making.</blockquote> — Ted Chiang #
10:09 pm
/ art, new-yorker, ai, generative-ai, ted-chiang </div>
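To make the idea behind the options above concrete, here is a minimal sketch of tag stripping using only the Python standard library. It is not the strip-tags implementation or its API: the names `TagStripper` and `strip_html` are hypothetical, it does not support CSS selectors, and unlike the output shown above it drops all attributes from the tags it keeps.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content, re-emitting only the tags named in keep_tags.

    Hypothetical helper for illustration; not part of strip-tags.
    """

    def __init__(self, keep_tags=()):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in self.keep_tags:
            # Attributes are deliberately dropped in this sketch.
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Ignore the contents of script and style elements.
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html, keep_tags=(), minify=False):
    """Return the text of `html`, keeping only the tags in keep_tags."""
    parser = TagStripper(keep_tags)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    if minify:
        # Collapse every run of whitespace into a single space.
        text = " ".join(text.split())
    return text


if __name__ == "__main__":
    sample = (
        '<div class="quote"><blockquote>Hello   <b>world</b></blockquote>'
        "<script>x=1</script></div>"
    )
    print(strip_html(sample, keep_tags=["blockquote"], minify=True))
    # prints: <blockquote>Hello world</blockquote>
```

Keeping a small set of structural tags, as with `-t div -t blockquote` above, leaves the model some hints about where one item ends and the next begins while still discarding most of the markup.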
I also often use the https://r.jina.ai/ proxy - append a page URL to that address and it extracts the key content (using Puppeteer) and returns it converted to Markdown, e.g. https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato...
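The r.jina.ai pattern described above comes down to putting the proxy address in front of the page URL. A small sketch, where `reader_url` and `fetch_markdown` are hypothetical helper names, and where the fetch step assumes the proxy answers a plain GET with the Markdown as UTF-8 text, with no extra headers required:

```python
from urllib.request import urlopen

JINA_READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the proxy URL by prefixing the page URL with r.jina.ai."""
    return JINA_READER_PREFIX + page_url


def fetch_markdown(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the Markdown rendering of a page through the proxy.

    Assumption: a plain GET returns UTF-8 text; needs network access.
    """
    with urlopen(reader_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(reader_url("https://simonwillison.net/"))
    # prints: https://r.jina.ai/https://simonwillison.net/
```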