Hacker News
by andrethegiant 429 days ago
It’s a shame that the knee-jerk reaction has been to outright block these bots. I think in the future, websites will learn to serve pure markdown to these bots instead of blocking. That way, websites prevent bandwidth overages like in the article, while still informing LLMs about the services their website provides.

[disclaimer: I run https://pure.md, which helps websites shield from this traffic]

9 comments

> I think in the future, websites will learn to serve pure markdown to these bots instead of blocking. That way, websites prevent bandwidth overages like in the article, while still informing LLMs about the services their website provides.

Why?

There's no value to the website for a bot scraping all of their content and then reselling it with no credit or payment to the original author.

Unless you're selling something. If you have articles praising your product/service/person and "comparison" articles of the "top 10 X 2025" (your offering happens to be number one) you want the bots to find you.

The LLM SEO game has only just begun. Things will only go downhill from here.

Or technical docs. For example:

https://bun.sh/llm.txt

I love that! That's one of my biggest pain points: wrong/outdated usage of dependencies.
OP in this case is by no means the original author. In this linked post, they mentioned they scrape third parties themselves. OP's bots might not be as sophisticated, but they're still "borrowing" others' content the same way.
ChatGPT and others have some sort of attribution, where they link to the original webpage. How or when they decide to attribute is unclear. But websites are starting to pay attention to GEO (generative engine optimization) so that their brand isn’t entirely ignored by ChatGPT and others.
I do agree that LLM-as-a-search is likely to become more and more prevalent as inference gets cheaper and faster, and people don't care too much about 'minor' hallucinations.

What I don't see, however, is any way this new way of searching will give back. There is some hand-waving argument about links, but the entire value prop of an LLM is that you DON'T need to go to the source content.

could have just left it as SEO and changed the S to "Slop"
Until these bots become good citizens (eg: respecting robots.txt), I will be serving them gzipped gibberish that decompresses to terabytes.

The ball is in their court. You don’t get to demand civility AFTER being a dick. You apologize and HOPE you’re forgiven.

What do you reckon, does OP in this post respect robots.txt or do they "borrow" content in a similar manner, without respecting such standards?
I thought the AI wars would be fought with bombs vs. bots not with ZIP bombs vs. bots.
> I think in the future, websites will learn to serve pure markdown to these bots instead of blocking.

What for? Why would I serve anything to these leeches?

Because you, in this case OP, also generate bot traffic to "borrow" content from other websites to serve to your own users. Ironic, no?
Ah. I didn't realize andrethegiant's website was meant to _serve content_ to the bots, instead of _shielding us_ from the bots.

I fell for a classic "To Serve Man" situation.

I think you're a bit late to the game ;) I built and sold 2markdown last year, which was then copied by firecrawl/mendable. And then you also have jina reader. Also "compare with" in the footer does nothing.
Markdown over HTTPS reminds me a bit of the gemini protocol:

https://en.wikipedia.org/wiki/Gemini_(protocol)

If only there were some way for websites to serve information and provide interactivity in a machine readable format. Like some sort of application programming interface. You could even return different formats based on some sort of 4-letter code at the end of a URL like .html, .json, .xml, etc.

And what if there was some standard sort of way for robots to tell your site what they're trying to do with some sort of verb like GET, PUT, POST, DELETE etc. They could even use a standard way to name the resource they're trying to interact with. Like a universal resource finder of some kind. You could even use identifiers to be specific! Like /items/ gives you a list of items and /items/1.json gives you data about a specific item.

That would be so awesome. The future is amazing.

The only thing that would make that even more perfect would be if there was some way for the site owner to signal to prospective bots which parts of the site are open to the bots to visit. I know this seems really complicated, but I really think it could be expressed in a simple text file.
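For anyone who wants the "simple text file" spelled out, here is a minimal sketch of how a well-behaved crawler could check it, using Python's standard `urllib.robotparser`. The bot name `ExampleBot`, the paths, and the `example.com` URLs are made up for illustration, and this only covers crawlers that choose to check at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot name and paths are invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# ExampleBot may read public pages but not /private/.
docs_allowed = parser.can_fetch("ExampleBot", "https://example.com/docs/intro")
private_allowed = parser.can_fetch("ExampleBot", "https://example.com/private/x")

# Any other bot falls through to the catch-all "*" group and is refused.
other_allowed = parser.can_fetch("OtherBot", "https://example.com/docs/intro")

print(docs_allowed, private_allowed, other_allowed)
```

The file only states the site owner's wishes. Nothing in it forces a crawler to call `can_fetch` before requesting a page.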
I don't know, robots.txt sounds too complicated for 2025
I would have worded this as "it sounds too simple for 2025".
Accept: and rel="alternate" were literally made for this
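To make that concrete: a page can advertise a markdown version with something like `<link rel="alternate" type="text/markdown" href="/page.md">` (the `href` is a hypothetical path), and the server can pick a representation from the request's `Accept` header. Below is a rough sketch of the server-side choice. It is a simplified reading of content negotiation that ranks by `q` value only and ignores the specificity rules, and the function names are my own.

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the client does not send one
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda e: e[1], reverse=True)


def choose_representation(accept: str,
                          available=("text/html", "text/markdown")) -> str:
    """Pick the media type to serve; fall back to the first available one."""
    for media_type, q in parse_accept(accept):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/*" becomes "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
    return available[0]


print(choose_representation("text/markdown, text/html;q=0.8"))
print(choose_representation("text/html,application/xhtml+xml,*/*;q=0.8"))
```

A client that prefers markdown gets markdown, a browser gets HTML, and nobody has to sniff user agents.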
how would one serve them .txt instead?
Add a Cloudflare snippet / some other edge function, and transform the response to convert to plaintext
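The transform step itself could look roughly like the sketch below. It is plain Python using only the standard-library `html.parser`, not Cloudflare's API (an edge function there would be written in JavaScript), and the class and function names are invented for illustration. It drops script and style content and keeps the visible text, one block per line.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style and breaking on block tags."""

    SKIP = {"script", "style", "noscript"}
    BLOCK = {"p", "div", "br", "li", "tr", "section", "article",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        raw = "".join(self._chunks)
        # Collapse runs of whitespace and drop empty lines.
        lines = [" ".join(line.split()) for line in raw.splitlines()]
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()


sample = ("<html><head><style>p{color:red}</style><script>alert(1)</script>"
          "</head><body><h1>Title</h1><p>Hello &amp; <b>world</b></p></body></html>")
print(html_to_text(sample))
```

The result would then be served with `Content-Type: text/plain`, which is typically much smaller than the original page.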
Cool globe graphic on that site :)
or you know, AI crawlers could behave and get all that without any extra work for everybody. What makes you think they'll suddenly respect your scheme?