| HN Mirror


	by andrethegiant 429 days ago
	It’s a shame that the knee-jerk reaction has been to outright block these bots. I think in the future, websites will learn to serve pure markdown to these bots instead of blocking. That way, websites prevent bandwidth overages like in the article, while still informing LLMs about the services their website provides. [disclaimer: I run https://pure.md, which helps websites shield from this traffic]

9 comments

mtlynch 429 days ago

>I think in the future, websites will learn to serve pure markdown to these bots instead of blocking. That way, websites prevent bandwidth overages like in the article, while still informing LLMs about the services their website provides.

Why?

There's no value to the website for a bot scraping all of their content and then reselling it with no credit or payment to the original author.

wongarsu 429 days ago

Unless you're selling something. If you have articles praising your product/service/person and "comparison" articles of the "top 10 X 2025" (your offering happens to be number one) you want the bots to find you.

The LLM SEO game has only just begun. Things will only go downwards from here

sroussey 429 days ago

Or technical docs. For example:

https://bun.sh/llm.txt

RamblingCTO 429 days ago

I love that! That's one of my biggest pain points: wrong/outdated usage of dependencies.

randunel 429 days ago

OP in this case is by no means the original author. In this linked post, they mentioned they scrape third parties themselves. OP's bots might not be as sophisticated, but they're still "borrowing" others' content the same way.

andrethegiant 429 days ago

ChatGPT and others have some sort of attribution, where they link to the original webpage. How or when they decide to attribute is unclear. But websites are starting to pay attention to GEO (generative engine optimization) so that their brand isn’t entirely ignored by ChatGPT and others.

Incipient 429 days ago

I do agree that LLM-as-a-search is likely to become more and more prevalent as inference gets cheaper and faster, and people don't care too much about 'minor' hallucinations.

What I don't see however is any way this new way of searching will give back. There is some handwaving argument about links, however the entire value prop of an llm is you DON'T need to go to the source content.

genewitch 429 days ago

could have just left it as SEO and changed the S to "Slop"

dmitrygr 429 days ago

Until these bots become good citizens (eg: respecting robots.txt), I will be serving them gzipped gibberish that decompresses to terabytes.
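
For a sense of scale, here is a minimal sketch of why highly repetitive data makes such a small payload. It is not the commenter's setup; the helper name `gzip_zeros` and the 10 MiB size are assumptions chosen only for illustration.

```python
import gzip
import io


def gzip_zeros(n_bytes: int, chunk: int = 1 << 20) -> bytes:
    """Gzip `n_bytes` of zero bytes, writing in chunks to keep memory use low."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
        zeros = bytes(chunk)
        remaining = n_bytes
        while remaining > 0:
            step = min(chunk, remaining)
            gz.write(zeros[:step])
            remaining -= step
    return buf.getvalue()


original = 10 * 1024 * 1024  # 10 MiB of zeros (illustrative size)
payload = gzip_zeros(original)
print(f"{original} bytes -> {len(payload)} bytes (about {original // len(payload)}:1)")
```

DEFLATE tops out at roughly 1000:1, so a single-layer gzip body that expands to terabytes would itself be on the order of gigabytes.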

The ball is in their court. You don’t get to demand civility AFTER being a dick. You apologize and HOPE you’re forgiven.

randunel 429 days ago

What do you reckon, does OP in this post respect robots.txt or do they "borrow" content in a similar manner, without respecting such standards?

AlienRobot 429 days ago

I thought the AI wars would be fought with bombs vs. bots not with ZIP bombs vs. bots.

pavel_lishin 429 days ago

> I think in the future, websites will learn to serve pure markdown to these bots instead of blocking.

What for? Why would I serve anything to these leeches?

randunel 429 days ago

Because you, in this case OP, also generate bot traffic to "borrow" content from other websites to serve to your own users. Ironic, no?

pavel_lishin 428 days ago

Ah. I didn't realize andrethegiant's website was meant to _serve content_ to the bots, instead of _shielding us_ from the bots.

I fell for a classic "To Serve Man" situation.

RamblingCTO 429 days ago

I think you're a bit late to the game ;) I built and sold 2markdown last year, which was then copied by firecrawl/mendable. And then you also have jina reader. Also "compare with" in the footer does nothing.

riffic 429 days ago

Markdown over HTTPS reminds me a bit of the gemini protocol:

https://en.wikipedia.org/wiki/Gemini_(protocol)

Swizec 429 days ago

If only there were some way for websites to serve information and provide interactivity in a machine readable format. Like some sort of application programming interface. You could even return different formats based on some sort of 4-letter code at the end of a URL like .html, .json, .xml, etc.

And what if there was some standard sort of way for robots to tell your site what they're trying to do with some sort of verb like GET, PUT, POST, DELETE etc. They could even use a standard way to name the resource they're trying to interact with. Like a universal resource finder of some kind. You could even use identifiers to be specific! Like /items/ gives you a list of items and /items/1.json gives you data about a specific item.

That would be so awesome. The future is amazing.

marcusb 429 days ago

The only thing that would make that even more perfect would be if there was some way for the site owner to signal to prospective bots which parts of the site are open to the bots to visit. I know this seems really complicated, but I really think it could be expressed in a simple text file.

tough 429 days ago

i don't know, robots.txt sounds too complicated for 2025
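
For anyone who has not written one, a minimal sketch of what a well-behaved crawler does with robots.txt, using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the rules, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named bot everywhere,
# and keep everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(allowed("ExampleAIBot", "https://example.com/docs/"))     # False
print(allowed("SomeBrowser", "https://example.com/docs/"))      # True
print(allowed("SomeBrowser", "https://example.com/private/x"))  # False
```

The file is only advisory: the check runs on the crawler's side, so it helps only with bots that choose to honor it.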

thwarted 429 days ago

I would have worded this as "it sounds too simple for 2025".

mubou 429 days ago

Accept: and rel="alternate" were literally made for this
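
A rough sketch of the server side of that idea: pick a representation from the request's `Accept` header. The function names are hypothetical, and the parsing is deliberately simplified (it honors q-values but ignores the specificity rules a full implementation would apply).

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    items = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        items.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(items, key=lambda item: item[1], reverse=True)


def choose_representation(
    accept: str,
    available: tuple[str, ...] = ("text/html", "text/markdown"),
) -> str:
    """Return the media type to serve; fall back to the first available one."""
    for media_type, q in parse_accept(accept):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
    return available[0]


print(choose_representation("text/markdown, text/html;q=0.8"))             # text/markdown
print(choose_representation("text/html,application/xhtml+xml,*/*;q=0.8"))  # text/html
```

A response negotiated this way would normally also send `Vary: Accept`, and the HTML version can advertise the other format with something like `<link rel="alternate" type="text/markdown" href="...">`.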

tough 429 days ago

how would one serve them .txt instead?

andrethegiant 429 days ago

Add a Cloudflare snippet / some other edge function, and transform the response to convert to plaintext
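
The transform step itself might look something like the sketch below. Cloudflare snippets are written in JavaScript, so this Python version only illustrates the HTML-to-plaintext conversion, not an actual edge deployment; the class and function names are made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, dropping script/style and breaking on block tags."""

    SKIP = {"script", "style", "noscript"}
    BLOCK = {"title", "p", "div", "br", "li", "tr",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ("<html><head><title>T</title><style>p{color:red}</style></head>"
        "<body><h1>Hello</h1><p>World &amp; more</p>"
        "<script>alert(1)</script></body></html>")
print(html_to_text(page))
```

The response would then go out with `Content-Type: text/plain` instead of `text/html`.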

happyzappy 429 days ago

Cool globe graphic on that site :)

detaro 429 days ago

or you know, AI crawlers could behave and get all that without any extra work for everybody. What makes you think they'll suddenly respect your scheme?
