| HN Mirror


by johannes1234321 438 days ago

While the dump may be simpler to consume, building it isn't simpler.

The generic web crawler works (more or less) everywhere. The Wikipedia dump solution works on Wikipedia dumps.

Also keep in mind: this is tied in with search engines and other places where the AI bot follows links from search results. So they'd need extra logic to detect a Wikipedia link, find the matching article in the dump, and then add the original link back as the reference for the source.
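To make that extra logic concrete, here is a rough Python sketch of the special case a generic crawler would have to carry. Everything in it is an assumption for illustration: `dump_index` is a hypothetical in-memory mapping from (language, title) to article text, `crawl` stands in for whatever the bot's normal fetcher is, and real dumps would need proper parsing, redirect handling, and title normalization.

```python
from urllib.parse import urlparse, unquote


def wikipedia_title(url):
    """Return (language, title) if url looks like a Wikipedia article, else None."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if not host.endswith(".wikipedia.org") or not parts.path.startswith("/wiki/"):
        return None
    lang = host.split(".")[0]  # "en" for en.wikipedia.org and en.m.wikipedia.org
    # Article URLs use underscores and percent-encoding; undo both.
    title = unquote(parts.path[len("/wiki/"):]).replace("_", " ")
    return (lang, title) if title else None


def fetch(url, dump_index, crawl):
    """Serve Wikipedia links from the dump, everything else from the live web.

    dump_index and crawl are hypothetical stand-ins (see the note above).
    The original URL is always kept as the source reference.
    """
    key = wikipedia_title(url)
    if key is not None and key in dump_index:
        return {"text": dump_index[key], "source": url}
    # Not Wikipedia, or not in the dump: fall back to the generic crawler.
    return {"text": crawl(url), "source": url}
```

Something like `fetch("https://en.wikipedia.org/wiki/Web_crawler", dump_index, crawl)` would then return the dump text with the original link attached. It is not much code, but it is per-site code that has to be written, populated from the dump, and maintained, which the generic crawler path never needs.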

Also, one article I read on this mentioned traffic spikes around events such as people's deaths. In that scenario they want the latest version of the article, not a day-old dump.

So yeah, I guess they went with the simple, straightforward way and didn't care much about the consequences.

1 comment

diggan 438 days ago

I'm not sure this is what is currently affecting them the most. The article mentions this:

> Since AI crawlers tend to bulk read pages, they access obscure pages that have to be served from the core data center.

So it doesn't seem to be driven by "Search the web for keywords, follow links, slurp content" but by reading a batch of pages all together, then moving on to the next batch. That suggests mass ingestion, not acting as a user agent for an actual user.

But maybe I'm reading too much into the specifics of the article. I'll confess I don't have any particular inside knowledge of the problem they're facing.