by KMag
1316 days ago
If you worked on Google's crawl scheduling, HN would be one of the sites you used to test out ideas for better scheduling heuristics, right? I worked in indexing over a decade ago, but back then, after some basic constraints (per-IP rate limiting, don't re-check any page for updates too often, don't wait a crazy long time before re-checking any page, etc.), it was a bunch of arcane black magic heuristics to schedule pages for crawling. These days, I imagine they have one ML model for the expected time until a given page shows up on the first page of search results for some query, another ML model for guessing how much the page has changed (cosine distance of some semantic embedding or something), and schedule based on the product of the two estimates. It's still probably lots of black magic heuristics, just now it's probably heuristics nobody can read.
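To make the guess above concrete, here is a rough Python sketch of that kind of scheduler. It is purely illustrative and not a description of how Google actually does it. Every name and number is an assumption: `Page`, `priority`, `schedule`, the one-hour and 30-day bounds, and the per-host cap (standing in for per-IP rate limiting, since no DNS lookup is done here). The two model outputs are plain fields, and the expected time until the page surfaces is turned into an urgency (sooner means higher) so that a larger product means "crawl sooner".

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlsplit

# Assumed hard constraints; the real values are unknown.
MIN_RECHECK_S = 3600.0        # don't re-check a page more than once an hour
MAX_RECHECK_S = 30 * 86400.0  # don't leave a page unchecked for over 30 days


@dataclass
class Page:
    url: str
    seconds_since_crawl: float
    hours_until_surfaced: float  # stand-in for model 1 (expected time to a first results page)
    change_estimate: float       # stand-in for model 2 (0 = unchanged, 1 = rewritten)


def priority(page: Page) -> float:
    """Score a page; higher means crawl sooner."""
    if page.seconds_since_crawl < MIN_RECHECK_S:
        return 0.0              # checked too recently, skip
    if page.seconds_since_crawl >= MAX_RECHECK_S:
        return float("inf")     # overdue, always crawl
    # Assumption: convert expected time into an urgency in (0, 1].
    urgency = 1.0 / (1.0 + page.hours_until_surfaced)
    return urgency * page.change_estimate


def schedule(pages: list[Page], budget: int, max_per_host: int = 2) -> list[Page]:
    """Pick up to `budget` pages by priority, capping picks per host."""
    picked: list[Page] = []
    per_host: Counter = Counter()
    for page in sorted(pages, key=priority, reverse=True):
        if len(picked) == budget or priority(page) <= 0.0:
            break  # budget spent, or only ineligible pages remain
        host = urlsplit(page.url).hostname
        if per_host[host] >= max_per_host:
            continue  # crude per-host rate limit
        per_host[host] += 1
        picked.append(page)
    return picked
```

The hard constraints are applied as overrides around the product rather than folded into it, which matches the "basic constraints first, heuristics after" split described above. In a real system the two fields would be model predictions, and that is where the unreadable part would live.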