by idle_processor
3046 days ago
- A regex blacklist for URL structure would be very helpful.
- The acceptable title length seems a bit short. 70 characters seems closer to what we're allowed today than the old 60 (per http://www.bigleap.com/blog/5-tips-take-advantage-googles-ne...).

It might also be worth segmenting URLs that contain query parameters into a low-priority batch to check later (or skip). E.g., when spidering a WordPress site, the crawler wastes time on .../article/?replytocom=* URLs. URL filtering solves this to some extent, but it can take multiple passes to identify all of the problematic query strings.
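To make the two URL suggestions concrete, here is a rough Python sketch of what I have in mind. It is not how the tool works today; the `BLACKLIST` patterns, the `classify` and `partition` names, and the three queue labels are all made up for illustration. URLs matching a blacklist regex are skipped, any other URL with a query string goes into a low-priority batch, and everything else is crawled normally.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blacklist patterns; a real list would be user-configurable.
BLACKLIST = [
    re.compile(r"[?&]replytocom="),  # WordPress comment-reply links
    re.compile(r"/wp-admin/"),       # admin pages (assumed uninteresting)
]


def classify(url):
    """Return 'skip', 'low', or 'normal' for a crawl candidate."""
    if any(pattern.search(url) for pattern in BLACKLIST):
        return "skip"
    if urlsplit(url).query:
        # Any remaining query string: defer to the low-priority batch.
        return "low"
    return "normal"


def partition(urls):
    """Group URLs into queues keyed by their classification."""
    queues = {"normal": [], "low": [], "skip": []}
    for url in urls:
        queues[classify(url)].append(url)
    return queues
```

With something like this, a single pass over the low-priority batch would also surface the query strings worth adding to the blacklist, instead of needing several crawl passes to discover them.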