Hacker News
by idle_processor 3046 days ago
- A regex blacklist for URL structure would be very helpful.

- The acceptable title length seems a bit short. 70 characters seems closer to what we're allowed today than the old 60 (per http://www.bigleap.com/blog/5-tips-take-advantage-googles-ne...).

It might also be worth putting URLs that contain query parameters into a separate low-priority batch to check later (or skip entirely).

(E.g., when spidering a WordPress site, the crawler wastes time on .../article/?replytocom=* URLs. URL filtering solves this, somewhat, but it might require multiple passes to identify all of the problematic query strings.)
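
To make the two suggestions concrete, here is a rough Python sketch of what I have in mind. It's not based on how the tool actually works; the names `BLACKLIST` and `triage` and the example.com URLs are made up for illustration. Blacklisted URLs are dropped, and anything else with a query string is deferred to a low-priority batch.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blacklist: regexes for URL shapes the crawler should never fetch.
# The replytocom pattern covers the WordPress case mentioned above.
BLACKLIST = [re.compile(p) for p in (r"[?&]replytocom=",)]


def triage(urls, blacklist=BLACKLIST):
    """Split URLs into (high_priority, low_priority), dropping blacklisted ones."""
    high, low = [], []
    for url in urls:
        # Skip anything matching a blacklist pattern.
        if any(pattern.search(url) for pattern in blacklist):
            continue
        # URLs with any query string are deferred to the low-priority batch.
        if urlsplit(url).query:
            low.append(url)
        else:
            high.append(url)
    return high, low


urls = [
    "http://example.com/article/",
    "http://example.com/article/?replytocom=42",
    "http://example.com/search?page=2",
]
high, low = triage(urls)
```

With that in place, the crawler could work through `high` first and only touch `low` if there is time left. New problem query strings found along the way just become extra blacklist entries, so it wouldn't need as many passes.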