by gaelenh
4676 days ago
Short version: we had direct access to all the large news image providers and analyzed the images to identify topics and stories. We used this data to dynamically generate hundreds of thousands of pages. That was way too much data to cache to file (our storage cluster held many, many terabytes). We used a CDN to cache the images. When a bot scraped the site, it would hit all the old pages that had fallen out of the cache. Perhaps that was our fault for having such a large sitemap.xml.
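To make the cache problem concrete, here is a rough sketch of why a full crawl hurts so much more than normal traffic. It is a toy simulation, not our actual setup: the LRU eviction policy, the `LRUCache` class, and all of the sizes (`TOTAL_PAGES`, `CACHE_SIZE`, `HOT_PAGES`) are assumptions picked only for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used key (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def request(self, key):
        """Return True on a hit; on a miss, fill the cache and return False."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return True
        self.entries[key] = True  # stands in for regenerating the page
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        return False


# Hypothetical numbers, chosen only to show the shape of the problem.
TOTAL_PAGES = 100_000  # every URL listed in the sitemap
CACHE_SIZE = 1_000     # pages the cache can hold at once
HOT_PAGES = 500        # recent pages that human readers actually visit

cache = LRUCache(CACHE_SIZE)

# Normal traffic: readers keep requesting the same small set of recent pages.
reader_results = [cache.request(p) for _ in range(20) for p in range(HOT_PAGES)]
reader_hit_rate = sum(reader_results) / len(reader_results)

# A bot walks the whole sitemap once, including every long-tail page.
crawler_results = [cache.request(p) for p in range(TOTAL_PAGES)]
crawler_hit_rate = sum(crawler_results) / len(crawler_results)

# After the crawl, count how many of the readers' pages are still cached.
hot_pages_still_cached = sum(1 for p in range(HOT_PAGES) if p in cache.entries)

print(f"reader hit rate:  {reader_hit_rate:.3f}")
print(f"crawler hit rate: {crawler_hit_rate:.3f}")
print(f"hot pages left in cache: {hot_pages_still_cached}")
```

With these made-up numbers, readers almost always hit the cache, while nearly every crawler request is a miss that has to be generated from scratch. The crawl also pushes the readers' pages out of the cache, so real users pay for it afterwards too.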