by ajsfoux234 1648 days ago
Where does ArchiveTeam find all the Reddit posts and comments to archive? Do they have a script automatically going through the "New" section, or are they finding posts through Google or link crawling?
2 comments

Besides their ArchiveTeam Warrior distributed crawler, I imagine Pushshift[0] is probably a starting point for them.

[0] https://files.pushshift.io/

In general, ArchiveTeam has scripts that request random links to see whether any content exists at them. They also run coordination servers that record which slugs have already been checked, so workers avoid duplicating effort.
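
To make the coordination idea concrete, here is a minimal sketch of how workers might share a record of checked slugs. It is not ArchiveTeam's actual code: `SlugTracker`, `random_slug`, `worker`, and the `fetch` callback are hypothetical names, the tracker is an in-memory stand-in for a real coordination server, and the 6-character base36 slug format is an assumption made for illustration.

```python
import random
import string

# Assumption: slugs are short base36 strings (digits plus lowercase letters).
ALPHABET = string.digits + string.ascii_lowercase


class SlugTracker:
    """Hypothetical in-memory stand-in for a coordination server."""

    def __init__(self):
        self.claimed = set()   # slugs some worker has already taken
        self.results = {}      # slug -> True if content was found

    def claim(self, slug):
        """Return True if the caller may check this slug, False if it is taken."""
        if slug in self.claimed:
            return False
        self.claimed.add(slug)
        return True

    def report(self, slug, has_content):
        """Record the outcome of checking a slug."""
        self.results[slug] = has_content


def random_slug(rng, length=6):
    """Pick a random candidate slug of the assumed format."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def worker(tracker, rng, fetch, attempts):
    """Try random slugs, skipping any that another worker already claimed."""
    checked = 0
    for _ in range(attempts):
        slug = random_slug(rng)
        if not tracker.claim(slug):
            continue  # someone else has this one; avoid duplicate effort
        tracker.report(slug, fetch(slug))
        checked += 1
    return checked
```

In a real deployment `fetch` would make an HTTP request and the tracker would sit behind a network API; here `fetch` can be any function that takes a slug and returns whether content was found, which keeps the sketch runnable offline.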