| Just for context, the author of the second link in your comment verifiably lied about blocking crawlers via robots.txt. CommonCrawl archives robots.txt files, so this can be checked.
For convenience, you can view the extracted data here: https://pastebin.com/VSHMTThJ
You are welcome to verify for yourself by searching for “wiki.diasporafoundation.org/robots.txt” in the CommonCrawl index here: https://index.commoncrawl.org/
Each index result contains a file name that you can append to the CommonCrawl URL to download the archive and view it. More detailed information on downloading archives is here: https://commoncrawl.org/get-started
From September to December, the robots.txt at wiki.diasporafoundation.org contained this, and only this:
>User-agent: *
>Disallow: /w/ |
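
If you would rather script the check than click through the index, here is a rough sketch of the lookup described above. It is not an official client. It assumes the index server's JSON output (one JSON object per line, with `filename`, `offset` and `length` fields) and the `https://data.commoncrawl.org/` download prefix. The crawl ID in the usage note is only a placeholder; pick the crawls you want from the list on the index page. The function names are my own.

```python
import gzip
import json
import urllib.parse
import urllib.request

INDEX_BASE = "https://index.commoncrawl.org/"
DATA_BASE = "https://data.commoncrawl.org/"  # assumed download prefix


def build_index_query(crawl_id, page_url):
    """Build the index lookup URL for one crawl (crawl_id is caller-supplied)."""
    query = urllib.parse.urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_BASE}{crawl_id}-index?{query}"


def parse_index_lines(text):
    """The index answers with one JSON object per line; skip blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def archive_request(record):
    """Turn an index record into a download URL plus an HTTP Range header."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range end is inclusive
    return DATA_BASE + record["filename"], {"Range": f"bytes={start}-{end}"}


def fetch_capture(record):
    """Download one archived record and return it as text (needs network)."""
    url, headers = archive_request(record)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        # Each record is stored as its own gzip member inside the archive file.
        return gzip.decompress(response.read()).decode("utf-8", errors="replace")
```

To use it, fetch `build_index_query("CC-MAIN-2024-51", "wiki.diasporafoundation.org/robots.txt")` with `urllib.request.urlopen`, pass the response text to `parse_index_lines`, and call `fetch_capture` on each record to see the robots.txt as it was archived. Requesting only the byte range avoids downloading the whole archive file.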