Hacker News
by meiraleal 739 days ago
Seems like a very useful tool to impersonate websites. Useful to scammers. Why would someone crawl their own website?
3 comments

Scammers don't need this to copy an existing website, and I could see plenty of legitimate uses. Maybe you're redoing the website but want to keep the previous site around somewhere, or you want an easy way to archive a site for future reference. Maybe you're tired of paying for some hosted CMS but you want to keep the content.
All the scenarios you described can be achieved by having access to the source code, assuming you own it.
Lots of things are possible with access to source code that are still easier when someone writes a tool for that scenario.
Crawling a build you already have isn't one of them.
The website in question may be a dynamic website (e.g., WordPress, MediaWiki, or whatever other CMS or custom web app) and you either want a snapshot of it for backup, or you run it locally and want a static copy to host elsewhere, on a server that only supports static files.
> Why would someone crawl their own website?

My main use case is the docs site https://pota.quack.uy/, which Google cannot index properly. At https://www.google.com/search?q=site%3Apota.quack.uy you will see that some titles/descriptions don't match what the page is actually about. As the full site is rendered client side, via JavaScript, I can just crawl it myself and save the HTML output to actual files. Then I can serve that content with nginx or any other web server without having to do the expensive thing of SSR via Node.js. Not to mention that doing SSR with modern JavaScript frameworks is not trivial and requires engineering time.

I’m not quite understanding: you’re saying you deploy your site one way, then crawl it, then redeploy it via the zipfile you created? And why is SSR relevant to the discussion?
Modern websites execute JavaScript that renders the DOM nodes displayed in the browser.

For example, if you look at https://pota.quack.uy/ in the browser and then run `curl https://pota.quack.uy/`, do you see any of the text that is rendered in the browser in the output of the curl command?

You don't, because curl doesn't execute JavaScript, and that text comes from JavaScript. One way to fix this problem is to have a Node.js instance running that does SSR, so when your curl command connects to the server, the Node instance executes the JavaScript and streams/serves the resulting HTML to curl. (Node is running a web server.)
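
To make the curl point concrete, here is a minimal sketch of the check in Python. It pulls out the text a client that does not run JavaScript would find in the raw HTML. The two HTML strings are made up for illustration (they are not the actual markup of pota.quack.uy), and `visible_text` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects body text that is present in raw HTML without running JavaScript."""

    # Contents of these tags are not page text for our purposes.
    SKIPPED = {"script", "style", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical client-rendered shell: the content only appears after app.js runs.
SHELL_HTML = (
    '<html><head><title>Docs</title><script src="/app.js"></script></head>'
    '<body><div id="root"></div></body></html>'
)

# Hypothetical pre-rendered version of the same page.
PRERENDERED_HTML = (
    "<html><head><title>Docs</title></head>"
    '<body><div id="root"><h1>Getting started</h1>'
    "<p>Install the package.</p></div></body></html>"
)

if __name__ == "__main__":
    print(repr(visible_text(SHELL_HTML)))
    print(repr(visible_text(PRERENDERED_HTML)))
```

The shell yields no body text at all, which is roughly what curl (or a crawler that skips JavaScript) has to work with.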

Another way, without having to execute JavaScript on the server, is to crawl the site yourself, let's say on localhost (you do not even need to deploy), then upload the result to a web server that can serve the files.
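
A rough sketch of the "crawl yourself and save the files" step, in Python. This is not how the tool under discussion works internally; it only shows the idea. The `render` argument is a hypothetical callable that returns the fully rendered HTML for a URL (in practice you would back it with a headless browser), and `output_path` / `save_snapshot` are made-up names. Link discovery is left out: the list of URLs is assumed to be known.

```python
from pathlib import Path
from typing import Callable, Iterable, List
from urllib.parse import urlsplit


def output_path(url: str, out_dir: Path) -> Path:
    """Map a page URL to a file that a static web server can serve directly."""
    path = urlsplit(url).path
    # Drop empty segments and anything that could escape out_dir.
    parts = [p for p in path.split("/") if p and p not in (".", "..")]
    if parts and "." in parts[-1]:
        # Already looks like a file, e.g. /about.html
        return out_dir.joinpath(*parts)
    # Directory-style URL: /docs/intro -> docs/intro/index.html
    return out_dir.joinpath(*parts, "index.html")


def save_snapshot(
    urls: Iterable[str],
    render: Callable[[str], str],  # hypothetical: returns rendered HTML for a URL
    out_dir: Path,
) -> List[Path]:
    """Render each URL and write the resulting HTML under out_dir."""
    written = []
    for url in urls:
        target = output_path(url, out_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(render(url), encoding="utf-8")
        written.append(target)
    return written
```

The resulting directory can then be uploaded as-is to nginx or any other static host, since directory-style URLs resolve to their `index.html`.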

I want to take down a full copy of a site hosted on Squarespace before moving off of it.

I have no access to source and can't even republish the site directly without violating Squarespace's copyright.

But having the old site frozen in amber will be great for the redesign.

I think you can also screenshot full length in Chrome-based browsers; do both desktop & mobile widths.

It would be a good backup for the backup, & your designer will thank you.