by unlog 741 days ago
> Why would someone crawl their own website?

My main use case is the docs site https://pota.quack.uy/ , which Google cannot index properly. At https://www.google.com/search?q=site%3Apota.quack.uy you will see that some titles/descriptions don't match what the page is actually about. Since the whole site is rendered client side via JavaScript, I can just crawl it myself and save the HTML output to actual files. Then I can serve that content with nginx or any other web server, without having to do the expensive thing of SSR via Node.js. Not to mention that doing SSR with modern JavaScript frameworks is not trivial and requires engineering time.
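
To make the "save the HTML output to actual files" step concrete, here is a minimal Python sketch of one way to map a crawled URL to a file on disk. The function names, the `dist` output directory, and the example paths are all made up for illustration and are not from the actual crawler. It assumes the web server resolves a directory request to that directory's `index.html`, which is a common default.

```python
from pathlib import Path
from urllib.parse import urlsplit


def output_path(url: str, out_dir: str = "dist") -> Path:
    """Map a crawled URL to the file a static web server would look up.

    Hypothetical helper: "dist" and the index.html convention are assumptions.
    """
    # Keep only real path segments; dropping "." and ".." keeps writes inside out_dir.
    parts = [p for p in urlsplit(url).path.split("/") if p not in ("", ".", "..")]
    if parts and "." in parts[-1]:
        # Heuristic: a last segment with a dot already names a file (e.g. /about.html).
        return Path(out_dir, *parts)
    # Otherwise treat the URL as a directory and store the page as its index.html.
    return Path(out_dir, *parts, "index.html")


def save_page(url: str, html: str, out_dir: str = "dist") -> Path:
    """Write the rendered HTML for `url` under `out_dir` and return the file path."""
    target = output_path(url, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target
```

With this layout, a page crawled at a path like `/docs/intro` ends up in `dist/docs/intro/index.html`, and the whole `dist` directory can be uploaded as plain static files.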


I’m not quite understanding: you’re saying you deploy your site one way, then crawl it, then redeploy it via the zipfile you created? And why is SSR relevant to the discussion?
Modern websites execute JavaScript that renders the DOM nodes displayed in the browser.

For example, if you look at https://pota.quack.uy/ in the browser and then run `curl https://pota.quack.uy/`, do you see any of the text rendered in the browser in the output of the curl command?

You don't, because curl doesn't execute JavaScript, and that text comes from JavaScript. One way to fix this is to have a Node.js instance running that does SSR: when your curl command connects to the server, Node (acting as the web server) executes the JavaScript and streams/serves the resulting HTML to curl.
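
A rough way to see the difference without a browser is to compare the body text a non-JavaScript client could read from the raw HTML with the text of the rendered DOM. The Python sketch below uses only the standard library. The two sample documents are made up for illustration (they are not the real markup of the site), and `visible_text` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text a non-JavaScript client could read from the page body."""

    # Content of these elements is never shown as body text.
    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up samples: what a client-rendered shell might look like over curl,
# versus the same page after JavaScript has filled in the DOM.
SHELL = (
    '<!doctype html><html><head><title>docs</title>'
    '<script type="module" src="/app.js"></script></head>'
    '<body><div id="app"></div></body></html>'
)
RENDERED = (
    '<html><body><div id="app"><h1>Hello</h1>'
    '<p>Docs text</p></div></body></html>'
)

print(repr(visible_text(SHELL)))
print(repr(visible_text(RENDERED)))
```

The shell has no readable body text at all, which is roughly what any client that does not run JavaScript has to work with.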

Another way, which avoids executing JavaScript on the server, is to crawl the site yourself, say on localhost (you do not even need to deploy), and then upload the result to a web server that can serve the files.
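
As a rough illustration of the "crawl yourself" idea, here is a minimal Python sketch of the crawl loop. The `render` argument is a hypothetical callable that returns the fully rendered HTML for a URL; in practice it would wrap a headless browser, which is left out here. The names and the `http://localhost:3000/` address in the usage note are assumptions, not the actual tool.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url: str, render: Callable[[str], str]) -> Dict[str, str]:
    """Breadth-first crawl of same-host pages, returning {url: rendered HTML}.

    `render` is a stand-in (assumption) for whatever executes the page's
    JavaScript and returns the resulting HTML.
    """
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages: Dict[str, str] = {}
    while queue:
        url = queue.popleft()
        html = render(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Follow only links on the same host that have not been queued yet.
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Calling something like `crawl("http://localhost:3000/", render)` against a local dev server gives you a dictionary of rendered pages, which you can then write to disk and upload as static files.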