by santa_boy 2306 days ago
I'm using something similar, I believe. I simply wrote a Puppeteer-automated browser that goes through every page and saves it as `.mhtml`. This works quite well for my purpose. I was archiving a site with content that I pay for and that sits behind my login. I often use material from it when I'm offline, hence the need to put together this hack.

The code below does the job of saving a page as a single file.

```

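        // Assumes `fs` (Node's built-in module) and a `slugify` helper are imported
        // elsewhere, and that `this.browser` is an already-launched Puppeteer browser.
        const page = await this.browser.newPage()
        const response = await page.goto(url, { timeout: 50000 })

        // goto() can resolve with null, so guard before reading the status
        if (response && response.status() === 404) {
            await page.close()
            throw new Error('not found')
        }

        // credit: https://stackoverflow.com/questions/54814323/puppeteer-how-to-download-entire-web-page-for-offline-use
        // Ask the Chrome DevTools Protocol for an MHTML snapshot of the loaded page
        const cdp = await page.target().createCDPSession()
        const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' })

        // One .mhtml file per URL, named after a slug of the URL
        const htmlFilename = './data/' + slugify(url) + '.mhtml'
        fs.writeFileSync(htmlFilename, data)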
```
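
For anyone who wants to try this outside my class, here is a rough standalone sketch of the same idea. It is not my exact code: `savePageAsMhtml`, `snapshotPath`, and the little `slugify` below are hypothetical names made up for illustration (the real thing uses whatever slugify helper you prefer), and the function assumes you pass in a browser you launched yourself with Puppeteer.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical stand-in for a slugify helper: turns a URL into a
// filesystem-friendly name. Swap in your own library if you have one.
function slugify(url) {
  return url
    .replace(/^https?:\/\//, '')     // drop the protocol
    .replace(/[^a-zA-Z0-9]+/g, '-')  // collapse everything else into dashes
    .replace(/^-+|-+$/g, '')         // trim leading/trailing dashes
    .toLowerCase();
}

// Hypothetical helper: where the snapshot for a given URL is stored.
function snapshotPath(url, dir = './data') {
  return path.join(dir, slugify(url) + '.mhtml');
}

// Saves one page as a single .mhtml file and returns the file path.
// `browser` is assumed to be a Puppeteer Browser created by the caller.
async function savePageAsMhtml(browser, url, dir = './data') {
  const page = await browser.newPage();
  try {
    const response = await page.goto(url, { timeout: 50000 });
    if (response && response.status() === 404) {
      throw new Error('not found');
    }

    // Same CDP call as above: capture the loaded page as MHTML
    const cdp = await page.target().createCDPSession();
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });

    fs.mkdirSync(dir, { recursive: true }); // make sure the output folder exists
    const file = snapshotPath(url, dir);
    fs.writeFileSync(file, data);
    return file;
  } finally {
    await page.close(); // always release the tab, even on errors
  }
}
```

To archive a whole site you would launch Puppeteer once, log in if needed, and then call `savePageAsMhtml(browser, url)` in a loop over the URLs you care about. Taking the browser as a parameter keeps the login and crawl logic separate from the saving step.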