by ptasker 3036 days ago
Pretty cool, but I'd recommend that anyone wanting to do this kind of thing check out the underlying Puppeteer library. You can do some really powerful stuff with it and build a custom crawler fairly easily.

https://github.com/GoogleChrome/puppeteer
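
To give a rough idea of what "a custom crawler" can look like, here is a minimal sketch, not anything taken from the linked project. The names `sameOriginLinks` and `crawl` and the `maxPages` limit are made up for illustration; the Puppeteer calls used are `launch`, `newPage`, `goto`, `$$eval` and `close`.

```javascript
// Hypothetical helper: keep only links on the same origin as baseUrl,
// strip #fragments, and drop duplicates.
function sameOriginLinks(baseUrl, hrefs) {
  const origin = new URL(baseUrl).origin;
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, baseUrl); // resolves relative hrefs
    } catch (err) {
      continue; // skip malformed hrefs
    }
    if (url.origin !== origin) continue;
    url.hash = '';
    seen.add(url.href);
  }
  return [...seen];
}

// Hypothetical breadth-first crawler that stops after maxPages pages.
async function crawl(startUrl, maxPages = 10) {
  // Required lazily so the helper above can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const visited = new Set();
  const queue = [startUrl];
  try {
    const page = await browser.newPage();
    while (queue.length > 0 && visited.size < maxPages) {
      const url = queue.shift();
      if (visited.has(url)) continue;
      visited.add(url);
      await page.goto(url, { waitUntil: 'networkidle2' });
      // Collect the resolved href of every anchor on the page.
      const hrefs = await page.$$eval('a[href]', (as) => as.map((a) => a.href));
      for (const link of sameOriginLinks(url, hrefs)) {
        if (!visited.has(link)) queue.push(link);
      }
    }
  } finally {
    await browser.close();
  }
  return [...visited];
}
```

Calling `crawl('https://example.com')` would resolve to the list of visited URLs. A real crawler would also want error handling per page, politeness delays and robots.txt support, which this sketch leaves out.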

4 comments

Looks like this is actually built on top of puppeteer. See the "Note" under "Installation": https://github.com/yujiosaka/headless-chrome-crawler/blob/ma...
Puppeteer has some limitations. You can’t install extensions, for example.

I haven’t looked into it, but I imagine it has a pretty clear fingerprint as well, so it would be easier to block than stock Chrome in headless mode.

Unless something has changed that I missed, you can install extensions (I complained when the default args messed this up [0]). For example, I built something that uses puppeteer and an extension to capture audio and video of a tab [1]. It's just headless mode that doesn't allow extensions [2] (which I now realize is probably what you meant).

0 - https://github.com/GoogleChrome/puppeteer/issues/850
1 - https://github.com/cretz/chrome-screen-rec-poc/tree/master/a...
2 - https://bugs.chromium.org/p/chromium/issues/detail?id=706008
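
For anyone wondering what loading an extension looks like in practice, here is a rough sketch, assuming an unpacked extension directory on disk. The helper names `extensionLaunchOptions` and `launchWithExtension` are made up for illustration; `--disable-extensions-except` and `--load-extension` are regular Chrome command-line flags, and `headless: false` reflects the point above that headless mode did not allow extensions at the time of this thread.

```javascript
const path = require('path');

// Hypothetical helper: builds launch options for loading one unpacked extension.
function extensionLaunchOptions(extensionDir) {
  const dir = path.resolve(extensionDir);
  return {
    headless: false, // per the thread, extensions were not available in headless mode
    args: [
      `--disable-extensions-except=${dir}`,
      `--load-extension=${dir}`,
    ],
  };
}

// Hypothetical wrapper that launches Chrome with the extension loaded.
async function launchWithExtension(extensionDir) {
  // Required lazily so the helper above can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  return puppeteer.launch(extensionLaunchOptions(extensionDir));
}
```

Because this runs a full (non-headless) browser, a server without a display would presumably need something like a virtual framebuffer, which is outside the scope of this sketch.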

Puppeteer seems needlessly difficult to use on a VPS. I'd prefer an easily dockerized version, but nothing robust seems to exist, and sadly they make it VERY hard to connect to a Docker instance that is just running Chrome and exposing the websocket interface on port 9222.
I recently did this in a Docker container.

Let me quickly add instructions here. First, you need to install some dependencies; add the following to your Dockerfile:

  RUN apt-get update && apt-get install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
Second, launch Puppeteer with the --no-sandbox option:

  const browser = await puppeteer.launch({
    // --no-sandbox: Chrome's sandbox typically fails to start in a default Docker container
    args: ['--no-sandbox'] /*, headless: false */
  });
That should do it.
I've done this recently, actually. Take a look at the yukinying/chrome-headless-browser[0] image. You'll need to run with the SYS_ADMIN capability and raise the shm_size to 1024M (you can work around the SYS_ADMIN cap with a seccomp file, but I didn't have much luck with that). Other than that oddness it works pretty well (and with Puppeteer 1.0, with far fewer crashes).

[0]: https://github.com/yukinying/chrome-headless-browser-docker
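
On the websocket/9222 question raised above, one possible approach is sketched below, assuming a container that exposes Chrome's remote debugging port. The helper names (`rewriteWsEndpoint`, `fetchJson`, `connectToRemoteChrome`) are made up for illustration. The idea is to read `webSocketDebuggerUrl` from Chrome's `/json/version` endpoint, swap in the host and port the container is actually reachable on (Chrome usually reports a loopback address), and hand the result to `puppeteer.connect`.

```javascript
const http = require('http');

// Hypothetical helper: replace the host/port Chrome reports with the ones
// the container is reachable on from the outside.
function rewriteWsEndpoint(webSocketDebuggerUrl, host, port) {
  const url = new URL(webSocketDebuggerUrl);
  url.hostname = host;
  url.port = String(port);
  return url.href;
}

// Hypothetical helper: GET a URL and parse the response body as JSON.
function fetchJson(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        try {
          resolve(JSON.parse(body));
        } catch (err) {
          reject(err);
        }
      });
    }).on('error', reject);
  });
}

// Hypothetical wrapper: attach Puppeteer to an already running Chrome.
async function connectToRemoteChrome(host, port = 9222) {
  // Required lazily so the helpers above can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const version = await fetchJson(`http://${host}:${port}/json/version`);
  const browserWSEndpoint = rewriteWsEndpoint(version.webSocketDebuggerUrl, host, port);
  return puppeteer.connect({ browserWSEndpoint });
}
```

With this approach you would call `browser.disconnect()` rather than `browser.close()` when done, so the Chrome instance in the container keeps running. Whether Chrome accepts the request can depend on its version and on the hostname used, so treat this as a starting point only.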

Yeah, I’d really rather that people made extensions to Puppeteer rather than a whole new library.