by jot
845 days ago
This is how I do it: I send the URLs I want scraped to Urlbox[0], which renders the pages and saves the HTML (plus a screenshot and metadata) to my S3 bucket[1]. I get a webhook[2] when the result is ready for me to process. I prefer Ruby, so Nokogiri[3] is the tool I use for the scraping step. This has been particularly useful when I've wanted to scrape pages live from a web app without managing Puppeteer or Playwright in production. Disclosure: I work on Urlbox now, but I also did this during the five years I was a customer before joining the team.

[0]: https://urlbox.com
[1]: https://urlbox.com/s3
[2]: https://urlbox.com/webhooks
[3]: https://nokogiri.org