
by jot 892 days ago

This is how I do it.

I send the URLs I want scraped to Urlbox[0]. It renders the pages and saves the HTML (plus a screenshot and metadata) to my S3 bucket[1]. I get a webhook[2] when it's ready for me to process.
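A minimal sketch of what handling that webhook might look like, using only Ruby's standard library. The payload shape here is an assumption for illustration: the field names `status` and `html_url` and the helper `html_location_from_webhook` are hypothetical, not Urlbox's documented webhook format, so check the webhook docs for the real fields.

```ruby
require "json"

# Hypothetical payload shape: "status" and "html_url" are assumed field
# names for illustration, not Urlbox's documented webhook format.
def html_location_from_webhook(body)
  payload = JSON.parse(body)
  # Only hand back a location when the render reports success.
  return nil unless payload["status"] == "succeeded"

  payload["html_url"]
rescue JSON::ParserError
  # A malformed body is treated the same as "nothing to process".
  nil
end

body = JSON.generate(
  "status" => "succeeded",
  "html_url" => "https://my-bucket.example/renders/page.html"
)
puts html_location_from_webhook(body)
```

Returning `nil` for failed or malformed notifications keeps the caller simple: it processes a page only when it gets a location back.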

I prefer Ruby, so Nokogiri[3] is the tool I use for the scraping step.

This has been particularly useful when I've wanted to scrape some pages live from a web app and didn't want to manage running Puppeteer or Playwright in production.

Disclosure: I work on Urlbox now but I also did this in the five years I was a customer before joining the team.

[0]: https://urlbox.com
[1]: https://urlbox.com/s3
[2]: https://urlbox.com/webhooks
[3]: https://nokogiri.org

1 comments

nkko 891 days ago

Does it save the whole page or just the viewport? I just checked the landing page, and it looks targeted at the specific case of saving “screenshots”. That is also apparent from the limits on the pricing page, so wouldn't it be infeasible for larger projects?

jot 891 days ago

Urlbox will save the whole page.

Its primary purpose is to render screenshots, either full-page or limited to the viewport or a single element. To do that as well as it does, the HTML has to be rendered perfectly first.

It's not as cheap as other solutions but we have customers who render millions of pages per month with us. They value the accuracy and reliability that's come from over a decade of refinements to the service.

Larger projects can request preferential pricing based on the specifics of the kinds of pages they are rendering.