by radium3d
106 days ago
Instead of "should have been an email," this is "should have been a prompt," and it can be run locally. There are a number of ways to do this from a Linux terminal:
```
Write a custom crawler that will crawl every page on a site (internal links to the original domain only), scroll down to mimic a human, and save the output as a WebP screenshot, HTML, Markdown, and structured JSON. Design it to run locally in a terminal on a Linux machine using headless Google Chrome, and take advantage of multiple cores to process multiple pages simultaneously, keeping in mind that it might have to throttle if the server gets hit too fast from the same IP.
```
The result might use available open source software such as Python, Playwright, beautifulsoup4, Pillow, aiofiles, and trafilatura.
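For a rough idea of the core such a prompt tends to produce, here is a minimal stdlib-only sketch of the "internal links only" filter and the throttled concurrent crawl loop. The names `internal_links`, `crawl`, and `fetch` are hypothetical, not from any library; `fetch` stands in for whatever renders the page (for example a Playwright page load plus screenshot) and is assumed to return the page's raw `href` values.

```python
import asyncio
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(base_url, hrefs, root_host):
    """Resolve hrefs against base_url and keep only same-host http(s) links."""
    found = set()
    for href in hrefs:
        # Drop #fragments so the same page is not queued twice.
        url, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.hostname == root_host:
            found.add(url)
    return found


async def crawl(start_url, fetch, max_workers=4, delay=0.5):
    """Breadth-first crawl. fetch(url) is a hypothetical coroutine returning hrefs."""
    root_host = urlparse(start_url).hostname
    seen = {start_url}
    frontier = [start_url]
    gate = asyncio.Semaphore(max_workers)  # caps pages in flight at once

    async def visit(url):
        async with gate:
            hrefs = await fetch(url)
            await asyncio.sleep(delay)  # crude politeness delay per worker slot
            return internal_links(url, hrefs, root_host)

    while frontier:
        results = await asyncio.gather(*(visit(u) for u in frontier))
        frontier = []
        for links in results:
            for link in sorted(links - seen):
                seen.add(link)
                frontier.append(link)
    return seen
```

Lowering `max_workers` or raising `delay` is the simplest throttle if the server starts pushing back; a real run would also want retries and backoff on error responses.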
You'll still be hand-rolling it if you want to disrespect crawling requirements though.
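For context, "crawling requirements" here usually means `robots.txt`. A minimal sketch of checking it with Python's standard `urllib.robotparser` follows; the helper `build_robots`, the `my-crawler` user agent, and the sample rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Sample rules, invented for this example.
rules = build_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")

print(rules.can_fetch("my-crawler", "https://example.com/private/page"))  # False
print(rules.can_fetch("my-crawler", "https://example.com/public"))        # True
print(rules.crawl_delay("my-crawler"))                                    # 2
```

A polite crawler would skip any URL where `can_fetch` is false and use `crawl_delay` (when present) as its minimum wait between requests.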