by 1vuio0pswjnm7
1205 days ago
Please share some examples of webpages with data to be wrangled that support this statement: "The reality is that shit is hard, doesn't scale (classic blocking for-loop or async saturation), and comes with thorny maintenance/security issues." Every web user's needs are different. One person might have a task that they struggle to accomplish, while another might have one that presents no major challenges. As a web user, I transform web pages to CSV or SQL. I log HTTP and network requests. I do this for free using open source software. No web browser needed. No docker image needed. Works on both Linux and BSD. For me, the web is a dataset from which I retrieve data/information. "Tech" companies want the web to be more like a video game, with visuals and constant interactivity.
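To make "transform web pages to CSV" concrete, here is a minimal sketch using only the Python standard library. The comment does not say which tools are actually used, so this is an illustration and not the commenter's method; `TableToRows`, `html_table_to_csv`, and `SAMPLE_HTML` are hypothetical names, and the sample markup is made up.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (indentation, captions) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell is one clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample page fragment, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>price</th></tr>
  <tr><td>widget</td><td>1,50</td></tr>
</table>
"""
```

Calling `html_table_to_csv(SAMPLE_HTML)` yields one CSV line per table row, with the `csv` module quoting any cell that contains a comma. Nested tables and `rowspan`/`colspan` are not handled here.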
|
Related to the quote, we've seen interest in API data wrangling, where prepackaged data feeds can be cumbersome to edit, or where other implementation details become challenging: credential management, domain throttling, scheduling, checkpointing, export, etc.
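As one illustration of why something like domain throttling ends up being its own piece of work, here is a rough sketch of a per-domain rate limiter in Python. It is not how crul implements throttling; `DomainThrottle` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # domain -> earliest time of the next request

    def wait(self, url):
        """Block until `url`'s domain may be hit again; return seconds waited."""
        domain = urlparse(url).netloc
        now = self._clock()
        ready_at = self._next_allowed.get(domain, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # The next request to this domain must wait a full interval from now.
        self._next_allowed[domain] = now + delay + self.min_interval
        return delay
```

A caller would run `throttle.wait(url)` right before each request. Requests to different domains do not delay each other, which is the part a plain `time.sleep` in a for-loop gets wrong.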
It's also interesting for webpage data when you need to use a particular page as an index of links to filter and crawl. We've tried to build an abstraction layer around that.
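For readers who want a picture of the "page as an index of links to filter and crawl" pattern, a minimal standard-library sketch might look like the following. This is not crul's abstraction layer; `LinkCollector`, `links_to_crawl`, and the `path_prefix` filter are hypothetical, and fetching the selected pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute, fragment-free URLs from every <a href> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the index page, drop "#fragment".
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


def links_to_crawl(index_html, base_url, path_prefix):
    """Return unique same-host links whose path starts with `path_prefix`."""
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    collector.close()
    host = urlparse(base_url).netloc
    seen = set()
    selected = []
    for link in collector.links:
        parts = urlparse(link)
        if parts.netloc != host or not parts.path.startswith(path_prefix):
            continue
        if link not in seen:
            seen.add(link)
            selected.append(link)
    return selected
```

The returned list keeps the order in which links appear on the index page, so it can be fed straight into whatever fetch loop or scheduler comes next.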
Initially, we were mainly focused on webpages. We wanted to bypass the visuals of the browser and use a headless browser to fulfill network requests, render JS, etc., then convert the page to a flat table of enriched elements. With APIs as another data set, there is some work for us to do around language.
We're now trying to figure out which workflows are most relevant for crul to optimize around, since, honestly, we just built what we thought was cool. Some features/workflows will certainly be more straightforward with existing tools and software, especially for a technically savvy user.