by djhn
1204 days ago
I played around with this, trying it on a few tricky cases. At least for the initial step of getting the data into a tidy format, it wasn't immediately obvious to me how crul could speed up my workflow.
1. Many JSON endpoints (especially ones not meant for public consumption) return a somewhat deeply nested list, and somewhere in that list are the individual items. Can your tool speed up (or automagically solve) getting just the items? I found that it just gives me thousands of columns (and truncates the results), where I would have wanted, say, two thousand rows of 23 columns. I couldn't wrangle the JSON within crul.
2. Many smaller sites still use WordPress, and may manually lay out items in a visual hierarchy that isn't immediately obvious in the structure of the HTML. Then you have to go in and parse every "row" and every "column" using XPath or CSS selectors. Crul wasn't particularly helpful for this either.
3. A scraping workflow requiring authentication, a POST request for searching, a GET request for picking the correct result, a POST request for downloading a PDF, and parsing said PDF... well, I couldn't get this to work at all.
Hope you'll end up solving such problems automagically and I end up paying you for it.
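To make point 1 concrete, here is a minimal Python sketch of the manual step being described: locating the item list inside a nested JSON response and turning each item into one row. This is not crul syntax. The payload, the function names (`iter_record_lists`, `find_items`, `flatten`), and the "longest list of objects wins" heuristic are all assumptions made up for illustration.

```python
from typing import Any, Iterator


def iter_record_lists(node: Any, path: str = "$") -> Iterator[tuple[str, list[dict]]]:
    """Yield (path, list) for every non-empty list made up only of dicts."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_record_lists(value, f"{path}.{key}")
    elif isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            yield path, node
        # Keep descending: item lists can be nested inside other lists.
        for index, item in enumerate(node):
            yield from iter_record_lists(item, f"{path}[{index}]")


def find_items(payload: Any) -> tuple[str, list[dict]]:
    """Heuristic (an assumption): the longest list of dicts is the item list."""
    candidates = list(iter_record_lists(payload))
    if not candidates:
        raise ValueError("no list of objects found in payload")
    return max(candidates, key=lambda pair: len(pair[1]))


def flatten(item: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted column names; lists are kept as-is."""
    row = {}
    for key, value in item.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


# Made-up payload standing in for a deeply nested endpoint response.
payload = {
    "meta": {"status": "ok", "tags": [{"id": 1}]},
    "data": {"results": {"page": 1, "items": [
        {"id": 10, "name": "a", "price": {"amount": 5, "currency": "EUR"}},
        {"id": 11, "name": "b", "price": {"amount": 7, "currency": "EUR"}},
    ]}},
}

path, items = find_items(payload)
rows = [flatten(item) for item in items]  # one row per item, one column per leaf
```

The heuristic will pick the wrong list whenever some unrelated list happens to be longer than the item list, so in practice the reported `path` is worth checking before trusting the rows.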
|
1. Although it has limitations, were you able to try the normalize command? https://www.crul.com/docs/queryconcepts/api-normalization
2. Will need to think about this some more.
3. Although we don't yet handle PDFs, the rest of the flow is one we are aiming to accomplish with crul. Other than the PDF step, the pieces should be there, and we would love to understand this further.