| HN Mirror


by djhn 1204 days ago

I played around with this trying it on a few tricky cases. At least for the initial step of getting the data in a tidy format, it wasn't immediately obvious to me how crul could speed up my workflow.

1. Many JSON endpoints (especially ones not meant for public consumption) return a somewhat deeply nested list, and somewhere in that list are the individual items. Can your tool speed up (or automagically solve) getting just the items? I found that it just gives me thousands of columns (and truncates the results), where I would have wanted, say, two thousand rows of 23 columns. I couldn't wrangle the JSON within crul.

2. Many smaller sites still use WordPress, and may manually lay out items in a visual hierarchy that isn't immediately obvious in the structure of the HTML. Then you have to go in and parse every "row" and every "column" using XPath or CSS selectors. Crul wasn't particularly helpful for this either.

3. A scraping workflow requiring authentication, post request for searching, get request for picking the correct result, post request for downloading pdf and parsing said pdf... well, I couldn't get this to work at all.
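To make point 1 concrete, here is a minimal sketch in plain Python (not crul) of what "getting just the items" could look like. It assumes the items are the longest list of objects anywhere in the response; the function names and the sample payload are made up for illustration.

```python
import json


def find_record_lists(node, path=()):
    """Yield (path, list) for every non-empty list of dicts found in node."""
    if isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            yield path, node
        for index, item in enumerate(node):
            yield from find_record_lists(item, path + (index,))
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from find_record_lists(value, path + (key,))


def largest_record_list(payload):
    """Return (path, records) for the longest list of dicts, or (None, [])."""
    candidates = list(find_record_lists(payload))
    if not candidates:
        return None, []
    return max(candidates, key=lambda pair: len(pair[1]))


# Hypothetical response: the real items are buried a few levels down.
raw = (
    '{"meta": {"page": 1}, "data": {"results": [{"wrapper": {"items": ['
    '{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}'
    ']}}]}}'
)
path, records = largest_record_list(json.loads(raw))
print(path)          # ('data', 'results', 0, 'wrapper', 'items')
print(len(records))  # 3
```

The "longest list wins" heuristic is only a guess at where the items live; printing every candidate path from `find_record_lists` and picking one by hand is the safer route on an unfamiliar endpoint.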

Hope you'll end up solving such problems automagically and I end up paying you for it.

2 comments

portInit 1204 days ago

Appreciate you taking the time to play around with crul and share your thoughts! They're incredibly valuable.

1. Although it has limitations, were you able to try the normalize command? https://www.crul.com/docs/queryconcepts/api-normalization

2. Will need to think about this some more.

3. Although we don't yet handle PDFs, the rest of the flow is one we are aiming to accomplish with crul. Other than PDF, the pieces should be there, and we would love to understand this further.


djhn 1204 days ago

1: It does expand a level of hierarchy if you already know what you're looking for (from manually getting the data). Is there a way to omit columns that have more levels or keep them as list columns?

2: Probably too niche to be worth it for you :)

3: Parsing pdfs automagically isn't easy, but handling downloads and images and storing them in a bucket would go quite far (I guess there's a way to do that, but I didn't immediately see a "simple" example).
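For the question in point 1 (omit deeper columns or keep them as list columns), this is roughly the behaviour being asked for, sketched in plain Python rather than crul. The `flatten` helper, its `max_level` and `nested` parameters, and the sample record are all hypothetical.

```python
def flatten(record, max_level=1, nested="keep", sep="."):
    """Flatten nested dicts in record, expanding at most max_level levels.

    Lists are never expanded. With nested="keep", lists and dicts deeper
    than max_level stay in a single cell (a "list column"). With
    nested="drop", those cells are omitted entirely.
    """
    flat = {}

    def walk(value, prefix, level):
        if isinstance(value, dict) and level <= max_level:
            for key, child in value.items():
                name = f"{prefix}{sep}{key}" if prefix else key
                walk(child, name, level + 1)
        elif isinstance(value, (dict, list)) and nested == "drop":
            return
        else:
            flat[prefix] = value

    walk(record, "", 0)
    return flat


# Hypothetical item with one nested object and one list.
item = {
    "id": 7,
    "author": {"name": "x", "links": {"self": "/a/7"}},
    "tags": ["a", "b"],
}

kept = flatten(item)
# {'id': 7, 'author.name': 'x', 'author.links': {'self': '/a/7'}, 'tags': ['a', 'b']}
dropped = flatten(item, nested="drop")
# {'id': 7, 'author.name': 'x'}
```

Either mode keeps the column count bounded by the chosen depth, which is the difference between 23 columns and thousands.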

It sounds like it's potentially a great tool. The only open question is whether it's worth studying the docs and implementing a process in crul, as opposed to using whatever language I'm already familiar with.


RobotToaster 1204 days ago

Doesn't WordPress have a built-in API?
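For reference, WordPress core does ship a REST API, and the posts collection normally lives under `/wp-json/wp/v2/posts`. A minimal sketch of paging through it with the Python standard library follows; it assumes the site has not disabled or restricted the API, and `example.com` and the `example-scraper` user agent are placeholders.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen


def posts_url(site, page=1, per_page=100):
    """Build the URL for one page of the core posts collection."""
    query = urlencode(
        {"per_page": per_page, "page": page, "_fields": "id,link,title"}
    )
    return urljoin(site.rstrip("/") + "/", "wp-json/wp/v2/posts") + "?" + query


def fetch_posts(site):
    """Yield posts page by page, using the X-WP-TotalPages response header."""
    page, total_pages = 1, 1
    while page <= total_pages:
        request = Request(
            posts_url(site, page), headers={"User-Agent": "example-scraper"}
        )
        with urlopen(request, timeout=30) as response:
            total_pages = int(response.headers.get("X-WP-TotalPages", "1"))
            yield from json.load(response)
        page += 1


# Building a URL needs no network access; fetch_posts() does.
print(posts_url("https://example.com/", page=2, per_page=50))
# https://example.com/wp-json/wp/v2/posts?per_page=50&page=2&_fields=id%2Clink%2Ctitle
```

Note that the API returns post bodies as rendered HTML, so content laid out in a page builder would still need HTML parsing afterwards.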


djhn 1204 days ago

If you have a link to some high-quality WordPress API hacks/dorks for scraping, I'm all ears. I think my problem pages are usually made in some sort of page builder, like Elementor, and the content is a static soup of HTML like it came straight out of FrontPage 2003.
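As an illustration of the manual "row" and "column" parsing this kind of page needs, here is a rough sketch using only Python's `html.parser`. The `row` and `col` class names and the sample markup are invented; a real page builder will use its own class names, and nested rows or columns would need more careful handling than this.

```python
from html.parser import HTMLParser


class RowColumnParser(HTMLParser):
    """Collect text from <div class="col"> cells, grouped by <div class="row">."""

    def __init__(self, row_class="row", col_class="col"):
        super().__init__()
        self.row_class = row_class
        self.col_class = col_class
        self.rows = []     # one list of cell strings per row
        self._stack = []   # role of each open <div>: "row", "col" or None
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.row_class in classes:
            self._stack.append("row")
            self.rows.append([])
        elif self.col_class in classes and "row" in self._stack:
            self._stack.append("col")
            self._cell = []
        else:
            self._stack.append(None)

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        role = self._stack.pop()
        if role == "col" and self._cell is not None:
            self.rows[-1].append(" ".join(self._cell))
            self._cell = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._cell is not None:
            self._cell.append(text)


def parse_rows(html):
    """Return the page's rows as lists of cell text."""
    parser = RowColumnParser()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """
<div class="wrapper">
  <div class="row"><div class="col"><b>Alice</b></div><div class="col">42</div></div>
  <div class="row"><div class="col">Bob</div><div class="col"><span>17</span> pts</div></div>
</div>
"""
print(parse_rows(SAMPLE))  # [['Alice', '42'], ['Bob', '17 pts']]
```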
