by kabacha
2149 days ago
> In general, my scraping selectors usually end up hand-crafted. I've tried many of these XPath generators and even built a few myself, and still nothing matches human-built ones. The best and most stable selectors are context-aware. For example, to get a comment's text a human would build a CSS selector like `.article .comments-box .comment p::text`, and without AI involvement or training on a large sample there's no way for a generator to know this object relationship structure. This becomes especially noticeable when parsing complex webpages whose HTML can be highly dynamic. While the tree structure is often unstable, the core object relationship almost always stays the same. In other words, comment text will always be under a comment paragraph, under the comment box, under the article.
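
To make the point concrete, here is a rough sketch using only Python's standard library. The two page versions, the `banner` and `wrapper` elements, and the `comment_texts` helper are all made up for illustration. The positional path stands in for what a generator might emit, and the contextual path is an XPath approximation of the CSS selector above (it matches `class` attributes exactly, which real CSS class matching does not).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two versions of the same page. In V2 a banner and an
# extra wrapper div were added, which shifts the positional structure.
PAGE_V1 = """
<html><body>
  <div class="article">
    <h1>Title</h1>
    <div class="comments-box">
      <div class="comment"><p>First!</p></div>
      <div class="comment"><p>Nice write-up.</p></div>
    </div>
  </div>
</body></html>
"""

PAGE_V2 = """
<html><body>
  <div class="banner">Ad</div>
  <div class="article">
    <h1>Title</h1>
    <div class="wrapper">
      <div class="comments-box">
        <div class="comment"><p>First!</p></div>
        <div class="comment"><p>Nice write-up.</p></div>
      </div>
    </div>
  </div>
</body></html>
"""

# The kind of path a generator might produce: tied to exact tree positions.
POSITIONAL = "./body/div[1]/div[1]/div/p"

# Hand-written path that encodes only the object relationship:
# article -> comments box -> comment -> paragraph, at any depth.
CONTEXTUAL = (
    ".//*[@class='article']"
    "//*[@class='comments-box']"
    "//*[@class='comment']"
    "//p"
)


def comment_texts(html, path):
    """Return the text of every element matched by `path` (illustrative helper)."""
    root = ET.fromstring(html.strip())
    return [element.text for element in root.findall(path)]


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "positional:", comment_texts(page, POSITIONAL))
        print(name, "contextual:", comment_texts(page, CONTEXTUAL))
```

On these two samples the positional path only finds the comments in the first version, while the contextual one finds them in both. Real pages are rarely well-formed XML, so in practice you would swap in a proper HTML parser; the stdlib parser is used here only to keep the sketch self-contained.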