by chaoxu 2066 days ago

I'm using Python and selectorlib (https://selectorlib.com/) for my workflow, since most of the webpages I crawl can be broken down into these steps:

- get to the webpage (selenium)

- do some clicks to expand certain information (selenium)

- save the html (selenium)

- and parse (selectorlib)

For me, almost everything can be done with CSS selectors or XPath. Selectorlib lets you write just a tree of CSS selectors, where the selectors in a child node apply only within the elements matched by its parent.
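
To make the "children only apply to the current selection" idea concrete, here is a rough sketch of a scoped selector tree. It is not selectorlib's API: it uses only the standard library, with ElementTree path expressions standing in for CSS selectors, and the `SELECTORS` layout, the `extract` helper, and the sample page are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical selector tree: each node has a path and, optionally, children.
# ElementTree paths stand in for CSS selectors here.
SELECTORS = {
    "heading": {"path": ".//h1"},
    "products": {
        "path": ".//div[@class='product']",
        "multiple": True,
        "children": {
            "name": {"path": ".//h2"},
            "price": {"path": ".//span[@class='price']"},
        },
    },
}


def extract(scope, tree):
    """Apply each selector relative to `scope`; children only see their parent's match."""
    result = {}
    for name, node in tree.items():
        matches = scope.findall(node["path"])
        if not node.get("multiple"):
            matches = matches[:1]
        values = []
        for match in matches:
            if "children" in node:
                # Recurse with the matched element as the new scope.
                values.append(extract(match, node["children"]))
            else:
                values.append("".join(match.itertext()).strip())
        if node.get("multiple"):
            result[name] = values
        else:
            result[name] = values[0] if values else None
    return result


# Made-up, well-formed sample page (ElementTree needs valid XML).
PAGE = """
<html><body>
  <h1>Catalog</h1>
  <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">19.50</span></div>
</body></html>
""".strip()

data = extract(ET.fromstring(PAGE), SELECTORS)
print(data)
```

Because `name` and `price` are evaluated against each matched `div` rather than the whole page, every product gets its own pair of values without any bookkeeping code.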

The nice thing is the magical browser tool of the same name, which makes the first iteration much easier. However, the browser tool's output and the Python code's output do not always match, which causes some headaches. Overall, it cut about 90% of the code and moved it into configuration.

1 comment

nsonha 2066 days ago

A lot of what selectorlib does (getting text or attributes) is achievable with XPath 1.0, which is built into browsers and testing tools. My scraping framework takes a dict of name -> XPath and returns a JSON object. This way the framework knows exactly what needs to be extracted and can stop loading the page as soon as all the information has been collected.
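
A minimal sketch of the name -> XPath idea, not the framework described above. It assumes well-formed markup and uses the standard library's ElementTree, which supports only a subset of XPath 1.0, so a trailing `/@attr` is handled by hand; a real scraper would likely use a full XPath engine. The `extract_fields` helper, the field names, and the sample markup are hypothetical, and stopping the page load early is not modeled here.

```python
import json
import xml.etree.ElementTree as ET


def extract_fields(markup, fields):
    """Map each name to the text (or attribute) its path selects, or None if absent."""
    root = ET.fromstring(markup)
    result = {}
    for name, path in fields.items():
        # ElementTree cannot select attributes, so split off a trailing "/@attr".
        elem_path, has_attr, attr = path.partition("/@")
        elem = root.find(elem_path)
        if elem is None:
            result[name] = None
        elif has_attr:
            result[name] = elem.get(attr)
        else:
            result[name] = "".join(elem.itertext()).strip()
    return result


# Hypothetical field map: the caller declares up front exactly what it needs.
FIELDS = {
    "title": ".//h1",
    "author": ".//span[@class='author']",
    "link": ".//a[@class='source']/@href",
    "score": ".//span[@class='score']",  # not present in the sample below
}

# Made-up sample markup.
MARKUP = (
    "<html><body>"
    "<h1>Scraping notes</h1>"
    "<span class='author'>someone</span>"
    "<a class='source' href='https://example.com/post'>source</a>"
    "</body></html>"
)

record = extract_fields(MARKUP, FIELDS)
as_json = json.dumps(record, indent=2)
print(as_json)
```

Since the full set of wanted names is known before parsing starts, a framework built this way can tell when every field has a value, which is what would make an early stop possible.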
