by landric 1212 days ago
I _had_ no good answer for the Google News result until you prompted me to Inspect source just now...

I'm basically scanning for <a> tags and searching the text within. Doing a Google News inspect, it appears that their links actually have no text, but are sibling elements of an <h#> tag. So, I need to figure out how to parse that correctly...


> Doing a Google News inspect, it appears that their links actually have no text, but are sibling elements of an <h#> tag. So, I need to figure out how to parse that correctly...

I just checked Google News myself, and you are correct that the sibling <h#> tag has the text. However, the <a> tag with the link has it too, as an attribute rather than as nested text. Unless I am mistaken about how that attribute is used here, you can just extract the text from the aria-label attribute of the <a> tag.

And in case you want to parse the text from the sibling <h#> tag instead, you can get the list of child nodes of the parent <article> tag (yourAnchorTagNode.parentNode.parentNode.children; the double .parentNode is needed because the <a> tag is wrapped in a single <div> tag) and then search for the only <h#> tag there. That will be your target tag with the text.
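For what it's worth, here is a rough sketch of that sibling lookup outside the browser, using Python's standard xml.etree.ElementTree. The markup in SAMPLE is a made-up, simplified stand-in for the layout described above (not the real Google News page, which may not even parse as well-formed XML), and heading_text_for_links is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: <article> holds a <div>-wrapped <a>
# plus a sibling heading. The real page is larger and may not be valid XML.
SAMPLE = """
<main>
  <article>
    <div><a href="./articles/abc" aria-label="Example headline"></a></div>
    <h3>Example headline</h3>
  </article>
</main>
"""

HEADING = re.compile(r"^h[1-6]$")


def heading_text_for_links(root):
    """Return (href, heading text) for each <a> whose grandparent has a heading child."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for a in root.iter("a"):
        wrapper = parent_of.get(a)  # the <div> around the link
        container = parent_of.get(wrapper) if wrapper is not None else None
        if container is None:
            continue
        # Equivalent of yourAnchorTagNode.parentNode.parentNode.children
        for sibling in container:
            if HEADING.match(sibling.tag):
                text = "".join(sibling.itertext()).strip()
                results.append((a.get("href"), text))
                break
    return results


root = ET.fromstring(SAMPLE.strip())
print(heading_text_for_links(root))
```

Note that the two-level climb is hardcoded, so this only matches pages that wrap the link in exactly one element.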

Yep, that's right.

I was _hoping_ to get away with the same xml-parsing for each site, but I guess I'll need to customize.

Practically speaking, you might actually sorta get away with it by using a single if-check, as long as you go with the aria-label approach instead of the <h#> sibling node search.

My logic is that it is very unlikely that another website will copy the exact HTML layout of Google News, so the <h#> sibling search is only going to work there. But I bet that Google News is far from the only website that puts the article title text in the aria-label attribute of the <a> tag.

So you can cover the vast majority of the websites you care about (if not all of them) by just checking both the inner text and (in case the inner text is absent) the aria-label attribute. No need for any custom logic implemented just for Google News, since the same check would likely solve this issue for a lot of other sources.
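A minimal sketch of that single if-check, assuming Python and its standard html.parser module (I don't know what you are actually parsing with). LinkTitleParser, extract_link_titles and the SAMPLE markup are all made-up names and data for illustration.

```python
from html.parser import HTMLParser


class LinkTitleParser(HTMLParser):
    """Collects (href, title) pairs, preferring inner text over aria-label."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # state for the <a> currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            self._current = {
                "href": attr_map.get("href"),
                "label": attr_map.get("aria-label"),
                "text": [],
            }

    def handle_data(self, data):
        # Accumulate any text nested inside the open <a>.
        if self._current is not None:
            self._current["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            text = "".join(self._current["text"]).strip()
            # The single if-check: use aria-label only when inner text is absent.
            title = text if text else (self._current["label"] or "").strip()
            self.links.append((self._current["href"], title))
            self._current = None


def extract_link_titles(html):
    parser = LinkTitleParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup: one ordinary link, one Google-News-style empty link.
SAMPLE = (
    '<a href="/plain">Ordinary headline</a>'
    '<div><a href="/gn" aria-label="Labelled headline"></a></div>'
)
print(extract_link_titles(SAMPLE))
```

Links with neither inner text nor an aria-label come back with an empty title, so you can filter those out or handle them however suits your search.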