I _had_ no good answer for the Google News result until you prompted me to Inspect source just now...
I'm basically scanning for <a> tags and searching the text within. Doing a Google News inspect, it appears that their links actually have no text, but are sibling elements of an <h#> tag. So, I need to figure out how to parse that correctly...
> Doing a Google News inspect, it appears that their links actually have no text, but are sibling elements of an <h#> tag. So, I need to figure out how to parse that correctly...
I just checked Google News myself, and you are correct that the sibling <h#> tag has the text. However, the <a> tag with the link has it too, just as an attribute rather than as nested text. Unless I am mistaken about how that attribute is used here, you can simply extract the text from the aria-label attribute of the <a> tag.
And in case you want to proceed with parsing text from the sibling <h#> tag instead, you can just get the list of child nodes of the parent <article> tag (yourAnchorTagNode.parentNode.parentNode.children; had to do a double .parentNode, because the <a> tag is wrapped in a single <div> tag) and then search for the only <h#> tag there. That will be your target tag with the text.
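For what it's worth, here is a rough sketch of that sibling lookup. It is written in Python with the standard library's html.parser rather than against a browser DOM, so Node, TreeBuilder, iter_nodes and heading_text_for are all made-up names for illustration, and the <article> / <div> / <a> layout is just the one described above, which Google News could change at any time.

```python
import re
from html.parser import HTMLParser

HEADING = re.compile(r"h[1-6]")
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class Node:
    """Minimal stand-in for a DOM element (hypothetical helper)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = attrs
        self.parent = parent
        self.children = []
        self.text = ""  # direct text only, not text of descendants


class TreeBuilder(HTMLParser):
    """Builds a naive element tree; assumes reasonably well-formed nesting."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", {})
        self._open = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs), self._open)
        self._open.children.append(node)
        if tag not in VOID_TAGS:
            self._open = node

    def handle_endtag(self, tag):
        # Only close the element if it is the one currently open.
        if self._open.parent is not None and self._open.tag == tag:
            self._open = self._open.parent

    def handle_data(self, data):
        self._open.text += data


def iter_nodes(node):
    """Depth-first walk over a node and all of its descendants."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def heading_text_for(anchor):
    """Mirror anchor.parentNode.parentNode.children, then find the <h#>."""
    wrapper = anchor.parent                      # the wrapping <div>
    container = wrapper.parent if wrapper else None  # the <article>
    if container is None:
        return None
    for child in container.children:
        if HEADING.fullmatch(child.tag):
            return child.text.strip()
    return None
```

You would feed the page HTML into a TreeBuilder, pick out the <a> nodes with iter_nodes, and call heading_text_for on each. Note that it only reads text sitting directly inside the heading, so a heading that wraps its text in a <span> would need a bit more work.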
Practically speaking, you might actually sorta get away with it by using a single if-check, as long as you go with the aria-label approach instead of the <h#> sibling node search.
My logic is that it is very unlikely that another website will copy the exact HTML layout of Google News, so the <h#> approach is only going to work there. But I bet that Google News is far from the only website that puts the article title text in the aria-label attribute of the <a> tag.
So you can cover a heavy majority of the websites you care about (if not all of them) by just checking both the inner text and (in case the inner text is absent) the aria-label attribute. No need for any custom logic implemented just for Google News, as the fallback would likely solve this issue for a lot of other sources too.
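A minimal sketch of that single if-check, assuming the scanner is in Python and built on the standard library's html.parser (I don't know what the actual scanner uses, so LinkTextParser and extract_links are hypothetical names):

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects (href, text) pairs, using aria-label when a link has no inner text."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._current = None  # state for the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            self._current = {
                "href": attr_map.get("href"),
                "aria": attr_map.get("aria-label"),
                "text": [],
            }

    def handle_data(self, data):
        # Text inside nested tags (e.g. <a><span>..</span></a>) lands here too.
        if self._current is not None:
            self._current["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse whitespace so a link with only newlines counts as empty.
            inner = " ".join("".join(self._current["text"]).split())
            # The single if-check: inner text first, aria-label as the fallback.
            text = inner or (self._current["aria"] or "").strip()
            self.links.append((self._current["href"], text))
            self._current = None


def extract_links(html):
    """Return a list of (href, text) pairs for every <a> in the given HTML."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

The keyword search then stays the same as before, something like checking "elon" in text.lower() for each pair, and nothing in it is specific to Google News.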
Calendars for a number of news sites and aggregators showing, per their methodology, when "elon" appeared on their front pages. I have to say that a couple of the results seem suspect to me, so I question the methodology.
For one, this article is now on the front page of Hacker News, and it reports four days since the last Elon on Hacker News :D Might just be that it hasn’t been on the front page long?
Also, maybe you shouldn't be counting news aggregators like Google News? It's basically double counting, since the story is already on some other site.