| HN Mirror


	by juujian 1182 days ago
	I have done a lot of scraping in the past. Cookies are a pain, and this is a really elegant solution. Of course, the biggest problem is that everything interesting is hidden away behind JavaScript these days, and then you have to resort to Selenium and the whole thing just spirals out of control. But I'm looking forward to giving this a shot for non-JavaScript content in the future. Edit: JavaScript, not Java.

5 comments

mkl 1182 days ago

Do you mean JavaScript? I have never run into content hidden by Java, but many pages load content dynamically using JavaScript.

I have found it's quite easy to snoop on those JavaScript API requests using the Network tab of Chrome Devtools, then copy the network request as a curl command for bash scripts or as JavaScript for browser extensions.
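
A minimal sketch of replaying such a request outside the browser, using only Python's standard library. The endpoint URL, the `session` cookie, and the helper names `build_api_request` and `fetch_json` are hypothetical placeholders; the real values would come from whatever the Network tab shows for the site in question.

```python
import json
import urllib.request


def build_api_request(url, cookies=None, extra_headers=None):
    """Build a request that mimics an XHR/fetch call seen in the Network tab."""
    headers = {"Accept": "application/json"}
    headers.update(extra_headers or {})
    if cookies:
        # A Cookie header is a list of name=value pairs separated by "; ".
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return urllib.request.Request(url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body (performs real network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and cookie: substitute the values copied from DevTools.
req = build_api_request(
    "https://example.com/api/items?page=1",
    cookies={"session": "abc123"},
)
```

Calling `fetch_json(req)` would then return the parsed response, assuming the endpoint accepts the same headers the browser sent.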

tomashubelbauer 1182 days ago

> I have never run into content hidden by Java

Tongue in cheek: You'd never know - servers running Java code generating HTML pages have probably conditionally not-rendered many pieces of HTML that you've never come across in your browsing :)

ghqst 1182 days ago

Yeah, you can sometimes find the API or find data sent in JavaScript but not in prerendered HTML, which can save you the pain of headless scraping.
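
As a rough illustration of the "data sent in JavaScript" case, some pages embed their data as JSON inside a `<script type="application/json">` tag. The sketch below pulls such payloads out with the standard-library HTML parser. The sample markup and the names `JsonScriptExtractor` and `extract_embedded_json` are made up for the example; real pages may use a different tag, attribute, or an inline JavaScript assignment instead.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies arrive here as raw text, possibly in several chunks.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.payloads.append(json.loads("".join(self._buffer)))
            self._in_json_script = False


def extract_embedded_json(html):
    """Return a list with one parsed object per JSON script tag in the page."""
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.payloads


# Made-up page: an empty app container plus the data the JavaScript would render.
sample = """
<html><body>
<div id="app"></div>
<script type="application/json">{"items": [{"id": 1, "title": "hello"}]}</script>
</body></html>
"""
data = extract_embedded_json(sample)
```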

juujian 1182 days ago

I do mean JavaScript. Not sure how many times I have made that mistake... And great advice, that sounds like a neat approach.

1vuio0pswjnm7 1181 days ago

The term "everything interesting" is of course subjective. What is interesting to person A might not be interesting to person B. I never use Selenium and I generally have no problem accessing "everything interesting". The simplest example is reading and submitting HN comments. Presumably we all find this interesting enough. JavaScript is not required to read, vote on, or submit to HN.

What if the phrase "everything interesting" were replaced with specific examples and questions? Something like, "I cannot access X without JavaScript. How do I access X without using JavaScript?"

RockRobotRock 1181 days ago

HN is in the minority of websites in that it works completely without JS. Surely you're aware of this, right?

1vuio0pswjnm7 1181 days ago

1. Define "works".

2. Provide examples of sites that do not "work".

It's possible that people might disagree on the definition of "works". For example, perhaps web developers might be biased toward a definition that puts them in control instead of the user. If I can retrieve information from a server with HTTP requests then the website "works" for me. As a user, I certainly do not need to use Javascript to make HTTP requests. Nor do I need to use a particular client.

One could argue that even HN does not "work" completely without Javascript. For example, the script at https://news.ycombinator.com/hn.js will not run.

totetsu 1182 days ago

There are Python libraries that import cookies directly from wherever your browser stores them, for use in Selenium projects.
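
The libraries mentioned here are third-party and are not shown. As a related standard-library sketch, the example below swaps in a different technique: loading cookies that were exported from a browser in the Netscape `cookies.txt` format via `http.cookiejar.MozillaCookieJar`. The cookie values and the helper name `opener_from_cookie_file` are made up for illustration, and the temporary file stands in for a real export.

```python
import http.cookiejar
import os
import tempfile
import urllib.request


def opener_from_cookie_file(path):
    """Load a Netscape-format cookies.txt and return (opener, jar)."""
    jar = http.cookiejar.MozillaCookieJar(path)
    # Keep session cookies and expired ones so the export is loaded as-is.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Stand-in for a browser export. Fields are tab-separated:
# domain, subdomain flag, path, secure flag, expiry, name, value.
cookie_text = (
    "# Netscape HTTP Cookie File\n"
    "example.com\tFALSE\t/\tFALSE\t4102444800\tsession\tabc123\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write(cookie_text)
    cookie_path = handle.name

opener, jar = opener_from_cookie_file(cookie_path)
os.remove(cookie_path)

loaded = {cookie.name: cookie.value for cookie in jar}
```

Requests made with `opener.open(...)` would then send the matching cookies automatically.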

berkle4455 1182 days ago

Javascript is delivered as text and sends text-based HTTP calls to the server to fetch more data. Why do you need selenium?

LelouBil 1181 days ago

If you don't want to reverse engineer the JavaScript.

KomoD 1181 days ago

Most of the time you don't need to: just open up DevTools, look at the Network tab, and locate the right request(s).

rhd 1182 days ago

I once used Selenium to run JavaScript in the webpage to steal a few dynamic tokens required by the site's API, to reuse in my more well-trodden python-requests workflow.

bdcravens 1182 days ago

If you'd be standing up CDP to grab the cookies, you'd probably use Puppeteer or Playwright instead of Selenium.

juujian 1182 days ago

Appreciate the recommendation. I just used whatever Python had to offer; Puppeteer looks promising though!

bdcravens 1182 days ago

Using the tools at hand is often the best approach. That said, I've spent most of the last 13 years of my career automating browsers. For years, I used Selenium with a variety of libraries. After switching to Puppeteer/Playwright, I have zero interest in going back lol. Playwright actually has first-party Python support. (Puppeteer has a port called Pyppeteer, but it's no longer maintained and the author recommends using Playwright.)

https://playwright.dev/python/

rgrieselhuber 1181 days ago

I second Playwright, it's amazing.

robertlagrant 1181 days ago

Third.
