Hacker News
by juujian 1182 days ago
I have done a lot of scraping in the past. Cookies are a pain, and this is a really elegant solution. Of course, the biggest problem is that everything interesting is hidden behind JavaScript these days, so you have to resort to Selenium and the whole thing spirals out of control. But I'm looking forward to giving this a shot for non-JavaScript content in the future.

edit: JavaScript not Java

5 comments

Do you mean JavaScript? I have never run into content hidden by Java, but many pages load content dynamically using JavaScript.

I have found it's quite easy to snoop on those JavaScript API requests using the Network tab of Chrome DevTools, then copy the network request as a curl command for bash scripts or as JavaScript for browser extensions.
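For anyone who would rather replay such a request from Python than from curl, here is a minimal sketch using only the standard library. The endpoint, cookie value, and extra header are placeholders, not taken from any real site; in practice you would paste in whatever the Network tab shows for the request you found.

```python
import json
import urllib.request


def build_api_request(url, headers=None, payload=None):
    """Build (but do not send) a request mirroring one copied from DevTools."""
    merged = {"Accept": "application/json"}
    merged.update(headers or {})
    data = None
    if payload is not None:
        # A JSON body turns the request into a POST.
        data = json.dumps(payload).encode("utf-8")
        merged.setdefault("Content-Type", "application/json")
    return urllib.request.Request(url, data=data, headers=merged)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical endpoint and headers; replace with the values from DevTools.
req = build_api_request(
    "https://example.com/api/items?page=1",
    headers={"Cookie": "session=PLACEHOLDER", "X-Requested-With": "XMLHttpRequest"},
)
# items = fetch_json(req)  # uncomment once the URL and headers are real
```

Keeping the request construction separate from the network call makes it easy to inspect what will be sent before hitting the server.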

> I have never run into content hidden by Java

Tongue in cheek: You'd never know - servers running Java code generating HTML pages have probably conditionally not-rendered many pieces of HTML that you've never come across in your browsing :)

Yeah, you can sometimes find the API or find data sent in JavaScript but not in prerendered HTML, which can save you the pain of headless scraping.
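As a rough illustration of the "data sent in JavaScript but not in prerendered HTML" case: some pages ship their state as a JSON blob inside a script tag, and you can pull it out without a headless browser. This sketch assumes the blob sits in a `<script type="application/json">` tag; the sample HTML is made up, and real sites vary (some assign the data to a global variable instead, which needs different handling).

```python
import json
from html.parser import HTMLParser


class EmbeddedJSONParser(HTMLParser):
    """Collect and decode the contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self.blobs.append(json.loads("".join(self._chunks)))


def extract_embedded_json(html):
    """Return a list of decoded JSON blobs found in the page source."""
    parser = EmbeddedJSONParser()
    parser.feed(html)
    parser.close()
    return parser.blobs


# Made-up page source standing in for a real response body.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script id="state" type="application/json">{"items": [{"id": 1, "title": "first"}]}</script>
</body></html>
"""

data = extract_embedded_json(SAMPLE_HTML)
```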
I do mean JavaScript. Not sure how many times I have made that mistake... And great advice, that sounds like a neat approach.
The term "everything interesting" is of course subjective. What is interesting to person A might not be interesting to person B. I never use Selenium and I generally have no problem accessing "everything interesting". The simplest example is reading and submitting HN comments. Presumably we all find this interesting enough. JavaScript is not required to read, vote on, or submit to HN.

What if the phrase "everything interesting" were replaced with specific examples and questions? Something like, "I cannot access X without JavaScript. How do I access X without using JavaScript?"

HN is in the minority of websites in that it works completely without JS. Surely you're aware of this, right?
1. Define "works".

2. Provide examples of sites that do not "work".

It's possible that people might disagree on the definition of "works". For example, perhaps web developers might be biased toward a definition that puts them in control instead of the user. If I can retrieve information from a server with HTTP requests then the website "works" for me. As a user, I certainly do not need to use JavaScript to make HTTP requests. Nor do I need to use a particular client.

One could argue that even HN does not "work" completely without JavaScript. For example, the script at https://news.ycombinator.com/hn.js will not run.

There are Python libraries you can use that import cookies directly from wherever your browser stores them, for use in Selenium projects.
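A minimal sketch of the glue code this involves, assuming the library hands you a standard `http.cookiejar.CookieJar` (browser_cookie3 is one library that does this, but it is not used here so the example stays stdlib-only). The helper names and sample cookies are hypothetical; the output is the list-of-dicts shape that Selenium's `driver.add_cookie()` accepts.

```python
from http.cookiejar import Cookie, CookieJar


def cookiejar_to_selenium(jar, domain=None):
    """Convert cookies in a CookieJar to dicts suitable for driver.add_cookie()."""
    cookies = []
    for c in jar:
        host = c.domain.lstrip(".")
        # Optionally keep only cookies for the given domain or its subdomains.
        if domain and not (host == domain or host.endswith("." + domain)):
            continue
        entry = {
            "name": c.name,
            "value": c.value,
            "domain": c.domain,
            "path": c.path,
            "secure": bool(c.secure),
        }
        if c.expires is not None:
            entry["expiry"] = int(c.expires)
        cookies.append(entry)
    return cookies


def make_cookie(name, value, domain, expires=None):
    """Hypothetical helper: build a Cookie object for demonstration purposes."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=True, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )


# Stand-in for a jar loaded from the browser's cookie store.
jar = CookieJar()
jar.set_cookie(make_cookie("session", "abc123", ".example.com", expires=4102444800))
jar.set_cookie(make_cookie("tracker", "zzz", ".other.test"))

selenium_cookies = cookiejar_to_selenium(jar, domain="example.com")

# With a real Selenium driver you would first load a page on the same domain,
# then add each cookie:
#     driver.get("https://example.com/")
#     for cookie in selenium_cookies:
#         driver.add_cookie(cookie)
```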
JavaScript is delivered as text and sends text-based HTTP calls to the server to fetch more data. Why do you need Selenium?
If you don't want to reverse engineer the JavaScript.
Most of the time you don't need to. Just open up DevTools, look at the Network tab, and locate the right request(s).
I once used Selenium to run JavaScript in the webpage to steal a few dynamic tokens required by the site's API, to reuse in my more well-trodden python-requests workflow.
If you'd be standing up CDP to grab the cookies, you'd probably use Puppeteer or Playwright instead of Selenium.
Appreciate the recommendation. I just used whatever Python had to offer, but Puppeteer looks promising!
Using the tools at hand is often the best approach. That said, I've spent most of the last 13 years of my career automating browsers. For years, I used Selenium with a variety of libraries. After switching to Puppeteer/Playwright, I have zero interest in going back lol. Playwright actually has first-party Python support. (Puppeteer has a port called Pyppeteer, but it's no longer maintained and the author recommends using Playwright.)

https://playwright.dev/python/
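For the "grab cookies in a real browser, then reuse them elsewhere" workflow mentioned upthread, a rough sketch with Playwright's Python sync API might look like this. It assumes Playwright and a Chromium build are installed (`pip install playwright`, then `playwright install chromium`); the function names and sample cookies are made up for illustration.

```python
def cookie_header(cookies):
    """Join cookie dicts (each with 'name' and 'value') into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def collect_cookies(url):
    """Load a page in headless Chromium and return the cookies it set."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        page.goto(url)
        cookies = context.cookies()  # list of dicts with name, value, domain, ...
        browser.close()
    return cookies


# Made-up cookies standing in for what collect_cookies() would return.
sample = [{"name": "session", "value": "abc"}, {"name": "csrf", "value": "xyz"}]
header = cookie_header(sample)

# With Playwright installed, the real flow would be roughly:
#     header = cookie_header(collect_cookies("https://example.com/"))
# and the header can then be passed to any plain HTTP client.
```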

I second Playwright, it's amazing.
Third.