by Jarred 2059 days ago
For websites that use React, my favorite trick is loading a copy of React Developer Tools inside a headless Chrome instance.

From there, you just find the component you want to copy data from and copy its state or props. Very little string parsing or data formatting is required, and you don't have to deal with malformed data. There's a library floating around on GitHub somewhere that reduces this to a script you eval inside Puppeteer: it loads a simplified version of React Developer Tools and gives you a jQuery-like API for selecting React components. I can't remember the name right now.

Someone could probably do this without needing a headless web browser (via jsdom).

2 comments

Doesn't most/all react data come from xhr? Can't you just figure out how the xhr works, and simply parse that?

I did this with an investment website, where I was able to retrieve all the data using plain Python. It _should_ be more robust than parsing React components or HTML.
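
A minimal sketch of that approach, written in JavaScript to match the Puppeteer example elsewhere in this thread and assuming Node 18+ where `fetch` is global. The `fetchJson` helper and the URL in the usage comment are hypothetical; the real endpoint is whatever shows up in the browser's network tab.

```javascript
// Hypothetical helper: call the same JSON endpoint the page's XHR uses.
// fetchImpl is injectable so the function can be exercised without a network.
async function fetchJson(url, headers = {}, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, { headers });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (placeholder URL, not a real endpoint):
// const quotes = await fetchJson("https://example.com/api/quotes?symbol=ABC");
```

Throwing on a non-2xx status keeps a blocked or rate-limited response from being mistaken for data.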

> Doesn't most/all react data come from xhr? Can't you just figure out how the xhr works, and simply parse that?

Content-heavy websites using React often generate static versions of pages at build time (using e.g. https://nextjs.org/docs/advanced-features/automatic-static-o...). In those cases, there might not be a public API endpoint to fetch the data you want.

For applications, though, it's definitely easier to just make an HTTP request if you can. However, you're more likely to run into issues like APIs blocking datacenter IPs or rate limiting than when you appear to be loading the website like a human.
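
For the statically generated case, the data is often still sitting in the HTML. A hedged sketch: Next.js pages (at least with the pages router) typically embed their props as JSON in a `<script id="__NEXT_DATA__">` tag, so it can be read without running React at all. The `extractNextData` name and the sample HTML are made up for illustration, and the regex assumes the tag is present and well formed.

```javascript
// Pull the JSON payload out of a Next.js page's __NEXT_DATA__ script tag.
// Returns null when the tag is missing (e.g. the site is not built with Next.js).
function extractNextData(html) {
  const match = html.match(
    /<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/
  );
  return match ? JSON.parse(match[1]) : null;
}

// Made-up sample page for illustration.
const sampleHtml =
  '<div id="__next"></div>' +
  '<script id="__NEXT_DATA__" type="application/json">' +
  '{"props":{"pageProps":{"country":"us"}}}' +
  "</script>";

const data = extractNextData(sampleHtml);
console.log(data.props.pageProps.country); // prints "us"
```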

I'd add Postman to that workflow, especially if there are headers you need to know about that aren't obvious from the XHR URL. From the network tab of your browser's debugger, copy the network request as cURL, paste the cURL into Postman's import, and then click the "code" button to translate it to Python (or whatever else) code.
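
If Postman isn't at hand, the header-recovery step can be roughed out in a few lines. This sketch assumes the single-quoted format Chrome produces for "Copy as cURL" on macOS/Linux and only handles the URL and `-H` headers; `parseCurl` and the sample command are hypothetical, and anything else (cookies passed via `-b`, request bodies, escaped quotes) is ignored.

```javascript
// Extract the URL and headers from a "Copy as cURL" command.
// Only understands: curl '<url>' -H '<name>: <value>' ...
function parseCurl(command) {
  const urlMatch = command.match(/curl\s+'([^']+)'/);
  if (!urlMatch) {
    throw new Error("No single-quoted URL found after curl");
  }
  const headers = {};
  for (const [, line] of command.matchAll(/-H\s+'([^']*)'/g)) {
    const separator = line.indexOf(":");
    if (separator === -1) continue; // skip malformed headers
    const name = line.slice(0, separator).trim().toLowerCase();
    headers[name] = line.slice(separator + 1).trim();
  }
  return { url: urlMatch[1], headers };
}

// Made-up command for illustration.
const copied = `curl 'https://example.com/api/quotes?symbol=ABC' \\
  -H 'Accept: application/json' \\
  -H 'X-Requested-With: XMLHttpRequest' \\
  --compressed`;

const request = parseCurl(copied);
console.log(request.url); // prints "https://example.com/api/quotes?symbol=ABC"
console.log(request.headers); // headers keyed by lowercased name
```

The result can be passed straight to `fetch(request.url, { headers: request.headers })`.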

Could you explain a bit more about how you run the React DevTools in a headless Chrome? As far as I know, headless Chrome can't run extensions.

I don't precisely mean React Developer Tools, because the UI is unnecessary for this use case. The library I had in mind provides similar functionality: you can access the state/props from the component instance.

The library is: https://github.com/baruchvlz/resq

Example code:

    // resq is the stringified source of the library
    // page is a Puppeteer page
    // this line injects resq into the page
    await page.evaluate(resq);
    // This finds a React component with a prop "country" set to "us"
    const usProps = await page.evaluate(
      `window["resq"].resq$("*", document.querySelector("#__next")).byProps({country: "us"}).props`
    );
    // This finds a React component with a prop "expandRowByClick" set to true
    const news = await page.evaluate(
      `window["resq"].resq$("*", document.querySelector("#__next")).byProps({expandRowByClick: true}).props.dataSource`
    );

Thanks. I didn't know about resq.