| HN Mirror


	by redka 3035 days ago
	Why would you need PhantomJS for that? Can't you just parse the HTML files with Nokogiri and be done with it? That would be orders of magnitude faster anyway

3 comments

tnolet 3035 days ago

Big misunderstanding in browser land. The HTML delivered to you over the wire, the stuff Nokogiri sees, is not the stuff you see on your screen or even when doing a “view source”

nkozyra 3035 days ago

OK, obviously the stuff you see on your screen not matching the HTML delivered makes sense, but explain the HTML source not matching what's sent via the HTTP response. DOM can be modified, of course, JS can introduce more dynamic HTML, but view-source should always represent any non-redirected HTTP response. What is Nokogiri getting that the browser isn't (or vice versa)?

joatmon-snoo 3035 days ago

> view-source should always represent any non-redirected HTTP response

Not the grandparent, but generally in browsers you have two versions of HTML "source" - the canonical source, the stuff pulled down over HTTP, and the repaired source, the version that actually gets rendered.

I'm unfamiliar with Nokogiri, but I suspect that from context, it doesn't repair HTML in the same way that browsers do.

Kiro 3035 days ago

But it should be the same as "view source" right? The post replied to claims otherwise.

dewey 3035 days ago

No it's not. https://news.ycombinator.com/item?id=16514517

acdha 3035 days ago

It sounds like you are confusing View Source and the live developer tools DOM view.

apocalyptic0n3 3035 days ago

> JS can introduce more dynamic HTML, but view-source should always represent any non-redirected HTTP response

That is both true and false. Because the JS can introduce dynamic content, the source returned by the HTTP response often doesn't match the source that is rendered by the browser itself. In many cases, a site will return a skeleton (just HTML) and then make an Ajax request to populate it. In my case, it was just the skeleton HTML with a few hundred lines of JS plus a long string of JSON
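
To make the skeleton case concrete, here is a minimal sketch in Python rather than Nokogiri. The markup, the `results` table, and the `rows` variable are all made up for illustration and are not taken from the site discussed above. The point is only that a parser that does not execute JavaScript sees the empty skeleton:

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty table plus a script that
# would fill it in at runtime. None of these names come from the thread.
SKELETON = """
<html><body>
<table id="results"><tbody></tbody></table>
<script>
  var rows = [{"name": "alpha"}, {"name": "beta"}];
  // a real browser would run code here to append table rows
</script>
</body></html>
"""


class RowCounter(HTMLParser):
    """Counts <tr> start tags found in the markup as delivered."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_static_rows(markup: str) -> int:
    """Parse the markup without running any script and count table rows."""
    parser = RowCounter()
    parser.feed(markup)
    parser.close()
    return parser.rows


# The data exists only inside the script text, so no rows are found.
print(count_static_rows(SKELETON))
```

A browser runtime would report two rows for the same page once the script had run, which is the gap being described here.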

Kiro 3035 days ago

But we're not talking about the rendered source here. We're talking about "view source", which afaik always matches what is returned by the server.

The post replied to claims that Nokogiri doesn't see this however so I'm puzzled.

dewey 3035 days ago

"view source" shows the source after all the javascript ran. So what a client that doesn't execute javascript (like curl) sees is different from what you see in "view source".

That's also the reason why you had to "pre-render" your javascript web apps for SEO purposes until Googlebot got the ability to execute javascript.

madeofpalk 3035 days ago

I get what you're saying now, but I believe you're mistaken about "View Source".

I've never seen "View Page Source" or "Show Page Source" be the current DOM representation. It's always the HTML that came over the wire, the same you'll get from curl (unless the server is doing user agent shenanigans, which I think we can agree is out of scope here).

If you're talking about the page after Javascript has run, the only way you're seeing that is by opening the dev tools and looking in the 'Elements' or 'Inspector' panel.

I just checked in Safari, Chrome, and Firefox and found this to be true in all of them. The distinction between the View Source and DOM Inspector is very clear.

detaro 3035 days ago

In what browser is this the case? In Chrome and Firefox it isn't. In the dev tools, you see the rendered DOM, but view source shows you the HTML from the server.

apocalyptic0n3 3035 days ago

I had to actually render the HTML and run the Javascript in order to populate the HTML with the data I needed to parse. The HTML does not include the parse-able data by default and is populated at runtime from JSON embedded in the Javascript in the HTML.

As far as I am aware, Nokogiri isn't capable of that, and even if it is, I was unaware of that library at the time I wrote the Phantom solution (I only discovered it last summer and have yet to use it for anything).

redka 3035 days ago

No, Nokogiri isn't capable of that, so you need an actual browser runtime. I didn't think a downloadable site would have javascript populating the page with data. But if it's only from JSON embedded in the JS from the HTML, then I guess it's still possible to retrieve that, and unless it requires some processing, JSON is as good as you can get.
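
As a rough sketch of what retrieving the embedded JSON without a browser could look like, assuming the data sits in a plain `var name = {...};` assignment inside a script tag. The markup and the `pageData` name are hypothetical, and Python stands in for Ruby here:

```python
import json
import re

# Hypothetical markup; the variable name `pageData` is invented for illustration.
PAGE = """
<html><body>
<div id="app"></div>
<script>
  var pageData = {"items": [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]};
  render(pageData);
</script>
</body></html>
"""


def extract_embedded_json(markup: str, var_name: str):
    """Find `var <var_name> = ` and decode the JSON value that follows it."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", markup)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so the trailing semicolon and the rest
    # of the script do not need to be trimmed first.
    value, _end = json.JSONDecoder().raw_decode(markup, match.end())
    return value


data = extract_embedded_json(PAGE, "pageData")
print([item["title"] for item in data["items"]])
```

This only works when the embedded value is strict JSON. A JavaScript object literal with unquoted keys or single quotes would need a different approach.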

apocalyptic0n3 3035 days ago

The JSON was encoded (quotes and brackets were both HTML encoded) and couldn't reliably be parsed, or at least not in a way I was satisfied with. Rendering the HTML and actually building out the page as it would normally be rendered and using the parser that I already had built made way more sense. And, at the time, Phantom was the best option I could find for it.
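
For reference, the simplest version of the decoding step might look like the sketch below. The payload is invented, and this covers only the easy case where unescaping the entities yields valid JSON. It is not a reconstruction of the actual data or of the problems described above:

```python
import html
import json

# Hypothetical payload in which quotes and brackets arrive as HTML entities.
ENCODED = (
    "&#123;&quot;items&quot;: &#91;"
    "&#123;&quot;id&quot;: 1&#125;, &#123;&quot;id&quot;: 2&#125;"
    "&#93;&#125;"
)


def decode_entity_encoded_json(text: str):
    """Turn HTML entities back into characters, then parse the result as JSON."""
    return json.loads(html.unescape(text))


data = decode_entity_encoded_json(ENCODED)
print([item["id"] for item in data["items"]])
```

One way this can fail: if a string value itself contains an entity-encoded quote, unescaping turns it into a bare `"` inside the string and `json.loads` raises an error, which may be the kind of unreliability that makes rendering the page the safer route.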

forgotmypw 3035 days ago

I think you might have missed this part:

>In 2017, they switched to a system that loads in the data from JSON stored in Javascript in the HTML
