HN Mirror | Hacker News


	by redka 3035 days ago
	Well with Chrome going headless there isn't a whole lot of place for PhantomJS anyway. Or is there? What is it still good for?


apocalyptic0n3 3035 days ago

Legacy systems for one. The Cooperative Patent Classification group releases their classifications en masse as HTML (single zip download, which is great). I built a parser for a PHP project that could parse all several hundred thousand records from the HTML in a few minutes. In 2017, they switched to a system that loads in the data from JSON stored in Javascript in the HTML (it is every bit as terrible as you imagine). Obviously loading in the HTML and trying to use regex to match the JSON was a terrible idea (especially since it was encoded to boot...), so I instead used Phantom to load each file, render it, and save it to a temporary file which I then parse using the original pre-2017 parser. Like 10 lines of code in Phantom to do it.

Obviously with my situation, this is not the end of the world. I use the parser twice a year and Phantom will continue to handle that task just fine. But I also know that the switch to headless Chrome would be an expensive one if necessary: we have to research it, update local dev environments, implement it, write new tests for it, test it, update our deployment strategy and our server deployment configuration, and, worst of all, get all of these changes and new software installations approved by the USPTO, which is a nightmare. My situation is simple, but the switch would take several weeks to several months to actually deploy to production. As it stands, I will likely have to explain why we have a now-unmaintained piece of software on the server and may be forced to switch regardless.

I can easily imagine how this project sunsetting, even though there is a clear alternative and successor, could be a nightmare for a lot of people. It's not the end of the world, but it's definitely unfortunate.

feelin_googley 3035 days ago

Is this the data you were trying to parse?

https://www.cooperativepatentclassification.org/Archive.html

apocalyptic0n3 3034 days ago

Yes, but I just realized I was mistaken. The data I was talking about was the International Patent Classification. CPC was XML, IPC is HTML, and the former/now-deprecated US patent classification system was plain text. I have to deal with all three on a regular basis and have built importers for all three, and I forget which one is which.

IPC can be downloaded from the link below. I needed the Valid Symbol List. Looks like they fixed the encoded JSON that was there when they first put out the new format.

http://www.wipo.int/classifications/ipc/en/ITsupport/Version...

redka 3035 days ago

Why would you need PhantomJS for that? Can't you just parse the HTML files with Nokogiri and be done with it? That would be orders of magnitude faster anyway

tnolet 3035 days ago

Big misunderstanding in browser land. The HTML delivered to you over the wire, the stuff Nokogiri sees, is not the stuff you see on your screen or even when doing a “view source”.

nkozyra 3035 days ago

OK, obviously the stuff you see on your screen not matching the HTML delivered makes sense, but explain the HTML source not matching what's sent via the HTTP response. DOM can be modified, of course, JS can introduce more dynamic HTML, but view-source should always represent any non-redirected HTTP response. What is Nokogiri getting that the browser isn't (or vice versa)?

joatmon-snoo 3035 days ago

> view-source should always represent any non-redirected HTTP response

Not the grandfather, but generally in browsers you have two versions of HTML "source" - the canonical source, the stuff pulled down over HTTP, and the repaired source, the version that actually gets rendered.

I'm unfamiliar with Nokogiri, but I suspect that from context, it doesn't repair HTML in the same way that browsers do.

Kiro 3035 days ago

But it should be the same as "view source" right? The post replied to claims otherwise.

apocalyptic0n3 3035 days ago

> JS can introduce more dynamic HTML, but view-source should always represent any non-redirected HTTP response

That is both true and false. Because the JS can introduce dynamic content, the source returned by the HTTP response often doesn't match the source that is rendered by the browser itself. In many cases, a site will return a skeleton (just HTML) and then make an Ajax request to populate it. In my case, it was just the skeleton HTML with a few hundred lines of JS plus a long string of JSON.

Kiro 3035 days ago

But we're not talking about the rendered source here. We're talking about "view source", which afaik always matches what is returned by the server.

The post replied to claims that Nokogiri doesn't see this however so I'm puzzled.

apocalyptic0n3 3035 days ago

I had to actually render the HTML and run the Javascript in order to populate the HTML with the data I needed to parse. The HTML does not include the parse-able data by default and is populated at runtime from JSON embedded in the Javascript in the HTML.

As far as I am aware, Nokogiri isn't capable of that, and even if it is, I was unaware of that library at the time I wrote the Phantom solution (I only discovered it last summer and have yet to use it for anything).
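To make the skeleton-versus-rendered distinction concrete, here is a minimal sketch using Python's standard-library HTML parser as a stand-in for any static parser such as Nokogiri. The page content, the `symbols` table id, and the `CellCounter` class are all hypothetical, invented for illustration; they are not the actual IPC files.

```python
from html.parser import HTMLParser

# Hypothetical page: an empty table plus a script that holds the data.
# A browser would run the script and fill the table; a static parser will not.
PAGE = """<html><body>
<table id="symbols"></table>
<script>var data = [{"symbol": "A01B"}, {"symbol": "A01C"}];</script>
</body></html>"""


class CellCounter(HTMLParser):
    """Counts <td> elements and collects the raw text inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.cells = 0
        self.in_script = False
        self.script_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cells += 1
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive as plain text; nothing is executed.
        if self.in_script:
            self.script_text.append(data)


counter = CellCounter()
counter.feed(PAGE)
counter.close()
script_source = "".join(counter.script_text)

print("table cells seen by the static parser:", counter.cells)
print("data present only as script text:", "A01B" in script_source)
```

The static parser finds no table cells at all, because the rows only exist once something executes the script. That is the gap a browser runtime like Phantom fills.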

redka 3035 days ago

No, Nokogiri isn't capable of that so you need an actual browser runtime. I didn't think a downloadable site would have javascript populating the page with data. But if it's only from JSON embedded in the JS from the HTML then I guess it's still possible to retrieve that and unless it requires some processing a JSON is as good as you can get.

apocalyptic0n3 3035 days ago

The JSON was encoded (quotes and brackets were both HTML encoded) and couldn't reliably be parsed, or at least not in a way I was satisfied with. Rendering the HTML and actually building out the page as it would normally be rendered and using the parser that I already had built made way more sense. And, at the time, Phantom was the best option I could find for it.
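For reference, the regex-and-decode route being discussed would look roughly like the sketch below. The `symbols` variable name, the record shape, and the exact entities used are assumptions made up for illustration, not the real IPC format. It only works when the payload is encoded consistently and the surrounding JS has a predictable shape, which is exactly the kind of fragility described above.

```python
import html
import json
import re

# Hypothetical snippet: JSON whose quotes, brackets and braces are HTML-encoded,
# stored as a string inside inline JavaScript.
PAGE = (
    '<script>var symbols = "&#91;&#123;&quot;symbol&quot;: &quot;A01B&quot;&#125;, '
    '&#123;&quot;symbol&quot;: &quot;A01C&quot;&#125;&#93;";</script>'
)

# Pull the string literal out of the script. This is brittle: it assumes the
# variable name and quoting style never change.
match = re.search(r'var symbols = "(.*?)";', PAGE, re.DOTALL)
if match is None:
    raise ValueError("embedded payload not found")

# Turn the HTML entities back into characters, then parse as JSON.
decoded = html.unescape(match.group(1))
records = json.loads(decoded)

print([record["symbol"] for record in records])
```

If any part of the payload is encoded differently or the script layout shifts, the regex or `json.loads` fails, so rendering the page and reusing an existing parser can still be the lower-risk choice.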

forgotmypw 3035 days ago

I think you might have missed this part:

>In 2017, they switched to a system that loads in the data from JSON stored in Javascript in the HTML

minitoar 3035 days ago

Maintaining systems already built on top of PhantomJS.

toomuchtodo 3035 days ago

A bit concerning, as youtube-dl relies on PhantomJS currently.

netheril96 3035 days ago

youtube-dl will do fine. It is updated every few days, and with that level of activity, I think they will transition to headless Chrome in no time.

bklaasen 3035 days ago

Amazingly, youtube-dl works very reliably in Termux[1]. I can't see that surviving a transition to headless Chrome.

[1] https://termux.com/

paulie_a 3035 days ago

I am curious about this aspect and probably should do some research, but how will highcharts to PDF work?

PhantomJS was generally great for that type of rendering.

epx 3035 days ago

Not sure whether it is as easy to use as PhantomJS.

nkozyra 3035 days ago

I'd say Puppeteer is on-par with Phantom for ease of basic use. It has a richer, deeper API, of course, but at its core it's modern Javascript.

chucksmash 3035 days ago

+1 on Puppeteer. Using it for something now. For small projects, the ability to have the JS you want to run within the context of the page itself live side by side with your browser instrumentation code feels magical. Head and shoulders nicer experience than in cases where half of your logic is second class code-as-a-string (e.g. trying to work directly with Gremlin Server from a non-JVM language by POSTing Groovy-as-a-string)

vorg 3034 days ago

> half of your logic is second class code-as-a-string (e.g. trying to work directly with Gremlin Server from a non-JVM language by POSTing Groovy-as-a-string)

It must be particularly difficult when your Groovy-as-a-string script itself has many strings in its code, which is what a typical Apache Groovy build script for Gradle looks like.

epx 3035 days ago

Thanks for the info.

redka 3035 days ago

Well, that depends on whether you're stuck with JavaScript. There isn't anything simpler (that I'm aware of, but I've been doing web scraping/automation professionally for about 6 years) than watir[0]. PhantomJS doesn't even come remotely close.

[0] http://watir.com/