by fauigerzigerk 4748 days ago
PhantomJS is brilliant, but Selenium is a questionable choice for this task. For some reason, the creators of Selenium have decided that passing HTTP status codes back through the API is and always will be outside the scope of their project. So if you request a page and it returns 404, you have no way to find out (other than using crude heuristics). This makes Selenium completely unusable for anything I would have used it for.

Fortunately, you can do it by using PhantomJS directly instead of going through the Selenium WebDriver API. Maybe one day the PhantomJS WebDriver API implementation (GhostDriver) will extend the API to pass HTTP status information back to the caller. Until then, this API is unusable (at least for me).

3 comments

Well, I think the matter is a bit more complicated than that. When dealing with a full browser, you fetch a lot of resources. The status code for the first page fetch may be easily obtained, but your API gets very wonky as soon as you want to get status codes for all linked resources. Even if you managed that, any Ajax requests would complicate things, especially if they have deferred loading. And then you have WebSockets.

There are tools, such as BrowserMob Proxy, far better suited for monitoring HTTP traffic. And they'll get you all the headers. You can even capture to HAR so you can measure performance.
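
For what it's worth, once you have a HAR capture, pulling the status codes out is not much work. A minimal sketch, assuming a HAR 1.2 style structure (`log.entries[].request.url` and `log.entries[].response.status`); the function names and the sample data are made up for illustration:

```python
def statuses_from_har(har):
    """Return (url, status) pairs for every entry in a HAR capture."""
    entries = har.get("log", {}).get("entries", [])
    return [(e["request"]["url"], e["response"]["status"]) for e in entries]


def failed_requests(har, threshold=400):
    """Keep only the entries whose status is an HTTP error (>= threshold)."""
    return [(url, status)
            for url, status in statuses_from_har(har)
            if status >= threshold]


# Hypothetical capture, trimmed to the two fields used above
SAMPLE_HAR = {
    "log": {
        "entries": [
            {"request": {"url": "http://example.com/"},
             "response": {"status": 200}},
            {"request": {"url": "http://example.com/app.js"},
             "response": {"status": 200}},
            {"request": {"url": "http://example.com/missing.css"},
             "response": {"status": 404}},
        ]
    }
}

if __name__ == "__main__":
    for url, status in failed_requests(SAMPLE_HAR):
        print(status, url)
```

The first entry is normally the page itself, so the same data covers both the single-status case and the every-resource case.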

Difficult edge cases are never a good reason not to support the 99.9% case.

Also, phantomjs has access to all the information you want and the WebDriver API already has a capabilities negotiation facility.

[Edit] Don't forget that the original URL is the only one supplied by the client of the API. It may be incorrect for very different reasons than all the other resources included by the page itself. That's why it is justified to treat it as a special case.

These aren't edge cases. They're asked about constantly. Most people are using Selenium because they care about everything on the page. Otherwise, your stdlib HTTP client would be sufficient.
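
To make the stdlib point concrete, here is a rough sketch of getting the status of a single page fetch with Python's `urllib`. The helper name `fetch_status` is mine, and note that `urllib` follows redirects by default, so this reports the status of the final response rather than the first hop:

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the code is still on the exception
        return err.code
```

This tells you nothing about scripts, stylesheets, or Ajax calls, which is exactly the gap a real browser fills.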

That aside, if PhantomJS already has the info, you can always fetch it with executeScript.

If you do feel that strongly about the status code part though, I'd urge you to comment on the public draft of the W3C spec: http://www.w3.org/TR/webdriver/

From the point of view of simulating actual users, the fact that some random third-party resource on the page failed to load is not particularly relevant. That happens all the time as I browse around the web, and I never have to care about it as long as the site continues to function. So it very much is an edge case compared to the page itself failing to load.

A JavaScript file failing to load will bork most pages. A CSS file failing to load or a key image will cause most people to quit. And an Ajax request failing in a single-page app will render it useless.

But my point of view comes from actual Selenium users. It is informed by providing support on the IRC channel and the mailing lists, triaging the issue tracker, and interacting with people at SeleniumConf and the local Boston meetup. It's not some fringe use case and I'm not arguing the point for the sake of arguing it. The original supposition that it's an edge case is not accurate. And sure, the web breaks. That's why people using Selenium would like a way to catch that. And that's a big part of why the BrowserMob Proxy project exists.

"A JavaScript file failing to load will bork most pages. A CSS file failing to load or a key image will cause most people to quit."

Wha?

Sure, if, say, "app.js" fails to load, you have a problem.

But an analytics script?

A 3rd party ad script (which is what the GP gave as an example)?

These things can and do fail all the time.

I believe you can't use executeScript because any JavaScript you supply runs inside the page. You don't have access to the PhantomJS-specific callbacks you need to intercept HTTP traffic.

That's unfortunate. I don't work on PhantomJS, but I can try to track down someone on the team and see if there's a way to attach a handle to window or something.

To follow up on your edit, that may be true in one case. But it's perfectly reasonable to navigate via clicking, anything in the navigate API, JS actions, meta refreshes, and so on. Even in that one case, most people would expect redirects to be followed and basic auth protected pages to submit. Again, all tractable problems, but ones that are likely better handled by an interstitial layer where you can see the entire chain of requests & responses.

Browsers do see the entire chain of requests and responses. All of it. Some browsers make that information available externally. I just don't see why a browser remote control solution like Selenium shouldn't pass on as much of this information as possible.

PhantomJS handles everything you mention (status codes on large numbers of resources, Ajax, deferred-loading monitoring, and HAR output), with the possible exception of WebSockets. I have not tried those and there is very little documentation today, but it should work. The big limitation is that this is WebKit-only right now.

For example: here's the wiki on network monitoring including HAR: https://github.com/ariya/phantomjs/wiki/Network-Monitoring

The API seems pretty clean to me but I guess that is a matter of opinion.

You could always write a simple proxy in python and simply route all of your traffic through that.

See: http://voorloopnul.com/blog/a-python-proxy-in-less-than-100-...
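
To give an idea of the shape of such a proxy (this is not the code from the linked post), here is a stripped-down sketch using only the standard library. It handles plain-HTTP GET requests only, with no HTTPS/CONNECT, no POST, and no streaming; `SEEN`, `LoggingProxy`, and `make_proxy` are names invented for the example:

```python
import http.server
import urllib.error
import urllib.request

# (url, status) pairs observed by the proxy; a real tool would persist these
SEEN = []


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # surface 3xx responses instead of following them


# ProxyHandler({}) keeps urllib from routing through any system proxy
OPENER = urllib.request.build_opener(
    urllib.request.ProxyHandler({}), _NoRedirect)


class LoggingProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # A client configured to use a proxy sends the absolute URL as the path
        try:
            resp = OPENER.open(self.path, timeout=10)
            status = resp.status
        except urllib.error.HTTPError as err:
            resp, status = err, err.code  # HTTPError doubles as a response
        except urllib.error.URLError:
            self.send_error(502)  # upstream unreachable
            return
        body = resp.read()
        SEEN.append((self.path, status))
        self.send_response(status)
        for name in ("Content-Type", "Location"):
            value = resp.headers.get(name)
            if value:
                self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep stderr quiet; SEEN is the log


def make_proxy(port=0):
    """Create the proxy server on localhost (port 0 picks a free port)."""
    return http.server.HTTPServer(("127.0.0.1", port), LoggingProxy)


if __name__ == "__main__":
    make_proxy(8080).serve_forever()
```

Point the browser's HTTP proxy setting at the listening port and every page and sub-resource fetch ends up in `SEEN` with its status code.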

BrowserMob Proxy is the go-to tool for use with Selenium:

http://bmp.lightbody.net/

That would add quite a lot of complexity to achieve something rather trivial.

Aren't you stuck with JavaScript then? Sure, PhantomJS is awesome, but Python is even in the title, so it's not just a side note.

Yes. Unless you are parsing static HTML, you will need the rest of the browser's functionality, which is implemented as a JavaScript engine. You will also need the original content from the website, which will be in JavaScript.

In theory you could recreate this in another language such as Python but you would have to both parse the JavaScript from the website and implement a full browser.

No, phantomjs includes a webserver module. That's what ghostdriver uses to implement the WebDriver API and you can use it to implement a custom API that you call from Python. So you have to use JavaScript to implement the API, but you can use Python to implement your tests or web data extraction or whatever your actual task is.
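
On the Python side, calling such a custom API could look roughly like this. Everything about the service is an assumption for the sake of the sketch: the port, the `/status` route, and the JSON payload are invented, and the JavaScript that would serve them from inside PhantomJS is not shown.

```python
import json
import urllib.parse
import urllib.request

# The service is local, so skip any system-wide proxy settings
_LOCAL = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def page_status(target_url, api_base="http://127.0.0.1:8910"):
    """Ask a hypothetical PhantomJS-side service for target_url's status.

    Assumes the service answers GET /status?url=... with JSON such as
    {"url": "...", "status": 404}. Route, port, and payload are made up.
    """
    query = urllib.parse.urlencode({"url": target_url})
    with _LOCAL.open(api_base + "/status?" + query, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))["status"]
```

The test or scraping logic then stays in Python, and only the thin API layer has to be written in JavaScript.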