Hacker News
by dagenix 2977 days ago
This is one of the most frustrating issues I've had with the requests/urllib3 libraries. I have no idea what I'm supposed to do with half of an HTML page or a PDF or a PNG. And even if I did, I certainly don't know how I'm supposed to handle those things when I don't get an indication that I'm likely dealing with an incomplete response. I can construct a use case where I might want this behavior (maybe a FLIF image), but it feels like such a huge outlier to me that having this as the default behavior of requests, which caters to making the common cases easy, has always felt a bit odd.

I don't begrudge the authors for this at all. After all, these are people donating their time to work on libraries solving hard problems for free that I can then use as part of my job. There is nothing that has stopped me from forking these libraries and changing the behavior myself, other than a lack of time and that, while I dislike this behavior, it's only a problem somewhat infrequently. All that being said, I really wish the response to this issue had been something other than what feels like a language-lawyerly reading of the spec (requests isn't an "agent" and so that sentence in the spec doesn't apply) and the theory that someone _might_ be able to do something with an incomplete response _some_ of the time and, as such, the vastly more common case should be made much more complicated.

But, anyway, I'm glad to hear that this issue is being addressed and I thank the authors for their work!


Even after this is fixed the way you want it, webservers will still send you half of a PNG because it's a truncated file on the webserver. The webserver will give the truncated length as the length, the actual transfer will be the truncated length, requests will think all is well, and you'll have a problem.
We can't fix every problem. And fixing every problem shouldn't be a gating function for fixing some problems. Web servers generally want to send a complete response. And sometimes some middle box chops off the response halfway through. And if that origin server set a correct Content-Length header and the response isn't that long, we can be fairly certain that the response is probably incomplete. And I really can't come up with a non-contrived use-case where we want those incomplete responses to be delivered silently to client code.
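
For illustration, here is a minimal sketch of the kind of check being argued for: compare the declared Content-Length with the number of body bytes that actually arrived. The names `IncompleteBody`, `declared_length`, and `check_complete` are hypothetical and are not part of requests or urllib3; the sketch also assumes you have the raw, still-encoded body bytes, since Content-Length counts bytes on the wire, not bytes after gzip decoding.

```python
from typing import Mapping, Optional


class IncompleteBody(Exception):
    """Hypothetical error: fewer bytes arrived than Content-Length promised."""


def declared_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length as an int, or None if absent or malformed."""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "content-length":
            value = value.strip()
            if value.isascii() and value.isdigit():
                return int(value)
            return None
    return None


def check_complete(headers: Mapping[str, str], raw_body: bytes) -> None:
    """Raise IncompleteBody if raw_body is shorter than the declared length.

    raw_body must be the bytes as transferred (before any Content-Encoding
    is undone). With no usable Content-Length, nothing can be concluded,
    so the function stays silent.
    """
    expected = declared_length(headers)
    if expected is not None and len(raw_body) < expected:
        raise IncompleteBody(f"got {len(raw_body)} of {expected} bytes")
```

Called as `check_complete({"Content-Length": "10"}, b"abc")`, this raises `IncompleteBody`; a body of the declared length, or a response with no Content-Length at all, passes silently.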

It seems like what you are saying is that because a file might be corrupt we shouldn't worry about a completely unrelated case where the file is fine but the transfer is incomplete. By that logic, why do we worry about trying to report errors at all?

Er, that's not what I'm saying.

I'm saying that if you get a response from a webserver, you have to think about it being truncated. Period. No matter what the libraries you're using do for the many different cases of problem that can cause truncation.
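
As a rough sketch of what "thinking about it" can look like for the half-a-PNG case: a PNG file starts with a fixed 8-byte signature and ends with an empty IEND chunk, so a cheap sanity check on the payload catches a file that was already truncated on the server, which no Content-Length check can see. The function name `looks_like_complete_png` is hypothetical, and this is only a heuristic, not a full PNG validation; it will also reject files with trailing bytes after IEND.

```python
# Fixed 8-byte PNG file signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# The final chunk of every PNG: zero length, type "IEND", and its CRC.
PNG_IEND_CHUNK = b"\x00\x00\x00\x00IEND\xaeB`\x82"


def looks_like_complete_png(data: bytes) -> bool:
    """Heuristic: True if data starts and ends the way a whole PNG does."""
    return data.startswith(PNG_SIGNATURE) and data.endswith(PNG_IEND_CHUNK)
```

The same idea applies to other formats that have a recognizable end marker; formats without one need a real parse or a checksum supplied out of band.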

You seem to have read a lot into my words that I didn't intend to put there.

I fully agree that there can always be garbage coming back and that any robust client needs to be aware of that happening. I'm sorry if I read too much into what you were saying, but, it sounded like you were arguing that the change shouldn't be made because there are other ways that corruption could occur that this change wouldn't solve. But, it sounds like you were just warning that this change wouldn't be a panacea? Is that accurate?
I didn't want to express any opinion on the change; I use aiohttp for my Python-powered web-scale crawling needs. So yes, it's accurate that I was warning that the change would be incremental and not a panacea.

BTW, aiohttp currently has the problem of a too-strict HTTP protocol parser, so it throws errors for many rare cases of bad webservers that browsers don't have a problem with. As a crawler, I need to be able to work with whatever a browser will display, ...

Well, it sounds like we are in fierce agreement then. Sorry for misunderstanding your comment earlier.

I've been interested lately in exploring aiohttp, and your comment about it being too strict is certainly very enlightening. And I want to second your comment about libraries following the lead of what browsers will do. My strong feeling is that the battle over what HTTP libraries should do has been fought and decided by browsers, and that whatever browsers do is what HTTP libraries should do as well, regardless of how ugly it is (IIRC, when I last checked, browsers did pay attention to the Content-Length header and wouldn't display results that were shorter than it, but if I'm misremembering, I would happily change my position with respect to honoring this header). The purist in me hates to say that, but the pragmatist wants to get things done, and fighting against browser behavior feels counter-productive at this stage.