Hacker News
by dcomp 1969 days ago
Technically it isn't incorrect, but it is a violation of a SHOULD. I still think a SHOULD should be treated as a requirement for a general-purpose library, with reasons given whenever it is not followed.

Edit: And with HTTP having case-insensitive matching (which is most likely broken in lots of hand-written implementations), this is rife with possibilities for errors.

Taken from RFC 7230 3.2.4 Field parsing

  Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.

In the specific case of the Transfer-Encoding header, it is defined as,

     Transfer-Encoding = 1#transfer-coding

     transfer-coding    = "chunked" ; Section 4.1
                        / "compress" ; Section 4.2.1
                        / "deflate" ; Section 4.2.2
                        / "gzip" ; Section 4.2.3
                        / transfer-extension
     transfer-extension = token *( OWS ";" OWS transfer-parameter )
     token = 1*tchar
     tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
             "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
… which does not permit non-ASCII. As a generic header, maybe (but the decoding should be into ISO-8859-1, as the RFC notes…), but at the point at which you parse it into a Transfer-Encoding header, it is no longer valid.
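
A minimal sketch of that check, assuming the field value arrives as raw bytes. The helper name `transfer_coding_names` is made up for illustration, and the comma split is deliberately naive (it does not handle quoted-string parameters that contain commas):

```python
import re

# tchar set copied from the ABNF quoted above (token = 1*tchar).
TOKEN_RE = re.compile(rb"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def transfer_coding_names(field_value: bytes) -> list:
    """Return the lowercased coding names from a Transfer-Encoding value.

    Hypothetical helper: raises ValueError when a name is not a token.
    Splitting on "," is naive and ignores quoted-string parameters.
    """
    names = []
    for element in field_value.split(b","):
        element = element.strip(b" \t")  # OWS around list elements
        if not element:
            continue  # skip empty list elements
        # Everything before the first ";" is the coding name.
        name = element.split(b";", 1)[0].strip(b" \t")
        if TOKEN_RE.fullmatch(name) is None:
            raise ValueError("not a valid transfer-coding: %r" % (name,))
        names.append(name.decode("ascii").lower())
    return names


print(transfer_coding_names(b"gzip, Chunked"))  # ['gzip', 'chunked']
```

Any octet outside the tchar set, including every obs-text octet (0x80 and above), fails the match, so a value such as `b"chunk\xe9d"` is rejected at this layer even if the generic header parser let it through.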

> A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

This is not really a 'should', IMHO, because fields are defined as OCTETS, iirc. Based on that, a compliant and robust implementation must treat them as opaque data.

I still fight with case-sensitive matching breaking HTTP/2 -> HTTP/1.1 proxies.
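
For what it's worth, a sketch of the lookup that avoids this class of bug, assuming headers are kept as a list of (name, value) byte-string pairs. `get_header` is a hypothetical helper, not any particular library's API:

```python
def get_header(headers, name):
    """Return the first value whose field name matches `name`, ignoring case.

    `headers` is assumed to be a list of (name, value) byte-string pairs.
    """
    wanted = name.lower()  # bytes.lower() only touches ASCII letters
    for key, value in headers:
        if key.lower() == wanted:
            return value
    return None


# HTTP/2 delivers lowercase names; an HTTP/1.1 peer may send any casing.
h2_headers = [(b"transfer-encoding", b"chunked")]
print(get_header(h2_headers, b"Transfer-Encoding"))  # b'chunked'
```

Only the field name is folded here. Whether a value is case-insensitive depends on the individual header, so the value is returned untouched.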

RFC 7230 makes it a point not to make it a MUST, as that would make an unknown number of existing applications non-compliant with HTTP/1.1-as-redefined. They are free to treat the incoming headers as 8-bit ISO-8859-1 instead of dropping to 7-bit US-ASCII.

RFC 2616 defined header fields as OCTETs, and regarding this change RFC 7230 states:

> Non-US-ASCII content in header fields and the reason phrase has been obsoleted and made opaque (the TEXT rule was removed).

RFC 2616:

field-value = *( field-content | LWS )

field-content = <the OCTETs making up the field-value and consisting of either *TEXT or combinations of token, separators, and quoted-string>

Hence to me fields must be treated as opaque data for backward compatibility and robustness. If anything, existing applications that are compliant with RFC 2616 already do that, right? ;)

RFC 2616 OCTETs are defined as "<any 8-bit sequence of data>", quote unquote; nothing is said about their value being opaque.

      TEXT           = <any OCTET except CTLs,
                        but including LWS>

The IETF rewrote the productions not to use TEXT, but stopped short of banning the old behaviour.

So, for instance, 2616 states:

      Reason-Phrase = *<TEXT, excluding CR, LF>

and 7230 has:

      reason-phrase = *( HTAB / SP / VCHAR / obs-text )

This makes sure that any application that conforms to 2616 still conforms to 7230, by not making it illegal (a MUST) to parse obs-text... just something you SHOULD NOT do. They are simply making it so that any newly added header is defined as SP / VCHAR only (quoted, possibly).

Let's not argue semantics here. An arbitrary sequence of bytes is an opaque data type: it has no structure and no meaning, no assumptions can be made about it, and it must simply be passed on as is because it can be anything.

That's why they write that it should be treated as opaque data. My point (and the point of the comment I was replying to) is that 'should' is perhaps too weak a word in this context because of the previous history. In any case, for robustness it is a must to treat it that way.
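
To make "passed on as is" concrete, here is a small sketch assuming an implementation that wants a string type internally. The names `to_text` and `to_wire` are made up. Decoding as ISO-8859-1 (Python's `latin-1` codec) maps every octet to exactly one code point, so the original bytes always come back unchanged:

```python
def to_text(raw: bytes) -> str:
    """Decode a field value without interpreting it: one octet, one code point."""
    return raw.decode("latin-1")


def to_wire(text: str) -> bytes:
    """Inverse of to_text; restores the exact octets that were received."""
    return text.encode("latin-1")


raw = b'attachment; filename="r\xe9sum\xe9.txt"'
assert to_wire(to_text(raw)) == raw  # obs-text survives the round trip

# By contrast, a strict 7-bit decode fails on the same input.
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    print("ascii decode rejects obs-text")
```

Keeping the value as `bytes` end to end is the other obvious option; the round trip above only matters if some layer insists on `str`.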