Requiring every proxy and web server to implement their own cache hashing algorithm, especially one that should ignore encoding-specific "non-consequential" parts, sounds like a monumentally bad idea.
The part where a cache SHOULD have “knowledge of the semantics of the content itself” in combination with “normalization is performed solely for the purpose of generating a cache key; it does not change the request itself” is the scary part.
It may sound cool and efficient on paper, just trim the whitespace and sort all json dictionaries right? But in practice it adds too much complexity, eventually implementations of this semantics will start to drift between cache and real backend. Case in point: SAML XML signatures.
This is how one creates a cache poisoning vulnerability. If a request is normalized as a cache key, use the normalized request when sending to the backend also. If you don’t trust that process you shouldn’t trust it as the cache key either.
Proxies should be dumb, just hash the raw string for the cache key.
This. Plus it is a good idea to specify the minimal recommended hash algorithm to have some manageable expectations on collisions. "The cache key collision rate is guaranteed to be not worse than SHA-256".
The cache is local to the proxy or web server, it doesn't matter what the hashing algorithm is as long as the cache accurately returns cached results given the same inputs. The semantic meaning of "input" is different for if it's a proxy or if it's the origin. The origin web server could very well cache based on the result of post processing and validation of the input, while a proxy should cache based on a much more strict (exact series of bytes) interpretation of the input.
This is no different than how any other caching proxy is expected to operate given a set of inputs. It's never been up to the proxy to interpret if the queries "name=joe%20user" and "first=joe&last=user" are the same, it just passes the input along to its upstream and then locally stores the result, assuming that the same input will occur again and save a trip to upstream.
You are assuming that proxies will correctly determine which content does not matter. From what I've seen, what will most likely happen is that we will be spending countless hours just because some box is sometimes returning wrong content, because it decides that the request is "the same".
I don't mind caching, but please make it deterministic.
I am assuming that dealing with how and what proxies cache is a long standing potential issue that anything here does not change at all. A caching proxy could currently use a truncated path on a GET request to build a cache key and it would not be caching the correct data. Section 2.1 in its entirety, along with the defined meanings of MAY, MUST and SHOULD, tell caches how to operate. A caching proxy not caching and returning the right content because it does not take into account the necessary data when determining the cache key is broken, and semantic understanding of the request is a local concern so any non-local cache needs to acknowledge that it doesn't have semantic understanding and use the byte array that is the body as a input to determine a cache key.
So? Just deal with it however one would deal with unruly POST requests, slow-walked multi-part, and other protocol abuse. No matter what, you need protection against bad actors trying to get the servers to do bad things.
It may sound cool and efficient on paper, just trim the whitespace and sort all json dictionaries right? But in practice it adds too much complexity, eventually implementations of this semantics will start to drift between cache and real backend. Case in point: SAML XML signatures.
This is how one creates a cache poisoning vulnerability. If a request is normalized as a cache key, use the normalized request when sending to the backend also. If you don’t trust that process you shouldn’t trust it as the cache key either.
Proxies should be dumb, just hash the raw string for the cache key.