| HN Mirror


	by bornfreddy 1119 days ago
	Requiring every proxy and web server to implement their own cache hashing algorithm, especially one that should ignore encoding-specific "non-consequential" parts, sounds like a monumentally bad idea.

2 comments

Too 1119 days ago

The part where a cache SHOULD have “knowledge of the semantics of the content itself” in combination with “normalization is performed solely for the purpose of generating a cache key; it does not change the request itself” is the scary part.

It may sound cool and efficient on paper: just trim the whitespace and sort all JSON dictionaries, right? But in practice it adds too much complexity, and eventually the cache's implementation of these semantics will start to drift from the real backend's. Case in point: SAML XML signatures.

This is how one creates a cache poisoning vulnerability. If a request is normalized to build the cache key, send that same normalized request to the backend too. If you don’t trust that process you shouldn’t trust it as the cache key either.

Proxies should be dumb, just hash the raw string for the cache key.
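
To make the drift concern concrete, here is a minimal sketch, assuming a cache that normalizes JSON with a last-value-wins parser and a backend that keeps the first value of a duplicate key. The function names and the two request bodies are hypothetical, chosen only for illustration; they are not taken from any real proxy or from the draft being discussed.

```python
import hashlib
import json


def raw_key(body: bytes) -> str:
    """Cache key from the exact request bytes; no parsing involved."""
    return hashlib.sha256(body).hexdigest()


def normalized_key(body: bytes) -> str:
    """Hypothetical 'semantic' key: parse JSON, sort keys, re-serialize, hash."""
    parsed = json.loads(body)  # Python's json keeps the LAST value of a duplicate key
    canonical = json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def first_wins_parse(body: bytes) -> dict:
    """Stand-in for a backend whose parser keeps the FIRST value of a duplicate key."""
    def hook(pairs):
        result = {}
        for key, value in pairs:
            result.setdefault(key, value)  # ignore later duplicates
        return result
    return json.loads(body, object_pairs_hook=hook)


# Two different byte strings (assumed example bodies).
attacker = b'{"user": "alice", "user": "admin"}'
victim = b'{"user": "admin"}'

# The normalizing cache treats them as one entry...
same_normalized = normalized_key(attacker) == normalized_key(victim)
# ...the raw-bytes cache does not...
same_raw = raw_key(attacker) == raw_key(victim)
# ...and the assumed backend sees two different requests.
backend_agrees = first_wins_parse(attacker) == first_wins_parse(victim)

print(same_normalized, same_raw, backend_agrees)
```

Under these assumptions the normalizing cache would store the backend's response to the first body and serve it for the second, even though the backend would have answered the two differently. Hashing the raw bytes avoids that at the cost of a lower hit rate.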

garganzol 1119 days ago

This. Plus it is a good idea to specify a minimum recommended hash algorithm, so that there are manageable expectations about collisions. For example: "The cache key collision rate is guaranteed to be not worse than SHA-256".
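
For a rough sense of what such a guarantee buys, the standard birthday-bound approximation can be sketched as below. It assumes an ideal hash with uniformly distributed output; the 64-bit case stands in for a hypothetical truncated or weaker key and is not something the draft proposes.

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ≈ 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes an ideal hash whose outputs are uniformly distributed.
    """
    exponent = -num_keys * (num_keys - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# One billion cached entries (an assumed workload size).
print(collision_probability(10**9, 64))   # hypothetical 64-bit key
print(collision_probability(10**9, 256))  # SHA-256-sized key
```

With a billion entries the 64-bit key has a collision probability of a few percent, while the 256-bit figure is far too small to matter in practice, which is the kind of expectation a stated minimum would pin down.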

thwarted 1119 days ago

The cache is local to the proxy or web server, so it doesn't matter what the hashing algorithm is, as long as the cache accurately returns cached results given the same inputs. The semantic meaning of "input" differs depending on whether it's a proxy or the origin. The origin web server could very well cache based on the result of post-processing and validating the input, while a proxy should cache based on a much stricter interpretation of the input (the exact series of bytes).

This is no different than how any other caching proxy is expected to operate given a set of inputs. It's never been up to the proxy to interpret whether the queries "name=joe%20user" and "first=joe&last=user" are the same; it just passes the input along to its upstream and then stores the result locally, assuming that the same input will occur again and that the stored result will save a trip to upstream.
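
The behavior described here can be sketched as a byte-exact cache in front of an upstream. This is a minimal illustration, not how any particular proxy is implemented; `ByteExactCache` and `fake_upstream` are hypothetical names, and eviction, TTLs, and headers are deliberately left out.

```python
import hashlib
from typing import Callable, Dict


class ByteExactCache:
    """Caches upstream responses keyed on the exact request bytes."""

    def __init__(self, upstream: Callable[[bytes], bytes]) -> None:
        self._upstream = upstream
        self._store: Dict[str, bytes] = {}
        self.upstream_calls = 0  # counts cache misses, for illustration

    def fetch(self, request: bytes) -> bytes:
        # No interpretation of the input: the key is a hash of the raw bytes.
        key = hashlib.sha256(request).hexdigest()
        if key not in self._store:
            self.upstream_calls += 1
            self._store[key] = self._upstream(request)
        return self._store[key]


def fake_upstream(request: bytes) -> bytes:
    """Hypothetical origin that echoes what it was asked."""
    return b"response for " + request


cache = ByteExactCache(fake_upstream)
cache.fetch(b"name=joe%20user")
cache.fetch(b"name=joe%20user")      # same bytes: served from the cache
cache.fetch(b"first=joe&last=user")  # different bytes: goes upstream
print(cache.upstream_calls)
```

The two queries might mean the same thing to the origin, but the proxy never has to decide that. It only ever answers "have I seen these exact bytes before?".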

bornfreddy 1119 days ago

You are assuming that proxies will correctly determine which content does not matter. From what I've seen, what will most likely happen is that we will spend countless hours debugging because some box sometimes returns the wrong content, having decided that the request is "the same".

I don't mind caching, but please make it deterministic.

thwarted 1118 days ago

I am assuming that how and what proxies cache is a long-standing potential issue, and that nothing here changes it at all. A caching proxy could currently use a truncated path on a GET request to build a cache key, and it would not be caching the correct data. Section 2.1 in its entirety, along with the defined meanings of MAY, MUST and SHOULD, tells caches how to operate. A caching proxy that fails to cache and return the right content because it does not take the necessary data into account when determining the cache key is broken. Semantic understanding of the request is a local concern, so any non-local cache needs to acknowledge that it doesn't have semantic understanding and use the byte array that is the body as an input to determine a cache key.

preseinger 1119 days ago

you're papering over the important details

namely, urls are finite, bodies are infinite

kortex 1119 days ago

So? Just deal with it however one would deal with unruly POST requests, slow-walked multi-part, and other protocol abuse. No matter what, you need protection against bad actors trying to get the servers to do bad things.

preseinger 1119 days ago

i'm not sure how malicious requests are relevant to this conversation

specifically, a URL can basically always be fully read and cached in memory in a server, a request body cannot
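
One way the two positions in this exchange can be reconciled is to hash the body incrementally and give up past a size limit. The sketch below assumes a file-like body stream; `streaming_body_key`, `CHUNK_SIZE`, and the `max_bytes` cutoff are hypothetical and only illustrate that the key itself needs constant memory.

```python
import hashlib
import io
from typing import BinaryIO, Optional

CHUNK_SIZE = 64 * 1024  # assumed read size; only one chunk is held at a time


def streaming_body_key(body: BinaryIO, max_bytes: int) -> Optional[str]:
    """Hash a request body chunk by chunk.

    Returns None when the body exceeds max_bytes, meaning "do not cache".
    """
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = body.read(CHUNK_SIZE)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            return None  # too large: treat the request as uncacheable
        digest.update(chunk)
    return digest.hexdigest()


payload = b"x" * 200_000
print(streaming_body_key(io.BytesIO(payload), max_bytes=1_000_000))
print(streaming_body_key(io.BytesIO(payload), max_bytes=1_000))
```

This only bounds the memory needed to compute the key. The cache still has to read the whole body before it can answer a hit, which a URL-keyed cache never has to do, so the asymmetry raised above does not go away.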
