Hacker News new | ask | show | jobs
by jiggawatts 1329 days ago
Formatting a string twice is what lead to the Log4j security vulnerability. It had a macro language that allowed user-controlled input to have macros expanded in an unexpected place. Essentially the macro input itself could contain macros.

Your example where you use a sprintf format-string placeholder inside an incomplete JSON snippet ought never be used! Ever.

It's not needed in practice. Construct the object graph and insert the parameters there (unescaped!) and then serialize the whole thing.

E.g.:

    JsonConverter.Serialize( new {
        // Use a static format string! Never let users control this...
        context = string.Format( "{0:N1}", userControlledMaliciosInput ),
        alsoThis = "... json snippet...",
    });
This is fine.

But as I was saying, the "even better" solution is to not serialize this into JSON and then "work with the string representation". The use of JSON[1] should be a detail transparent to 99% of the application. You should be able -- safely -- to switch it out for XML, gRPC, Cap'n Proto, or whatever.

Going back to my logging example, it ought not to matter what wire format something like OpenTelemetry uses. You should be able to use "rich" object graphs in logging calls, and then let the library figure it out. E.g.:

    Log.Information( "user submitted a form", new {
        context = string.Format( "{0:N1}", userControlledMaliciosInput ),
        alsoThis = "... json snippet...",
    });
Ideally, everything should treat this as the native "object" graph whenever the developer interacts with it. Only the "edges", such as RPC serialization or deserialization needs to deal with encoding, at which point it'll need to use exactly one escaping/encoding format and not have worry about nested encodings at all.

[1] A mistake in the design of JSON is that it appears to be "simple", so beginner programmers think they understand it and can work with it "directly" using string manipulation. This is ill-defined at the best of times, and downright unsafe in surprisingly common scenarios: http://seriot.ch/projects/parsing_json.html

1 comments

We're stuck discussing JSON and other data transfer encodings, which is partly my fault as I brought it up, but there are far more scenarios for using combined text encodings.

It is very common to have templating languages which include their own syntax + the syntax of a target output language (e.g. Markdown supports HTML snippets that should get output to the final HTML as is; C macros support C code snippets, and C itself supports Assembler snippets that should end up in the final binary etc). When generating/processing the mixed format from your own code, you may often hit the problems above.

Even for JSON, there are legitimate reasons for processing stored JSON documents as text, or at least situations where it seems a safe enough approach - because people tend to forget that a string representation of a JSON document that has user-controlled input should be itself considered untrusted user input in its entirety, at least unless it is parsed by a JSON parser.

Additionally, data often has to be stored to unstructured storage (e.g. disk) between the moment you receive untrusted user input and the moment you output the final format to the user - again, doing the easy thing of storing in the intermediate format with the first level of escaping of untrusted input is extremely tempting, and the alternative is significantly more difficult.

All of the use-cases you listed I would flag in a code-review as fundamentally misguided.

If you have formats "A" and "B" with serialization functions A() and B() that take document object models as inputs (not strings!), then nesting them is valid, but a bit of a code smell.

What you're saying is that there are scenarios where A() and B() take strings and return strings, and those strings can have control codes that "mean something" for A and/or B.

That's inherently bad and dangerous, and was the direct cause of one of the WORST vulnerabilities in history. Literally as bad as anything ever out there.

You're saying "maximum bad" is a good idea sometimes. This is like making the argument that a little nuclear war is acceptable on occasion.

> there are legitimate reasons for processing stored JSON documents as text

No, there isn't. Stop. Never do this. Ever.

Don't parse HTML or XML with Regex either. It leads to m̷͉̈a̴̳̚d̶̟̐n̴̩̓e̷̘̿s̴̤͆s̵͉͗: https://stackoverflow.com/questions/1732348/regex-match-open...