Hacker News new | ask | show | jobs
Show HN: JSON-Threat-Protection Rust High-Performance Crate (github.com)
34 points by ADD-SP 697 days ago
4 comments

For things that are claimed to be high-performance, it would be helpful to see some numbers without running it locally on our own json files.
Makes sense, numbers added.
Excellent! I think your "faster%" is calculated in a way that understates the speedup. In the last row, the document is processed in a bit less than half the time, so the speedup should be a bit more than 100%.
Haha, looks like the GitHub Copilot is not good at marketing, I have made it more marketable. Thanks for your pointing out!
"Whether to allow duplicate object entry names." This is interesting. I just did a test and it look like `jq` evaluates `{ "a": 1, "a": 2 }` to just `{ "a": 2 }`. I have always thought that this was invalid JSON. This mean that the order of keys in JSON do have some semantic meaning.
The JSON RFC (https://datatracker.ietf.org/doc/html/rfc8259#page-6) doesn't require the unique entry name, and also the fact is that many parser uses the last-win strategy like serde_json.

For human, this is invalid, but many web services accepts this kind of JSON consciously or unconsciously.

I'm guessing this may have become a feature of some services and it's hard for maintainers to break this behavior. ᵕ︵ᵕ

Btw YAML would be a proper superset of JSON if it wasn't for the fact that yaml doesn't allow repeated fields while JSON is relaxed about that.

That's just a small detail though. You can for all intents and purposes out JSON objects in YAML files and I'm still puzzled while so many people fiddle with indent in helm templates instead of just using toJson

Some YAML parsers support duplicate keys (IIRC, Ruby does…or at least whatever GitLab uses does). The disparate state of YAML parsers is what makes me sad about it…it seems like just a hard spec to implement.
For security researchers it’s also interesting which implementations parse with first-win strategy and which allow comments (I think Ruby does this).
Interestingly, ECMA-404 says the following:

> The goal of this specification is only to define the syntax of valid JSON texts. Its intent is not to provide any semantics or interpretation of text conforming to that syntax.

So it is legal JSON although not useful with a lot of concrete implementations. Maybe a way to find an exciting security vulnerability involving two parsers differing in their interpretation...

Perhaps checking a service's behavior in response to such JSON is high on the security researcher's list of things to do that are high priority and simple.

"( – ⌓ – )

"It is expected that the json-threat-protection crate will be faster than the serde_json crate because it never store the deserialized JSON Value in memory, which reduce the cost on memory allocation and deallocation."

"As you can see from the table, the json-threat-protection crate is faster than the serde_json crate for all datasets, but the number depends on the dataset. So you could get your own performance number by specifying the JSON_FILE to your dataset."

However:

"This project is not a parser, and never give you the deserialized JSON Value!"

Is this performance comparison to serde_json fair? If serde_json is a parser and has a different feature set than json-threat-protection, does it make sense to compare performance?

> If serde_json is a parser and has a different feature set than json-threat-protection, does it make sense to compare performance?

If you were using serde_json just to validate a payload before passing it on to another service (like a WAF), then the comparison makes sense. If you had more complex validations or wanted to extract some of the data, then maybe not.

Totally agreed, this is also what I want to say.
This crate is not an alternative of the serde_json, it only do the validation.

Currently, there is no other crates do the sames validation works on JSON, so I have to parse the dataset by a common JSON parser (sede_json) and do the same validation on its deserialized value as the comparable results.

So it would be better to compare to other crates which do the same work, but I didn't found the similar crate so far. And this is also the reason I developed this crate.

I don't think it was intended to say that this crate is "better" than serde_json. I interpreted it to be a measurement of the overhead of adding it as an additional step on top of parsing.
I think you may have misunderstood the article.

The point of the article is to parse AND validate input AT THE BOUNDARY between the outside world and your program, rather than a bunch of ad-hoc validations at various points after the suspect data has entered the castle walls and has already been (at least partially) processed (thus making the program state harder to reason about). By enforcing your invariants at the border, you ensure that all data entering your system always conforms to your expectations, just like a strong type system ensures that invalid states are not representable. A schema is basically a type system for your raw data.

This concept is also a major element of Domain Driven Design https://en.wikipedia.org/wiki/Domain-driven_design

Great to see this article, I totally agreed with the view that rejecting any invalid case by designing the right data structure.

Unfortunately, it is hard to achieve it in practice and people even don't realize this, JSON Object is a good example, Human are incline expecting the duplicated key is not allowed in JSON, but it happens.

For this goal, I think the Protobuf is good way to eliminate the possible invalid data for data transportation.