by zelphirkalt 867 days ago
I always understood "parse don't validate" a bit differently. If you are doing the validation inside of a constructor, you are still doing validation instead of parsing. It is safer to do the validation in one place you know the execution will go through, of course, but that is not what I understand "parse don't validate" to mean. I understand it to mean: "write an actual parser; whatever passes the parser can be used in the rest of the program", where a parser is, for example, a set of grammar rules or a PEG.

I'm not a Haskell developer, so it's possible that I misunderstood the original "Parse, Don't Validate" post.

>If you are doing the validation inside of a constructor, you are still doing validation instead of parsing.

Why would that be considered validation rather than parsing?

From the original post:

>Consider: what is a parser? Really, a parser is just a function that consumes less-structured input and produces more-structured output.

That's the key idea to me.

A parser enforces checks on an input and produces an output. And if you define an output type that's distinct from the input type, you allow the type system to "preserve" the fact that the data passed a parser at some point in its life.
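To make that concrete, here is a minimal Python sketch of what I mean. The `Username` type, `parse_username`, and the character rule are all made up for illustration; they aren't from the original post. Python also won't stop someone from calling `Username(...)` directly, so this leans on convention plus a type checker rather than on the language itself.

```python
import re
from dataclasses import dataclass

# Hypothetical rule for illustration: 3-20 ASCII letters, digits, or underscores.
_USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,20}")


@dataclass(frozen=True)
class Username:
    """A string that is assumed to have come out of parse_username."""
    value: str


def parse_username(raw: str) -> Username:
    """Less-structured input (str) in, more-structured output (Username) out."""
    if _USERNAME_RE.fullmatch(raw) is None:
        raise ValueError(f"invalid username: {raw!r}")
    return Username(raw)


def greet(user: Username) -> str:
    # The parameter type records that the check already happened, so it is
    # not repeated here. A type checker such as mypy would flag a bare str.
    return f"hello, {user.value}"
```

The internal representation is still a string, but the rest of the program only ever accepts `Username`, so the "was this checked?" question is answered by the type.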

But again, I don't know Haskell, so I'm interested to know if I'm misunderstanding Lexi Lambda's post.

Parse don't validate means that if you want a function that converts an IP address string to a struct IpAddress{ address: string }, you don't validate that the input string is a valid IP address and then return a struct with that string inside. Instead, you parse that IP into raw integers, then join those back into an IP string.

The idea is that your parsed representation and serializer are likely to produce a much smaller and more predictable set of values than may pass the validator.
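A rough sketch of that shape in Python, for IPv4 only. The names `IPv4Address` and `parse_ipv4` are invented for this example, and the parser is deliberately lax (it accepts leading zeros and any Unicode decimal digit) to show that the serialized output stays predictable anyway.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IPv4Address:
    # Four integers, each 0..255; the original text is not kept.
    octets: Tuple[int, int, int, int]

    def __str__(self) -> str:
        # Formatting integers can only emit ASCII digits and dots.
        return ".".join(str(o) for o in self.octets)


def parse_ipv4(raw: str) -> IPv4Address:
    parts = raw.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated fields: {raw!r}")
    octets = []
    for part in parts:
        # isdecimal() is True for any Unicode decimal digit, not only 0-9.
        if not part.isdecimal():
            raise ValueError(f"not a number: {part!r}")
        n = int(part)
        if n > 255:
            raise ValueError(f"octet out of range: {n}")
        octets.append(n)
    return IPv4Address((octets[0], octets[1], octets[2], octets[3]))
```

Whatever passes `parse_ipv4`, `str(...)` of the result is always four ASCII decimal numbers joined by dots.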

As an example there was a network control plane outage in GCP because the Java frontend validated an IP address then stored it (as a string) in the database. The C++ network control plane then crashed because the IP address actually contained non-ASCII "digits" that Java with its Unicode support accepted.

If instead the address was parsed into 4 or 8 integers and was reserialized before being written to the DB this outage wouldn't have happened. The parsing was still probably more lax than it should have been, but at least the value written to the DB was valid.

In this case it was funny Unicode, but it could be as simple as 1.2.3.04 vs 1.2.3.4. By parsing then re-serializing you are going to produce the more canonical and expected form.
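Here is a small Python comparison of the two approaches on exactly those inputs. This is only an illustration, not a reconstruction of the incident; the function names are made up. It relies on the fact that in Python 3 `\d` in a `str` regex and `int()` both accept non-ASCII decimal digits.

```python
import re
from typing import List

# In Python 3, \d matches any Unicode decimal digit, not just ASCII 0-9.
_LOOKS_LIKE_IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def _octets(raw: str) -> List[int]:
    """Shared check: four 1-3 digit fields, each at most 255."""
    if _LOOKS_LIKE_IPV4.fullmatch(raw) is None:
        raise ValueError(f"not an IPv4 address: {raw!r}")
    octets = [int(field) for field in raw.split(".")]
    if any(o > 255 for o in octets):
        raise ValueError(f"octet out of range: {raw!r}")
    return octets


def validate_only(raw: str) -> str:
    _octets(raw)
    return raw  # the original text flows onward unchanged


def parse_and_reserialize(raw: str) -> str:
    # The original text is discarded; output is rebuilt from the integers.
    return ".".join(str(o) for o in _octets(raw))


arabic_indic = "\u0661.\u0662.\u0663.\u0664"  # looks like 1.2.3.4, not ASCII

for candidate in ("1.2.3.04", arabic_indic):
    print(repr(validate_only(candidate)), "->", repr(parse_and_reserialize(candidate)))
```

Both functions accept both inputs, but only the second one hands downstream code the plain `1.2.3.4` form.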

Perhaps "normalize" or "canonicalize" is more appropriate. A parser can liberally interpret but I don't take it to imply some destructured form necessarily. There are countless scenarios where you want to be able to reproduce the exact input, and often preserving the input is the simplest solution.

But yes, usually you do want to split something into its elemental components, should it have any.

Thanks for that explanation! I hadn't appreciated that aspect of "parse, don't validate," before.

But even with that understanding and from re-reading the post, that seems to be an extra safety measure rather than the essence of the idea.

Going back to my original example of parsing a Username and verifying that it doesn't contain any illegal characters, how does a parser convert a string into a more direct representation of a username without using a string internally? Or if you're parsing an uint8 into a type that logically must be between 1 and 100, what's the internal type that you parse it into that isn't a uint8?

Personally I don't think I would have used the phrase "parse don't validate" for something like a username. It isn't clear to me what it would mean exactly. I generally only think of this principle for data that has some structure, not so much a username or a number from 1-100.

An IP address would be about the minimum amount of structure. Another example would be processing API requests. You can take the incoming JSON and parse it as fully as possible, rather than just validating that it is as expected (for example, drop unknown fields).
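For the API case, something like this Python sketch is what I have in mind. The request shape (`name`, `age`) and the names `CreateUserRequest` and `parse_create_user` are hypothetical, just to show the pattern of copying known fields into a typed value instead of passing the raw dict along.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateUserRequest:
    # Hypothetical request shape, invented for this sketch.
    name: str
    age: int


def parse_create_user(body: str) -> CreateUserRequest:
    data = json.loads(body)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    name = data.get("name")
    age = data.get("age")
    if not isinstance(name, str) or not name:
        raise ValueError("name must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or age < 0:
        raise ValueError("age must be a non-negative integer")
    # Only the fields named above are copied; unknown keys stop here.
    return CreateUserRequest(name=name, age=age)
```

A body like `{"name": "ada", "age": 36, "is_admin": true}` parses fine, but `is_admin` never reaches the rest of the program because the result type has no place to put it.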

> Or if you're parsing an uint8 into a type that logically must be between 1 and 100, what's the internal type that you parse it into that isn't a uint8?

Just for the sake of example, your internal representation might start from 0, and you just add 1 whenever you output it.

Your internal type might also not be a uint8. E.g., in Python you would probably just use its default integer type, which supports arbitrarily big numbers. (Not because you need arbitrarily big numbers, but just because that's the default.)
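A quick Python sketch of the zero-based idea, purely as an example. `OneToHundred` is a made-up name, and nothing here prevents someone from constructing it directly with a bad offset; the intent is that all construction goes through `parse`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OneToHundred:
    # Stored zero-based (0..99) as a plain Python int.
    offset: int

    @classmethod
    def parse(cls, raw: int) -> "OneToHundred":
        if not 1 <= raw <= 100:
            raise ValueError(f"expected 1..100, got {raw}")
        return cls(raw - 1)

    def to_int(self) -> int:
        # Add the 1 back whenever the value is output.
        return self.offset + 1
```

So `OneToHundred.parse(1)` stores `0` internally and `to_int()` gives back `1`.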