I would think stacking 10 generic conditions wouldn't scale if you are trying to mix and match arbitrarily. If you are trying to mix NonEmpty and AllEven and AllGreaterThan100 for the List example, then you would get the combinatorial explosion of types.
In practice, I find it is usually something like 'UnvalidatedCustomerInfo' being parsed into a 'CustomerInfo' object, where you can validate all of the fields at the same time (phone number, age, non-null name etc.). Once you have parsed it into the internal 'CustomerInfo' type - you don't need to keep re-validating the expected invariants everywhere this type is used. There was a good video on this that I wish I could find again, where the presenter gave the example of using String for a telephoneNumber field instead of a dedicated type. Any code that used the telephoneNumber field would have to re-parse and validate it "to be safe" it was in the expected format.
The topic of having untrusted external types and internal trusted types is also explained in the book `Domain Modeling Made Functional` which I highly recommend.
> If you are trying to mix NonEmpty and AllEven and AllGreaterThan100 for the List example, then you would get the combinatorial explosion of types.
This is overthinking it. Usually, when people are not used to doing constructive data modeling, they get caught up on this idea that they need to have a datatype that represents their data in some canonical representation. If you need a type that represents an even number, then clearly you must define a type that is an ordinary integer, but rules out all odd numbers, right?
Except you don’t have to do that! If you need a number to always be even (for some reason), that suggests you are storing the wrong thing. Instead, store half that number (e.g. store a radius instead of a diameter). Now all integers are legal values, and you don’t need a separate type. Similarly, if you want to store an even number greater than 100, then use a natural number type (i.e. a type that only allows non-negative integers; Haskell calls this type `Natural`) and store half that number minus 102. This means that, for example 0 represents 102, 1 represents 104, 2 represents 106, 3 represents 108, etc.
If you think this way, then there is no need to introduce a million new types for every little concept. You’re just distilling out the information you actually need. Of course, if this turns out to be a really important concept in your domain, then you can always add a wrapper type to make the distinction more formal:
newtype EvenGreaterThan100 = EvenGreaterThan100 Natural
evenGreaterThan100ToInteger :: EvenGreaterThan100 -> Integer
evenGreaterThan100ToInteger (EvenGreaterThan100 n) = (toInteger n * 2) + 102
integerToEvenGreaterThan100 :: Integer -> Maybe EvenGreaterThan100
integerToEvenGreaterThan100 n
| n < 100 = Nothing
| otherwise = case n `quotRem` 2 of
(q, 0) -> Just (EvenGreaterThan100 q)
(_, _) -> Nothing
Of course, this type seems completely ridiculous like this, and it is. But that’s because no real program needs “an even number greater than one hundred”. That’s just a random bag of arbitrary constraints! A real type would correspond to a domain concept, which would have a more useful name and a more useful API, anyway.
You can go as far (or as short) as the application warrants. The more static evidence you want, the more cumbersome it's going to become. That seems sort of a natural progression, but the the point is that you get to choose the trade-off. It's often easier to prove a special case than a more general case.
(Certainly, Haskell is probably not the most concise language for this kind of thing. LiquidHaskell adds interesting proof capabilities wrt. arithmetic.)
Regardless, even just parsing at the boundary and using an opaque type
MySuperSpecialInt = Int @Even Min(2) Max(100)
(or whatever syntax you want) is still better than just using Int. At least you'll know that you'll always be handed a value that's in range (post-parsing).
Yes, certainly. Constructive data modeling is useful for many things, but it’s not a panacea. Other techniques are useful in cases where it’s not practical; I discuss some of the tradeoffs in this followup post: https://lexi-lambda.github.io/blog/2020/11/01/names-are-not-...
You would also run into a form of "type erasure" in most cases and would need a combinatorial explosion of methods unless the type system was able to 'decorate' the generic parameter with additional information.
e.g. imagine a `parseNonEmpty` and a `parseAllEven` method. Both take a list and return either a `NonEmpty` or `AllEven` type. If I call `parseNonEmpty` and get a `NonEmpty`, then pass to `parseAllEven`, I now have erased the `NonEmpty` as I'm left with an `AllEven`. I would need a `parseAllEvenFromNonEmpty` to take a `NonEmpty` and return a `AllEven & NonEmpty`.
In Zig at least, the types-as-values framework means it'd be pretty easy to support arbitrary mixing and matching. If nobody beats me to it, I'll make a library to remove the boilerplate for that idea this weekend.
> In practice, I find it is usually something like 'UnvalidatedCustomerInfo' being parsed into a 'CustomerInfo' object, where you can validate all of the fields at the same time
In my experience you usually can't validate them all at the same time. For example, address. You usually don't validate that until after the customer has selected items, and then you find out that some items won't deliver to their area, so whereas you previously had a Valid basket, now it's an Invalid state.
Somebody somewhere in system will be checking. At the boundary likely scales the best because you dont burn cycles that otherwise would be distributed and likely redundant in system. If you have a massive model, its feasible that it makes sense to defer partial/subparsing?
Parser combinators seem to be pretty rippin fast for the most part, at least those ive used in ocaml and rust.
I think what you're getting at is that it seems ponderous to have types named things like NonEmptyListWhereTheThirdElementIsTheIntegerFourAndTheOtherElementsAreStringsOfLengthSixOrUnder and the answer is that you shouldn't do that, but instead name it something in the problem domain (of whatever the program is about) like WidgetDescription or whatever.
And naming is actually a valuable activity. Knowing this is not merely NonEmptyListWhereTheThirdElementIsTheIntegerFourAndTheOtherElementsAreStringsOfLengthSixOrUnder but actually WidgetDescription is a valuable insight.
Deciding this thing is specifically a WidgetDescription, not a Widget or a WidgetLabel, or a WidgetAssociatedText and definitely not a ThingyDescription, can help both users and other developers produce a mental model of what's going on that results in a better experience for everyone.
No the trickier problem is that without dependent types you are forced into a very specific, linear chain of validation or else deal with a combinatorial explosion of functions and types.
To take your type as an example, you could imagine a function
validation : String -> Maybe FinalWidget
but maybe `validation` is really big and unwieldy and you want to reuse parts of it elsewhere so you break it down into a pipeline of
-- Let's say a RawWidget is, operationally, a non-empty string
validation0 : String -> Maybe RawWidget
-- Let's say a RefinedWidget is a string consisting only of capital letters
validation1 : RawWidget -> Maybe RefinedWidget
-- A FinalWidget is a non-empty string of capital letters that has no whitespace
validation2 : RefinedWidget -> Maybe FinalWidget
This is over-constrained. You don't really want to force yourself into a scenario where you must call validation0, then validation1, and finally validation2 because maybe in another code path it's more expedient to do it in another order. But the types don't line up if you do it in another order. And maybe you don't really care about `RawWidget` and `RefinedWidget`, but you're forced to create them just to make sure that you can build up to a `FinalWidget`.
This is where dependent types would really help relax those constraints.
I don't disagree that dependent types would help (and be really cool for lots of other uses!), but let's consider what the usual validation rules that we really need are: non-empty, basic interval constraints (non-negative/positive), only contains a certain set of characters... simple stuff like that, usually. If we're going wild, an interesting case would be effectful validation and how that fits in. In practice, what happens with any non-basic validation is that the server says 3xx, try again.
Anyway, validation/parsing is mostly pretty simple stuff where the "validate" bit is a simple function... and function composition works just fine.
(Assuming you can name the result type of your parse/validate individually according to your domain.)
I do think you can... just via phantom type parameters and type-level programming. In Scala you'd probably use Refined.
(But I'm not expert, admittedly, and I isn't an actual problem of much consequence in practical programming in Haskell or Scala. Opaque types do the 80% bit of 80-20 just fine.)
You can't with phantom type parameters and type-level programming alone, although you can get close. Scala's and Haskell's Refined both don't let you do what I'm thinking of.
You can get very close with type-level sets although at this point compile times probably go through the roof. You're basically emulating row types at this point.
def wrapIntoRefined(str: String): Refined[String, Unit]
def validate0[A](str: Refined[String, A]): Either[Error, Refined[String, And[Condition0, A]]]
def validate1[A](str: Refined[String, A]): Either[Error, Refined[String, And[Condition1, A]]]
// This requires ordering Condition0 before Condition1 but if we resorted
// to a type-level set we could get around that problem
def process(input: Refined[String, And[Condition1, And[Condition0, Unit]]]): Unit
// But linearity is still required in some sense. We can't e.g. do our checks
// in a parallel fashion. You still need to pipe one function right after another
The central problem is if you have two validation functions
if you try to recombine them downstream, you don't know that `Refined[String, Condition0]` and `Refined[String, Condition1]` actually refer to the same underlying `String`. They could be refined on two completely separate strings. To tie them to a single runtime String requires dependent types.
You can approximate this in Scala with path-dependent types, but it's very brittle and breaks in all sorts of ways.
> isn't an actual problem of much consequence in practical programming in Haskell or Scala. Opaque types do the 80% bit of 80-20 just fine.
I think this is only true because there isn't a production-ready dependently typed language to show how to use these patterns effectively. In much the same way that "parse don't validate" isn't really much of a problem of consequence in older style Java code because sum types aren't really a thing, if there was an ergonomic way of taking advantage of it, I firmly believe these sorts of dependently typed tagged types would show up all over the place.
At least in Haskell this can be expressed as type classes. For each condition, you can create a (possibly empty) type class to guarantee that the condition is met. Then the call site type class constraints will be checked at compile time.
This doesn't really work because you still end up needing specific types for the output of every "parser" you have and then you still need a way of combining those types together.
Or you get the ability to forge evidence (e.g. you use the evidence provided by a parser for one integer as evidence for another).
This works better for dependency injection scenarios (the Has* pattern).
function processUserInput(input: String, requirement: InputReqs[input]): Unit = ...
type InputReqs[input] = {
// We'll say that a Proposition takes Boolean expressions and turn them into types
notTooLong: Proposition[length(input) < 128],
authorIsNotBlocked: AuthorIsNotBlocked[input],
sanitized: Sanitized[input],
...
}
where you might have the following functions (which are all examples of parse don't validate)
function checkIfAuthorIsBlocked(author: String, input: String): Maybe[AuthorIsNotBlocked[input]] = ...
// Create a pair that contains both the sanitized String and a tag that it has been sanitized
function sanitizeString(input: String): (output: String, Sanitized[output]) = ...
where just by types alone I know that e.g. length checking must occur after sanitization (because sanitizeString generates a new `output` that is distinct from `input`) and don't have to write down in docs somewhere that you might cause a bug if you check lengths before sanitization because maybe sanitization changes the length of the input.
Note that this is also strictly stronger than a simple precondition/postcondition system or some sort of assertion system because properties of the input that we care about may not be observable at runtime/from the input alone (e.g. AuthorIsNotBlocked can't be asserted based only on input: you'd have to change the runtime representation of input to include that information).
In practice, I find it is usually something like 'UnvalidatedCustomerInfo' being parsed into a 'CustomerInfo' object, where you can validate all of the fields at the same time (phone number, age, non-null name etc.). Once you have parsed it into the internal 'CustomerInfo' type - you don't need to keep re-validating the expected invariants everywhere this type is used. There was a good video on this that I wish I could find again, where the presenter gave the example of using String for a telephoneNumber field instead of a dedicated type. Any code that used the telephoneNumber field would have to re-parse and validate it "to be safe" it was in the expected format.
The topic of having untrusted external types and internal trusted types is also explained in the book `Domain Modeling Made Functional` which I highly recommend.