Hacker News new | ask | show | jobs
by rnhmjoj 3297 days ago
The problem with working with strings in Haskell is that there are too many datatypes: Data.Text, Data.Text.Lazy, Data.ByteString Data.ByteString.Char8, Data.ByteString.Lazy, Data.ByteString.Lazy.Char8. All of them share the same function names so you have to do imports like

    import qualified Data.ByteString as BS
    import qualified Data.Text.Lazy as TL
    import qualified Data.Text as TS
and somehow the library you need always use a different ByteString variant from the one you already chose so you have to pack/unpack here and there. There should be a way to make `length`, `map` and all work on every string type. Maybe a type class or some better idea is needed.

By the way the link to the caesar cipher is broken.

5 comments

I know it's only a cure for the symptom, but there's the OverloadedStrings extension
Surely the proper way to do length is to have it be, essentially:

    length :: forall f a n. Foldable f => Semiring n => (f a) -> n
`length` is not specific to any one type and it's a complete waste to think of it as something to do with strings in general. They should all just be foldable and that'll give way more.

Likewise, they could all just be functors and `map` is free.

Are you sure that Semiring shouldn't be a Monoid?

What happens if you feel the urge to define several Semirings (or Monoids) on the data type?

I'm not sure I follow, sorry. The line says that we also pledge to produce a Semiring from this Foldable structure. Given it's a length function it's about as general a specification I think you can give it without becoming something completely different.
To the first question, a Semiring has two operations, + and × conventionally (along with ...). That seems to be more than length needs. A Monoid is just an operation and its identity, which should be exactly what you need for length.

For the second question, traditional (+,×,...) is a Semiring for integers; so is (&,|,...) (bitwise boolean operations). How can you define both Semiring implementations without getting them confused?

There is another reply to my comment that I haven't digested yet. Maybe it answers the second.

"A Monoid is just an operation and its identity, which should be exactly what you need for length."

And I'm wrong about that. You need a zero and a one, right. What would that structure be?

I've been experimenting with answering that question.

An extremely heavily-commented approach to defining equality for things like Double (where there are multiple sensible ways to compare things) will probably be a better introduction to these ideas:

https://github.com/mrkgnao/noether/blob/master/library/Noeth...

https://github.com/mrkgnao/noether/blob/master/library/Noeth...

The rest of the repo implements abstract algebraic structures along similar lines.

Here is what monoids look like in this formalism at present. It's a bit involved: first I use a fine-grained separation into highly polymorphic Magma and Neutral classes:

https://github.com/mrkgnao/noether/blob/master/library/Noeth...

And there are "strategies" for combining those pieces of structure, whose use looks like

  type instance MonoidS (op :: BinaryNumeric) Rational 
    = DeriveMonoid_Semigroup_Neutral op Rational
Which means "derive the preferred monoid structure with operation op" (MonoidS op) on Rational from the preferred semigroup structure and preferred neutral element on it with that operation.

https://github.com/mrkgnao/noether/blob/master/library/Noeth...

You can always use some other monoid structure for the same combination of type and operation (!) if you like. For instance, here's a demo of using the self-action of a ring on itself (i.e. considering a ring as a module over itself) to compute something, even if your preferred choice of action has been declared to be something else (or hasn't been defined at all).

https://github.com/mrkgnao/noether/blob/master/library/Noeth...

This says "here, use the self-action derived from the multiplicative magma structure on R".

That is way more advanced Haskell than I am familiar with, but it does look like what I was wondering about. Thanks!
I'm not familiar with UTF-8, which -- I believe -- is what Data.Text is used to represent. Do all UTF-8 strings have a well-defined length?

A list of ASCI chars obviously does (the length of the list), but I'm not sure about Data.Text.Text.

Yes, I'm sure the length of a UTF-8 string is something tricky but Data.Text has a length function nonetheless. The APIs are very similar if not identical. You can see it here:

https://hackage.haskell.org/package/text-1.2.2.2/docs/Data-T...

https://hackage.haskell.org/package/bytestring-0.10.8.1/docs...

Of course, a real string type would store length along with the characters, making the length function trivial and fast.
It's hard to say what length should be because it could mean different things. How many u8 values there are? How many graphemes are present?

If it's just counting how many u8's in a slice, that's trivial. But storing the total number of variable-length graphemes isn't. I assume that's why Data.Text's length function is O(n), and other languages that have UTF-16 or UTF-8 strings also have length functions that are linear.

Graphemes? Strings can be expected to be transformed so that they contain the "right" type of characters (in particular, plain letters followed by the respective combining diacritical marks vs a smaller number of precomposed accented letters) before asking about their length, which is how many characters they contain. If you want to count "graphemes" you can have a different function and possibly a separate cached result.

This in theory. It is readily apparent from the Data.Text page at https://hackage.haskell.org/package/text-1.2.2.2/docs/Data-T... that the authors care about performance only selectively ("fusion" of buffer allocations is fun, relatively messy data structures to cache important data such as string length are not fun), and that they don't care enough about Unicode to add serious string-level abstractions over the Unicode tables exposed in the Char type.

Well, how long is the sequence "a<Backspace>". It's two characters and one, two or zero letters?
This has been covered a few times before. See this for example: https://mathiasbynens.be/notes/javascript-unicode

Data.Text seems to recognize symbols made of several codepoints (like emoji) as one but still counts diacritics and combining characters as different symbols.

Data.Text is represented as a UTF-16 vector
This is what the backpack feature landing in GHC 8.2 will allow you to do
As far as I can see, backpack isn't going to solve converting between string types, unless you want to use the same string type for everything your application does (e.g. sending/receiving bytes over the network and displaying a UI component with text in it).

Or are you suggesting that we specialize the TCP receive function to only receive UTF-8 strings, so we don't have to convert when displaying to the user?

With Backpack most of the time you would not have to think about the type of the string.

See: http://blog.ezyang.com/2016/09/the-base-of-a-string-theory-f...

Doesn't mono-traversable/classy-prelude solve this?