by lobster_johnson 4042 days ago
Isn't the Unicode situation in OCaml more or less the same as in Erlang and Ruby 1.8, i.e. "string" is just a byte string, and there's no native encoding support?

Last I checked, there was decent third-party library support in Batteries. I imagine it would be painful if you were to use Batteries' "UTF8.t" string type and had to interface with some other library that used "string" or some other string solution (like Camomile)?

2 comments

There's no built-in encoding/decoding support, i.e. you need a library such as Batteries, Camomile, or uutf/uucp if you want to do something like capitalise, split, or count characters.
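
To make the "just a byte string" point concrete, here is a minimal standard-library-only sketch. The helper utf8_length is hypothetical (it is not from Batteries, Camomile, or uutf/uucp), and it assumes the input is already valid UTF-8; it does no validation.

```ocaml
(* Count Unicode code points in a string assumed to hold valid UTF-8.
   Each code point contains exactly one byte that is NOT a continuation
   byte (continuation bytes have the bit pattern 10xxxxxx). *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

let () =
  (* "Åge" spelled with explicit UTF-8 bytes: Å (U+00C5) is 0xC3 0x85. *)
  let s = "\xC3\x85ge" in
  (* String.length counts bytes, so the two numbers differ. *)
  Printf.printf "bytes: %d, code points: %d\n" (String.length s) (utf8_length s)
```

Counting code points is still not the same as counting user-perceived characters (combining marks and so on), which is where the libraries above come in.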

Writing the appropriate glue isn't very hard: the interfaces either work with bytes or have to/from-bytes functions. But I suppose it's a bit annoying (at least when first starting out with the language) to have to figure out which lib is needed for which type of string operation. E.g. if you're into Batteries, you still need Camomile (or uucp) for lowercasing:

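    (* Camomile provides the Unicode-aware case mapping. *)
    module C = CamomileLibraryDefault.Camomile
    module CM = C.CaseMap.Make(C.UTF8)
    (* Batteries provides the UTF-8 string type and code-point indexing. *)
    module U = Batteries.UTF8
    
    (* Take the first code point of a UTF-8 byte string and lowercase it. *)
    let lower_initial bytes =
      U.sub (U.of_string_unsafe bytes) 0 1
      |> U.to_string_unsafe
      |> CM.lowercase
    
    let () =
      lower_initial "Åge" |> print_endline (* prints "å" *)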

That's pretty horrible. Thanks for the explanation.
Erlang has no string type. Most strings are lists of integers (of any size): you can put Unicode code points there if you want, or integers up to 255 if you prefer. There is also a binary/bitstring type, which is a sequence of bits (if the length is a multiple of 8, it's a binary). You can put whatever you want in a binary; it's just bytes.

If you'd like things encoded in some way, that's up to you; there is no type to help you (there is a unicode module which can help convert between encodings).
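
For comparison with the OCaml side of the thread, the two representations described above (a flat run of bytes versus a list of integers holding code points) can be sketched in OCaml as well. This is only an illustration: code_points_of_utf8 is a hypothetical helper, not an Erlang or OCaml library function, and it assumes well-formed UTF-8 (no validation; truncated input raises Invalid_argument).

```ocaml
(* Decode a string assumed to be well-formed UTF-8 into a list of code
   points, loosely like turning an Erlang binary into a list of integers. *)
let code_points_of_utf8 (s : string) : int list =
  let len = String.length s in
  let byte i = Char.code s.[i] in
  (* Fold [k] continuation bytes, starting at index [i], into [acc]. *)
  let rec cont acc i k =
    if k = 0 then acc
    else cont ((acc lsl 6) lor (byte i land 0x3F)) (i + 1) (k - 1)
  in
  let rec go i out =
    if i >= len then List.rev out
    else
      let b = byte i in
      (* The lead byte tells us how many continuation bytes follow. *)
      let extra, lead =
        if b < 0x80 then (0, b)
        else if b < 0xE0 then (1, b land 0x1F)
        else if b < 0xF0 then (2, b land 0x0F)
        else (3, b land 0x07)
      in
      go (i + 1 + extra) (cont lead (i + 1) extra :: out)
  in
  go 0 []

let () =
  (* "Åge" as explicit UTF-8 bytes: four bytes in, three integers out. *)
  code_points_of_utf8 "\xC3\x85ge"
  |> List.iter (fun cp -> Printf.printf "U+%04X\n" cp)
```

The byte string and the integer list carry the same text; neither type records which encoding is in use, which is the point being made about both languages.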