by specialist 1329 days ago

> a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4)

Agreed. My future perfect programming language has the predefined types 'ascii', 'utf-8', 'url', 'base64', etc. for misc kinds of character sequences.

Just like how raw bits are different from numerals: byte vs short, word vs int, 64 bits vs double, etc.
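To make the bits-vs-numerals distinction concrete, here is a minimal Python sketch using the standard `struct` module. The byte values are just an example I picked; the point is that the same 8 raw bytes carry no numeric meaning until you choose an interpretation.

```python
import struct

# Eight raw bytes. On their own they are just a 64-bit chunk of data.
raw = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xF0, 0x3F])

# Interpret the chunk as a little-endian signed 64-bit integer.
as_int = struct.unpack("<q", raw)[0]

# Interpret the very same chunk as a little-endian IEEE 754 double.
as_double = struct.unpack("<d", raw)[0]

print(hex(as_int))   # the integer reading
print(as_double)     # the floating-point reading
```

Same bits, two unrelated numerals, which is why it seems worth giving the raw chunk its own type.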

(Anyone have a better naming system for 8-, 16-, 32-, and 64-bit chunks of raw data? 'byte', 'word', 'doubleword', 'quadword'?)

Per this "ridonkulously hard" OC article, I'll also ponder predefined types for raw 'html5', 'json', etc (as in unparsed, char sequence vs DOM).

> Perl was early with its concept of “tainted” strings.

Not being a Perl dev, I'm unfamiliar with "taint". Quickly found articles like this: https://www.geeksforgeeks.org/perl-taint-method/

In my future perfect language, char seqs cannot be cast. They must be converted. Basically syntactic sugar for Java-style char encoding infrastructure.
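A rough sketch of what "converted, not cast" could look like, written in Python since my future perfect language doesn't exist yet. The names `AsciiText`, `Utf8Text`, `to_utf8`, and `utf8_to_ascii` are all made up for illustration. Each type validates its bytes on construction, and the only way from one type to another is an explicit conversion that is allowed to fail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utf8Text:
    """Bytes that have been verified to be valid UTF-8 (hypothetical type)."""
    data: bytes

    def __post_init__(self):
        # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
        self.data.decode("utf-8")


@dataclass(frozen=True)
class AsciiText:
    """Bytes that have been verified to be 7-bit ASCII (hypothetical type)."""
    data: bytes

    def __post_init__(self):
        # Raises UnicodeDecodeError if any byte is outside the ASCII range.
        self.data.decode("ascii")

    def to_utf8(self) -> Utf8Text:
        # Always succeeds: every ASCII sequence is also valid UTF-8.
        return Utf8Text(self.data)


def utf8_to_ascii(text: Utf8Text) -> AsciiText:
    # A real conversion, not a relabeling: it raises UnicodeEncodeError
    # when the text contains characters ASCII cannot represent.
    return AsciiText(text.data.decode("utf-8").encode("ascii"))


greeting = AsciiText(b"hello")
widened = greeting.to_utf8()        # safe direction
narrowed = utf8_to_ascii(widened)   # checked direction
```

The widening direction can't fail and the narrowing direction can, and neither one lets you simply relabel the bytes.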

I had assumed that disallowing casting was sufficient. But now I'll have to ponder "taint" too. From the hip, I really like the notion of tracking the provenance of data, à la defensive programming.
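Thinking out loud, provenance tracking might look something like the sketch below. It is only loosely modeled on my reading of Perl's taint mode, where outside data has to pass a pattern match before it can be used somewhere sensitive. `Tainted`, `untaint`, and `run_query` are hypothetical names, not anyone's real API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """A string from outside the program, tagged with where it came from."""
    value: str
    source: str  # provenance, e.g. "query string" or "env var"

    def untaint(self, pattern: str) -> str:
        # The only way out: the whole value must match an allow-list pattern.
        match = re.fullmatch(pattern, self.value)
        if match is None:
            raise ValueError(f"value from {self.source} failed validation")
        return match.group(0)


def run_query(table: str) -> str:
    # Sensitive sink (hypothetical): refuses anything still marked tainted.
    if isinstance(table, Tainted):
        raise TypeError("tainted data must be untainted before use")
    return f"SELECT * FROM {table}"


user_input = Tainted("users", source="query string")
query = run_query(user_input.untaint(r"[a-z_]+"))
```

Passing `user_input` straight to `run_query` raises `TypeError`, so the check can't be skipped by accident. A static type system could presumably reject that call at compile time instead.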

Great idea. Thanks.