by grandinj
5302 days ago
That makes some string operations more expensive, because they may have to convert between representations. So you have the choice of converting at the source and paying the price there, or converting during processing and paying the price there.
|
One could have an abstract 'String' type with concrete subclasses (ANSIString, UTF8String, UTF16String, EBCDICString, etc.).
Assuming that any to-be-handled character strings can be round-tripped through UTF-8 (and that probably is a workable assumption), any function working with strings could initially be implemented as:
- convert input strings to some encoding that is known to be able to encode all strings (UTF8 or UTF16 are obvious candidates)
- do its work on the converted strings
- return strings in any format it finds most suitable
Profiling, one would soon discover that certain operations (for example, computing the length of a string) can be sped up by working on the native formats. One then could provide specific implementations for the functions with the largest memory/time overhead.
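To make that concrete, here is a rough Python sketch of the idea, not a real library. The class names (EncodedString, Latin1String and so on) and the choice of cp037 as the EBCDIC code page are my own assumptions for illustration. The base class does everything by converting to UTF-8 first; the subclasses override length() only where the native format allows a cheaper answer, and UTF16String deliberately stays on the generic path.

```python
class EncodedString:
    """Hypothetical base type: keeps the bytes in their native encoding."""

    codec = "utf-8"  # Python codec name; each subclass sets its own

    def __init__(self, raw: bytes):
        self.raw = raw

    def to_utf8(self) -> bytes:
        # Generic path: decode from the native encoding, re-encode as UTF-8.
        return self.raw.decode(self.codec).encode("utf-8")

    def length(self) -> int:
        # Generic fallback: convert first, then count code points.
        return len(self.to_utf8().decode("utf-8"))


class UTF8String(EncodedString):
    codec = "utf-8"

    def to_utf8(self) -> bytes:
        return self.raw  # already in the common encoding

    def length(self) -> int:
        # Count every byte that is not a continuation byte (10xxxxxx).
        return sum(1 for b in self.raw if (b & 0xC0) != 0x80)


class Latin1String(EncodedString):
    codec = "iso8859-1"

    def length(self) -> int:
        return len(self.raw)  # one byte per character


class EBCDICString(EncodedString):
    codec = "cp037"  # assumption: one particular EBCDIC code page

    def length(self) -> int:
        return len(self.raw)  # also one byte per character


class UTF16String(EncodedString):
    codec = "utf-16-le"  # no specialised length(): uses the generic path
```

Callers only ever see length(); whether it converts or works natively is an implementation detail that can change as profiling dictates.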
The end result _could_ be that one can write, say, a grep that can work with EBCDIC, UTF8 or ISO8859-1, without ever converting strings internally. For systems working with lots of text, that could decrease memory usage significantly.
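A minimal sketch of such a grep, under some assumptions of mine: the function name native_grep is made up, only fixed-string matching is handled, and Unicode normalization is ignored. The trick is to encode the pattern once into the encoding of the input and then compare bytes. That is safe for single-byte encodings and for UTF-8 (which is self-synchronizing), but not in general for UTF-16, so the sketch refuses anything outside a small whitelist.

```python
# Encodings for which a byte-level substring match cannot produce a
# match that starts or ends in the middle of a character.
BYTE_SEARCHABLE = {"utf-8", "iso8859-1", "cp037"}


def native_grep(lines, pattern, codec):
    """Return the lines (as bytes, untouched) that contain `pattern`.

    `lines` is an iterable of bytes objects, all encoded in `codec`.
    Only the pattern is converted; the lines are never decoded.
    """
    if codec not in BYTE_SEARCHABLE:
        raise ValueError("byte-level search is not safe for " + codec)
    needle = pattern.encode(codec)  # one conversion, done up front
    return [line for line in lines if needle in line]


ebcdic_lines = [s.encode("cp037") for s in ("hello world", "goodbye")]
hits = native_grep(ebcdic_lines, "world", "cp037")
print([h.decode("cp037") for h in hits])  # decoded only for display
```

The memory saving comes from never holding a second, converted copy of the input; the price is that every new encoding has to be vetted before it goes on the whitelist.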
Among the disadvantages of such an approach are:
- supporting multiple encodings efficiently will take significant time that, perhaps, is better spent elsewhere.
- the risk of obscure bugs increases ('string concatenation does not quite work if string a is EBCDIC, and string b is ISO8859-7, and a ends with rare character #x; somehow, the first character of b loses its diacritics in the result')
- a program/library that has that support will be larger. If a program works with multiple encodings internally, its working set will be larger.
- depending on the environment, the work (CPU time and/or programmer time) needed to call the 'correct for the character encoding' variant of a function can be too large (in particular, for functions that take multiple strings, it may be hard to choose the 'best' encoding to work with; if one takes function chains into account, the problem gets harder)
- it would not make text handling any easier, as programmers would, forever, have to keep specifying the encodings for the texts they read from, and write to, files and the network.
[That last one probably is not that significant, as I doubt we will get to the ideal world where all text is Unicode anytime soon (and even there, one still has to choose between UTF8 and UTF16, at the least)]
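The problem of picking the 'best' encoding for a function that takes several strings shows up even for plain concatenation. A sketch of one possible policy follows; the concat function and its (bytes, codec) return convention are hypothetical, and it assumes codec names are already normalized and that no byte-order marks are involved.

```python
def concat(a: bytes, a_codec: str, b: bytes, b_codec: str):
    """Return (bytes, codec) for a + b, converting as little as possible."""
    if a_codec == b_codec:
        # Same encoding: no conversion at all.
        return a + b, a_codec
    try:
        # Different encodings: try to express b in a's encoding.
        return a + b.decode(b_codec).encode(a_codec), a_codec
    except UnicodeEncodeError:
        # a's encoding cannot represent b: fall back to UTF-8 for both.
        text = a.decode(a_codec) + b.decode(b_codec)
        return text.encode("utf-8"), "utf-8"
```

Even this tiny policy has arbitrary choices baked in (why favour a's encoding over b's?), and the result encoding now depends on the data, which is exactly the kind of thing that makes chains of such functions hard to reason about.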
I am not aware of any system that has attempted this approach, but would like to hear about any that have.