| HN Mirror


by jameshart 1600 days ago

I'd go further and argue that in general reversing a string isn't possible or meaningful.

It's just not a thing people do, so it's just... not very interesting to argue about what the 'correct' way to do it is.

Similarly, any argument over whether a string has n characters or n+1 characters in it is almost entirely meaningless and uninteresting for real world string processing problems. Allow me to let you into a secret:

there's never really such a thing as a 'character limit'

There might be a 'printable character width' limit; or there might be a 'number of bytes of storage' limit. Which means interesting questions about a string include things like 'how wide is it when displayed in this font?' or 'how many bytes does it take to store or transmit it?'... But there's rarely any point where, for a general string, it is really interesting to know 'how many characters does the string contain?'
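To make the distinction concrete, here is a rough Python sketch (not from the original comment) that measures one string three ways. The helper names `storage_bytes` and `display_cells` are made up for illustration, and the width calculation is only a crude terminal-column approximation, not real font measurement.

```python
import unicodedata


def storage_bytes(s: str, encoding: str = "utf-8") -> int:
    """Number of bytes needed to store or transmit s in the given encoding."""
    return len(s.encode(encoding))


def display_cells(s: str) -> int:
    """Approximate terminal columns: combining marks take 0 cells,
    wide/fullwidth East Asian characters take 2, everything else 1.
    This is an assumption-laden estimate, not a font-aware measurement."""
    cells = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks stack on the previous character
        cells += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cells


decomposed = "cafe\u0301"  # "café" spelled with e + COMBINING ACUTE ACCENT
print(len(decomposed), storage_bytes(decomposed), display_cells(decomposed))
```

The three numbers disagree for the same visible word, and the code point count is the one least likely to correspond to any real limit.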

Processing direct user text input is the only situation where you really need a rich notion of 'character', because you need a clear sense of what will happen if the user moves a cursor using a left or right arrow, exactly what will be deleted when a user hits backspace, and what will be copied, cut, or pasted when they operate on a selection. The ĳ ligature might be a single glyph, but is it a single character? When does it matter? Probably not at all unless you're trying to decide whether to let a user put a cursor in the middle of it or not.

And next to that, arguing that there is such a thing as a 'correct' way to reverse "Rĳndæl" according to a strict reading of Unicode glyph composability rules seems like a supremely silly thing to try to do.
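A small Python illustration of why this goes wrong (a sketch, not from the original comment): reversing by code point moves a combining mark onto a different base letter, so the "reverse" of a word depends on which of two equivalent encodings it happened to arrive in. `naive_reverse` is a hypothetical helper name.

```python
import unicodedata


def naive_reverse(s: str) -> str:
    """Reverse the sequence of code points, ignoring what they mean."""
    return s[::-1]


# "niño" written as n + COMBINING TILDE (decomposed form).
decomposed = "nin\u0303o"
# The same word with the precomposed ñ (NFC form).
composed = unicodedata.normalize("NFC", decomposed)

# After reversal the tilde follows 'o' instead of 'n', so it renders on the 'o'.
from_decomposed = unicodedata.normalize("NFC", naive_reverse(decomposed))
from_composed = naive_reverse(composed)

print(from_decomposed)  # tilde has migrated to the o
print(from_composed)    # tilde is still on the n
print(from_decomposed == from_composed)
```

Two inputs that a reader would call the same string produce two different "reversed" strings, which is the sense in which the operation is not well defined for general text.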

I'd much rather, when asked to reverse a string, more developers simply said 'that doesn't make sense, you can't arbitrarily chunk up a string and reassemble it in a different order and expect any good to come of it'.

1 comments

jerf 1600 days ago

Boy, that's implicitly a good question... when's the last time I "reversed" a string, on purpose, for something useful?

It took me a bit, but I think I have an answer. It's about 15 years ago. I didn't actually do the original design, but I perpetuated it and didn't remove it. We reversed domain name strings (which, given that they are a subset of ASCII, actually is a well-defined operation) so that the DB we're using, which supported efficient prefix lookups but not suffix lookups, could be used to efficiently query for all subdomains of a given domain, by reversing the domain and using that as the prefix.

I mean this as strong support for your point, not a contradictory "gotcha". I'm a big believer in not doing lots of work to save effort or make correct something you do less than once a decade, e.g., http://www.jerf.org/iri/post/2954 . And it's not even a gotcha anyhow, because we aren't reversing a general string; we were reversing a string very tightly constrained to a subset of ASCII where the operation was fully well-defined. I can't think of when I ever reversed a general string.
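For readers who haven't seen the trick, here is a minimal Python sketch of the idea. The original was Perl against a real database; this stands in a sorted in-memory list for the prefix-capable index, and `ReversedDomainIndex`, `reverse_ascii`, and `subdomains_of` are all names invented for the example.

```python
from bisect import bisect_left


def reverse_ascii(domain: str) -> str:
    """Reverse a domain name, refusing anything outside ASCII so the
    reversal stays a well-defined operation."""
    if not domain.isascii():
        raise ValueError(f"not an ASCII domain name: {domain!r}")
    return domain.lower()[::-1]


class ReversedDomainIndex:
    """Stores reversed names in sorted order, which turns a suffix query
    ("everything under google.com") into a prefix range scan."""

    def __init__(self, domains):
        self._keys = sorted(reverse_ascii(d) for d in domains)

    def subdomains_of(self, parent: str):
        # ".google.com" reversed is "moc.elgoog.", the prefix to scan for.
        prefix = reverse_ascii("." + parent)
        start = bisect_left(self._keys, prefix)
        found = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # sorted order: once the prefix stops matching, we are done
            found.append(key[::-1])
        return found


index = ReversedDomainIndex(
    ["www.google.com", "mail.google.com", "google.com",
     "notgoogle.com", "example.edu"]
)
print(index.subdomains_of("google.com"))
```

Including the leading dot in the prefix keeps `notgoogle.com` out of the results.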


jameshart 1600 days ago

Right - any case where you are reversing a string as part of some other operation you will have some goal in mind that is not simply 'produce the reverse of any arbitrary string'. Even if your goal is doing something like printing the crossword puzzle answers backwards at the bottom of the page, you have a tightly constrained set of possible characters so you can literally just throw an error if someone asks you to reverse a string containing a flag.
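As a sketch of what "just throw an error" could look like in Python: the allowed alphabet below (uppercase A to Z plus space) is an assumption chosen for the crossword example, and `reverse_crossword_answer` is a made-up name.

```python
import string

# Assumed alphabet for crossword answers; adjust for the real use case.
ALLOWED = frozenset(string.ascii_uppercase + " ")


def reverse_crossword_answer(answer: str) -> str:
    """Reverse an answer, rejecting input outside the constrained alphabet
    rather than guessing what reversing it would mean."""
    unexpected = sorted(set(answer) - ALLOWED)
    if unexpected:
        raise ValueError(f"cannot reverse, unexpected characters: {unexpected!r}")
    return answer[::-1]


print(reverse_crossword_answer("RIJNDAEL"))

try:
    # A flag emoji is two regional-indicator code points; refuse it.
    reverse_crossword_answer("GB \U0001F1EC\U0001F1E7")
except ValueError as err:
    print("rejected:", err)
```

Within the constrained alphabet every character is a single code point with no combining behaviour, so slicing is safe there and nowhere else.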

I actually should admit, for all my protesting above that you never need to do this, I did once implement something that "required", as part of the process, reversing a string. Once I share what it was, it should be apparent why I put scare quotes around "required".

We wanted to test and demonstrate the localization and unicode-readiness capabilities of our software, and to verify that every UI string was actually coming from the resource file for the selected locale, and handled in a unicode-safe way.

So I implemented a program that took in the en-GB resource file and output an en-AU one that contained all the original strings, just flipped upside down. This being, of course, the canonical way to localize a product for Australia.

And to turn a string upside down, you need to reverse the order of the characters, before mapping them to their unicode upside-down equivalent.

Unfortunately, the Unicode Consortium does not make available a comprehensive database of which glyphs are 180° rotations of other glyphs, so my solution ended up not having comprehensive coverage of all Unicode code points. Since my source data was en-US text, that wasn't very important; what was more important was that some of the resource strings used a 'safe subset' of HTML, so I needed to not turn <strong> into <ƃuoɹʇs>.
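A rough Python reconstruction of the idea, not the original program: the `FLIP` table is a small hand-picked mapping (the comment itself notes no official one exists), and the sketch assumes the 'safe subset' means only paired, attribute-free lowercase tags such as `<strong>` and `<em>`. All names here are hypothetical.

```python
import re

# Hand-picked, deliberately incomplete table of upside-down lookalikes.
FLIP = {
    "a": "ɐ", "b": "q", "d": "p", "e": "ǝ", "g": "ƃ", "h": "ɥ",
    "m": "ɯ", "n": "u", "p": "d", "q": "b", "r": "ɹ", "t": "ʇ",
    "u": "n", "w": "ʍ", "y": "ʎ",
}

# Assumption: tags are simple, paired, and carry no attributes.
TAG = re.compile(r"(</?[a-z]+>)")


def flip_text(text: str) -> str:
    """Reverse the characters and map each to its upside-down lookalike,
    leaving unmapped characters unchanged."""
    return "".join(FLIP.get(ch, ch) for ch in reversed(text))


def swap_tag(tag: str) -> str:
    """Turn an opening tag into its closing tag and vice versa."""
    return "<" + tag[2:] if tag.startswith("</") else "</" + tag[1:]


def flip_markup(s: str) -> str:
    """Flip the text runs but keep the tags readable. Reversing the order of
    the pieces puts closing tags first, so each tag is swapped to restore
    proper nesting."""
    pieces = TAG.split(s)
    return "".join(
        swap_tag(p) if TAG.fullmatch(p) else flip_text(p)
        for p in reversed(pieces)
    )


print(flip_markup("no <strong>rats</strong>"))
```

Void elements such as `<br>` or tags with attributes would need separate handling under this scheme.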

More than anything, it was probably that experience that gave me a true appreciation for what nonsense it is to try to break a string into characters and manipulate them.

(Also, while I do love the ingenuity of string reversal for suffix-based indexing, reversing a domain name for efficient prefix-based lookup can of course also be done by breaking the name up into subcomponents (thus not requiring you to care about character composition at all between dots), reversing the sequence of those parts and reassembling the string from the components in reverse order - which has the added benefit of preserving human readability of the domain name, and a natural sort order...)
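In Python the label-based version is about one line; `reverse_labels` is a name made up for this sketch, which does no validation of the labels themselves.

```python
def reverse_labels(domain: str) -> str:
    """Reverse the order of the dot-separated labels without touching the
    characters inside each label."""
    return ".".join(reversed(domain.split(".")))


names = ["www.google.com", "example.edu", "mail.google.com", "google.com"]
keys = sorted(reverse_labels(n) for n in names)
print(keys)

# All subdomains of google.com share the readable prefix "com.google."
print([k for k in keys if k.startswith("com.google.")])
```

Because the characters within a label are never reordered, this also works unchanged for labels containing non-ASCII text, and applying it twice returns the original name.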


jerf 1599 days ago

"reversing the sequence of those parts and reassembling the string from the components in reverse order"

Given that this was Perl and that's a small chunk of code, it's probably what I would have done in the same circumstance, but given that it already existed it wasn't worth shipping a migration out to the field with a new version. Generally humans didn't consult this table anyhow.

But it was good for a couple of good "wtf is that" faces from other developers the first time they looked at the DB, if nothing else. They get it pretty quickly; the preponderance of "moc." and "ude." is a dead giveaway, especially combined with some popular names ("moc.elgoog" almost sounds like a real domain Google might register someday). But still fun if you catch their face at the right moment.
