Hacker News new | ask | show | jobs
Utf8rewind changes for 1.2.1 (bitbucket.org)
40 points by knight666 3960 days ago
8 comments

To take this another step further, you can look at generative, or property based testing. You specify properties of your input data, e.g. A sequence of 0 or more Unicode characters, and the properties that should hold, e.g. It never crashes, the length stays the same after processing, e.t.c. Then the generator will generate ever more complicated test cases to try to break the invariants, if it does then it will backtrack to find the smallest failing test case.

I find generative tests take a lot longer to think about and write, but are invaluable for flushing out all the corner cases. There are libraries for lots of languages, Haskell, Clojure, Python, Erlang, and others too I'm sure.

This is a good starting point to find such tools:

"QuickCheck is a combinator library originally written in Haskell, designed to assist in software testing by generating test cases for test suites. It is compatible with the GHC compiler and the Hugs interpreter.

In QuickCheck the programmer writes assertions about logical properties that a function should fulfill. Then QuickCheck attempts to generate test cases that falsify these assertions. The project was started in 1999. Besides being used to test regular programs, QuickCheck is also useful for building up a functional specification, for documenting what functions should be doing, and for testing compiler implementations.[1]

Re-implementations of QuickCheck exist for C,[2][3] C++,[4][5][6] Chicken Scheme,[7] Clojure,[8][9][10] Common Lisp,[11] D,[12] Elm, [13] Erlang, F#,[14] Factor,[15] Io,[16] Java,[17][18][19] JavaScript,[20] Node.js,[21] Objective-C,[22] OCaml,[23] Perl,[24] Prolog,[25][26] Python,[27] R,[28] Ruby,[29] Rust,[30] Scala,[31] Scheme,[32] Smalltalk,[33] Standard ML[34] and Swift.[35]"

https://en.wikipedia.org/wiki/QuickCheck

Fuzzing is also cool, or lifting test cases from other implementations, chances are corner cases for those are also corner cases for yours.
Alright, this is yet another example of a title change that goes from a good title to a bad title.

If I hadn't have seen the article before the title was changed, I wouldn't have read it. "Utf8rewind changes for 1.2.1" isn't exactly the best title.

Please change it back (to "Utf8rewind 1.2.1 adds 750 new unit tests for its 13 functions"). That title actually indicates what is interesting about the article, and why it was submitted, not just "here's a random changelog for a random version of a random small library that no-one has ever even seen". HN's title policy seems to have become worse and worse over time, and this is a good (bad?) example thereof.

Do they use any fuzzing as part of their testing? UTF-8 parsing seems like an ideal candidate for this kind of bug hunting.
Author here.

The library currently doesn't employ any kind of fuzzing. I rely on multiple tests for every kind of input. I've found that is a pretty reliable way to test for "unknown unknowns", even if it's extremely time-consuming.

Adding testcase fuzzing is definitely something to consider though, because it would most likely have found the very issues this release fixes.

I recommend trying fuzz testing, it can be very effective (see my experiences here: http://forwardscattering.org/post/21) Utf8-rewind looks like a nice library, I have looked into it before when researching unicode support. If I need to do more complex Unicode stuff in my language I will probably use it.
I think you have a bug in codepoint_read(). The last argument is a pointer to variable unicode_t named decoded. If the first if statement in that function is true, then the a value is never assigned to that variable, the function returns, and immediately outside the function the value of that variable is read. But that variable was never initialized.

https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...

https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...

https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...

The prior while statement makes sure the second parameter src_size is never 0, but the pointer src might be.

Undefined behavior can be caused by passing a NULL pointer as the first parameter and a non-negative second parameter to utf8toutf16().

This is actually guarded against in a macro:

    #define UTF8_VALIDATE_PARAMETERS(_inputType, _outputType, _result) \
        if (input == 0) { \
            UTF8_SET_ERROR(INVALID_DATA); \
            return _result; \
        } \
If a NUL parameter is passed to utf8toutf16 and friends, they'll return UTF8_ERR_INVALID_DATA.

However, you are correct about codepoint_read's behavior. I actually have a test for passing NUL to the function, but the value of decoded isn't checked:

    TEST(CodepointRead, InvalidData)
    {
        const char* i = nullptr;
        size_t il = 0;
        unicode_t o;

        EXPECT_EQ(0, codepoint_read(i, il, &o));
    }
It's definitely a bug, but relatively harmless. The function is only used internally and the calling sites guard against that kind of attack.

Nevertheless, I thank you for the extra set of eyeballs. I have been looking at these functions far too long, you tend to gloss over obvious problems after a while.

I'll be sure to get a fix in for the next release. :)

    StreamState stream[4];
    // many lines follow
    memset(stream, 0, 4 * sizeof(StreamState));
It bugs the hell out of me to see array bounds and type names repeated like this. Why not this?

    memset(&stream, 0, sizeof(stream));
Some other nitpicks:

* typedefs ending in _t are reserved by POSIX, though a lot of people ignore this.

* Have you thought of breaking into multiple source files? One benefit is when linking statically you won't pull in the whole thing for just one function. Might read better too, there is a lot going on in that one source file.

And many C Standard types have a _t suffix. It really shouldn't be used.

There are multiple source files.

> There are multiple source files

Ah you are right, I guess I was looking only at public api wrappers and missed the "internal" dir. I might add that as further commentary that people might gloss over a dir called "internal" and miss the bulk of the library. :P

Wow, this is interesting. I have never heard of it before. How popular/tested is this library?

Will try to build it for my project that deals with unicode.

You've probably never heard of it because it has 136 downloads in total. ;)

It's very thoroughly tested but it hasn't been used in any serious projects yet.

Is the Turkish case conversion working?

edit: it seems it is: https://bitbucket.org/knight666/utf8rewind/pull-requests/1/t...

No, currently Turkish, Greek, Lithuanian and Azeri case mappings are not always grammatically correct. This is because utf8rewind currently does not take the system locale into account when case mapping. Fixing these issues is planned for a future release.
You can consider also set system-independent locale, e.g. "set_turkisk_mode" (I had that problem, too), etc. I thought that the only case conversion exception as the Turkish case. Do you remember which cases are an exception for Greek Lithuanian and Azeri? Also, I know that also German has some non-bijective cases ("ß" -> SS).

In case you want to save space in tables, you can opt for encoding ranges in the code, e.g. check sc_tolower()/sc_toupper() into: https://github.com/faragon/libsrt/blob/master/src/schar.c

You can find all case folding codepoints on the Unicode Consortium website: ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt and ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt.

As you get down the list you'll notice what a pain in the ass the special cases are. There's a special case for the final sigma in a Greek word:

    03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
You must remove the dot from "i" when upper or titlecasing... but only in Lithuanian:

    0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE
Etc. etc. By the way, my implementation for case mapping started out similarly to yours, but I ultimately solved the problem using a binary search in a huge look-up table: https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...

Unicode case mapping is just a huge mess of exceptions, but that's more the humans' fault than the standard.

Thank you! :-)
Is the documentation available somewhere without having to download it first?
It wasn't before, but I looked up how to host static HTML on Bitbucket. You can now read the documentation online here:

http://knight666.bitbucket.org/utf8rewind/