Honest question: but isn't that just a broken implementation (and a very obvious brokenness at that)? It seems to me there's a big difference between someone not coding to the standard, and the standard making your task impossible.
How many tools have 3-byte limits on UTF-8? The only one I can think of right now is MySQL. (The workaround is to specify the utf8mb4 character set. This is MySQL's cryptic internal name for "actually doing UTF-8 correctly.")
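To see which characters a 3-byte limit actually cuts off, you can count the encoded bytes per character. A rough Python sketch; the sample characters are my own picks, not anything specific to MySQL:

```python
# Count how many bytes each sample character needs in UTF-8.
# The samples are illustrative: ASCII, Latin-1 range, BMP, and beyond the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} needs {n} byte(s) in UTF-8")

# Anything listed here would not fit in a storage format capped at 3 bytes per character.
needs_four_bytes = [ch for ch, n in utf8_lengths.items() if n > 3]
```

Only code points above U+FFFF (outside the BMP) end up in `needs_four_bytes`, which is why a 3-byte cap can go unnoticed until someone posts an emoji.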
MySQL is one of the worst offenders for broken Unicode and the collation problems that arise from it. Neither it nor JavaScript deserves consideration for problems that need robust Unicode handling.
I actually switched my (low traffic, low performance needs) blog comments database from MySQL to SQLite purely because I could not make MySQL and Unicode get along. All I needed was for it to accept and then regurgitate UTF-8 and it couldn't even handle that. I'm sure it can be done, but none of the incantations I tried made it work, and it was ultimately easier for me to switch databases.
As an ugly last resort, you could store Unicode as UTF-8 in BLOB fields. MySQL is pretty good about storing binary data. (I dread the day that I'll have to do something more advanced with Unicode in MySQL than just storing it.)
I no longer recall whether I tried that and failed, or didn't get that far. Seems like a semi-reasonable approach if you don't need the database to be able to understand the contents of that column. But on the other hand, SQLite is working great for my needs too.
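For anyone curious what "accept and then regurgitate UTF-8" looks like in SQLite, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are made up for illustration, and it stores the same string both as TEXT and as raw UTF-8 bytes in a BLOB so the two approaches can be compared:

```python
import sqlite3

# Sample text with a character outside the BMP (U+1F600), which needs 4 bytes in UTF-8.
text = "na\u00efve \u2603 \U0001F600"

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
# Hypothetical schema: one TEXT column, one BLOB column holding raw UTF-8 bytes.
conn.execute("CREATE TABLE comments (body TEXT, raw BLOB)")
conn.execute("INSERT INTO comments VALUES (?, ?)", (text, text.encode("utf-8")))

body, raw = conn.execute("SELECT body, raw FROM comments").fetchone()
conn.close()

# Both columns should hand back exactly what went in.
round_trip_ok = (body == text) and (bytes(raw).decode("utf-8") == text)
print(round_trip_ok)
```

The BLOB column is the "database doesn't need to understand it" route: the application does the encoding and decoding itself, so the database's notion of character sets never enters into it.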
That is quite clearly broken, and any tool that does so should be fixed or dumped. This is not new, and Markus Kuhn has made UTF-8 test resources available for years at http://www.cl.cam.ac.uk/~mgk25/unicode.html
I believe that MySQL is one such tool. In recent versions you can work around it by asking for the encoding "utf8mb4" instead of "utf8", but I think they have to be quite recent.
So yes, another way in which MySQL is quite clearly broken.
It is totally arbitrary - there's no reason you can't have degenerate 6-byte encodings, and compliant decoders should cope with them. See Markus Kuhn's excellent UTF-8 decoder torture test page linked elsewhere in this thread.
Actually, a truly compliant UTF-8 decoder should reject all degenerate forms, because they can be used to bypass validation checks, and the spec does not allow such forms. A four-byte UTF-8 sequence is sufficient to represent any Unicode code point (and then some), so no compliant decoder should accept any sequences of 5 or 6 bytes.
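As a concrete illustration, Python 3's built-in UTF-8 codec behaves this way: it refuses overlong forms and 5-byte sequences outright. A small sketch; `is_valid_utf8` is just a helper name I made up, and the byte strings are hand-built examples rather than anything taken from the torture test file:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # default error handling is 'strict'
        return True
    except UnicodeDecodeError:
        return False


# Hand-built test inputs (illustrative, not exhaustive).
cases = {
    "plain ASCII slash": b"/",
    "overlong 2-byte slash": b"\xc0\xaf",
    "overlong 3-byte slash": b"\xe0\x80\xaf",
    "5-byte sequence": b"\xf8\x88\x80\x80\x80",
    "valid 4-byte emoji": "\U0001F600".encode("utf-8"),
}

results = {name: is_valid_utf8(data) for name, data in cases.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

The overlong slash is the classic example of why this matters: a filter that scans the raw bytes for `/` won't see it, but a lenient decoder downstream would turn it back into one.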
Note that Kuhn's torture test page deliberately includes a lot of invalid sequences in order to make sure that they're gracefully handled. Section 4 is dedicated to degenerate encodings like this.
File bugs with those tools. These sorts of issues should have been sorted out years ago, and any program that can't do 3+ byte character encodings should be named and shamed.
And the problem with UTF-16 is that a lot of applications can't handle surrogate pairs, yet a lot of emoji are above the BMP, aren't they? So why is this a bigger deal for UTF-8 than UTF-16?
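To make the comparison concrete, here is a quick Python sketch showing what one emoji above the BMP turns into in each encoding (U+1F600 is just my example character):

```python
ch = "\U0001F600"  # an emoji outside the BMP, chosen as an example

utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, no byte order mark

# Split the UTF-16 bytes into 16-bit code units.
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]

print(len(utf8_bytes), "bytes in UTF-8")
print([f"0x{u:04X}" for u in utf16_units], "as UTF-16 code units")

# A surrogate pair is a high surrogate (D800-DBFF) followed by a low one (DC00-DFFF).
is_surrogate_pair = (
    len(utf16_units) == 2
    and 0xD800 <= utf16_units[0] <= 0xDBFF
    and 0xDC00 <= utf16_units[1] <= 0xDFFF
)
```

So the same character needs the 4-byte form in UTF-8 and a surrogate pair in UTF-16; software that skips either case breaks on exactly the same set of code points.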