I don't quite understand the motivation of this package. Yes, truncating a UTF-8 string to a byte size limit without making it invalid is a valid problem. But if that were the only motivation, the function signature ought to be:

    func LimitPrefix(a string, n int) string

...and it should never have an error condition (which does allocate memory). The name and signature should immediately suggest the following requirements:

1. `LimitPrefix(a, n)` should always return some prefix of `a`, namely `a[:m]` where 0 <= m <= n.
   - 1a. In particular, `m` can be zero if `n` is small enough.
   - 1b. And `m` is expected to equal `len(a)` if `len(a) <= n`.
2. `a[:m]` should of course be a valid UTF-8 string if `a` already was.
3. `m` should be maximized under these conditions.

There are some edge cases that we have to fill in as well:

4. Conditions 1 and 1b are a reasonable expectation even for non-UTF-8 inputs. They are also easy to guarantee.
5. Condition 2 can't be efficiently extended to non-UTF-8 inputs, and no justifiable use cases exist.
6. However, condition 3 depends on condition 2 (and 1, of course). Therefore it should be replaced with something concrete for non-UTF-8 inputs, otherwise we risk an unintentional incompatibility.
7. A negative `n` may arise from a size calculation with a missing bounds check, so treating it as zero sounds fair.

The current design, in comparison, just finds the last UTF-8 lead byte within `a[:l]`. It doesn't even help with the truncation: both `Find("thirty", 5)` and `Find("dreißig", 5)` return 4, but `"thirty"[:5]` is valid while `"dreißig"[:5]` is invalid. Also, `Find("one", 5)` unexpectedly fails! The arbitrary condition of `l <= 3` is even more confusing.

---

Based on the aforementioned conditions, I propose the following instead (warning: never tested):

    func LimitPrefix(a string, n int) string {
    	if len(a) <= n { // Condition 1b
    		return a
    	}
    	n = max(n, 0) // Condition 7; n < len(a) already holds here (condition 1)
    	// Condition 3: a sequence cut off by a[:n] must have its lead byte in a[n-3:n].
    	bound := max(n-3, 0) // Condition 1a
    	for i := n - 1; i >= bound; i-- {
    		var extent int
    		switch a[i] >> 4 {
    		case 0x8, 0x9, 0xa, 0xb:
    			continue // Continuation byte: keep scanning backwards.
    		case 0xc, 0xd:
    			extent = 2
    		case 0xe:
    			extent = 3
    		case 0xf:
    			extent = 4
    		default:
    			extent = 1 // ASCII
    		}
    		if i+extent > n { // Condition 2: the sequence would be cut, so drop it.
    			return a[:i]
    		}
    		return a[:n]
    	}
    	return a[:n] // Condition 6: do not truncate further if no lead byte is found.
    }
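
Since the proposal above is untested, a standard-library cross-check may help. The sketch below is an alternative under stated assumptions, not the package's code: `limitPrefixRef` is a hypothetical name, it relies on `utf8.RuneStart`, and it needs Go 1.21 or later for the builtin `max`. It is only meant to satisfy conditions 1 to 3 on valid UTF-8 input; on invalid input it may cut differently.

```go
package main

import "unicode/utf8"

// limitPrefixRef is a hypothetical cross-check, not part of the package.
// It returns the longest prefix of a that is at most n bytes long and does
// not end in the middle of a UTF-8 sequence, assuming a is valid UTF-8.
func limitPrefixRef(a string, n int) string {
	if len(a) <= n {
		return a // Condition 1b: nothing to cut.
	}
	n = max(n, 0) // Condition 7: a negative n behaves like zero.
	// a[n] exists here. Cutting at m is safe when a[m] starts a rune, so step
	// back at most 3 bytes, the longest possible run of continuation bytes.
	for m := n; m >= max(n-3, 0); m-- {
		if utf8.RuneStart(a[m]) {
			return a[:m]
		}
	}
	return a[:n] // Condition 6: no lead byte nearby (invalid input), keep n bytes.
}
```

With this sketch, `limitPrefixRef("thirty", 5)` gives `"thirt"`, `limitPrefixRef("dreißig", 5)` gives `"drei"`, and `limitPrefixRef("one", 5)` gives `"one"`.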
I was thinking about whether to return an error. If we can’t find a UTF-8 start byte in the nearby 4 bytes, it’s unclear what to return. I thought maybe we could ignore this problem.
I don’t return the string itself because I don’t know if users want the start or the end of the string. Also, I want to avoid copying large strings. It’s up to the users how they use this function.
Since no one is using this package yet, we might consider changing the interface.
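
To illustrate the index-returning style mentioned above, here is a minimal sketch. `splitIndex` is a hypothetical helper, not this package's `Find`, and it uses a simple forward scan for brevity rather than looking only at the bytes near the limit.

```go
package main

// splitIndex is a hypothetical helper, not the package's Find. It returns the
// largest rune-start index i <= n (0 if n is negative), or len(a) when n
// covers the whole string.
func splitIndex(a string, n int) int {
	if n >= len(a) {
		return len(a)
	}
	last := 0
	for i := range a { // Ranging over a string yields the start index of each rune.
		if i > n {
			break
		}
		last = i
	}
	return last
}
```

With `i := splitIndex("dreißig", 5)`, `i` is 4, so `a[:i]` is `"drei"` and `a[i:]` is `"ßig"`; the caller picks whichever side it needs. In Go, both slices share the original string's bytes, so neither one copies the data.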