by lifthrasiir 783 days ago
I don't quite understand the motivation for this package. Yes, truncating a UTF-8 string to a byte size limit without making it invalid is a valid problem. But if that were the only motivation, the function signature ought to be:

    func LimitPrefix(a string, n int) string
...and it should never have an error condition (which does allocate memory). The name and signature should immediately suggest the following requirements:

1. `LimitPrefix(a, n)` should always return some prefix of `a`, namely `a[:m]` where 0 <= m <= n.

1a. In particular, `m` can be zero if `n` is small enough.

1b. And `m` is expected to equal `len(a)` if `len(a) <= n`.

2. `a[:m]` should of course be a valid UTF-8 string if `a` already was.

3. `m` should be maximized under these conditions.

There are some edge cases that we have to fill in as well:

4. The conditions 1 and 1b are a reasonable expectation even for non-UTF-8 inputs. They are also easy to guarantee.

5. The condition 2 can't be efficiently extended for non-UTF-8 inputs and no justifiable use cases exist.

6. However the condition 3 depends on the condition 2 (and 1 of course). Therefore it should be replaced with something concrete, otherwise we risk an unintentional incompatibility.

7. Negative n may arise from a size calculation with a missing bound check, so treating it as zero sounds fair.

The current design, in comparison, just finds the last UTF-8 lead byte within `a[:l]`. It doesn't even help with the truncation: both `Find("thirty", 5)` and `Find("dreißig", 5)` return 4, but `"thirty"[:5]` is valid while `"dreißig"[:5]` is invalid. Also `Find("one", 5)` unexpectedly fails! An arbitrary condition of `l <= 3` is even more confusing.
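To make the `[:5]` claim concrete, here is a minimal check using only the standard `unicode/utf8` package. The helper name `validPrefix` is made up for this example; it says nothing about the package's `Find`, only about the two slices themselves.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validPrefix is a hypothetical helper: it reports whether the first n
// bytes of s form a valid UTF-8 string. It assumes 0 <= n <= len(s).
func validPrefix(s string, n int) bool {
	return utf8.ValidString(s[:n])
}

func main() {
	// "ß" occupies two bytes (0xC3 0x9F) at byte offsets 4 and 5 of
	// "dreißig", so cutting at byte 5 keeps the lead byte and drops
	// its continuation byte.
	fmt.Println(validPrefix("thirty", 5))  // true
	fmt.Println(validPrefix("dreißig", 5)) // false
	fmt.Println(validPrefix("dreißig", 4)) // true
}
```

So an index of 4 happens to be a safe cut point for "dreißig", but it needlessly drops a byte from "thirty".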

---

Based on the aforementioned conditions, I propose the following instead (warning: never tested):

    func LimitPrefix(a string, n int) string {
        if len(a) <= n { // Condition 1b: nothing to truncate
            return a
        }
        if n <= 0 { // Conditions 1a and 7: negative n is treated as zero
            return ""
        }

        // From here on 0 < n < len(a), so Condition 1 holds for any cut <= n.
        bound := max(n-4, 0) // Condition 3: Assume that a[n-4:n] has one or more lead bytes.

        for i := n - 1; i >= bound; i-- {
            var extent int
            switch a[i] >> 4 {
            case 8, 9, 0xa, 0xb:
                continue // Continuation byte: keep scanning backwards
            case 0xc, 0xd:
                extent = 2
            case 0xe:
                extent = 3
            case 0xf:
                extent = 4
            default: // 0 through 7: ASCII
                extent = 1
            }
            if i+extent <= n { // Condition 2: the last sequence fits entirely
                return a[:i+extent]
            }
            return a[:i] // The last sequence is cut off at n, so drop it
        }

        return a[:n] // Condition 6: Do not truncate if no lead byte is found.
    }
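For comparison, the same requirements can probably be met with the standard library alone. The following is only a sketch, not part of the package under discussion: `limitPrefixStd` is a made-up name, and it relies on the observation that for valid UTF-8 a cut at byte `m < len(a)` is safe exactly when `a[m]` starts a rune, which `utf8.RuneStart` checks.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// limitPrefixStd is a hypothetical variant that returns the longest
// prefix of a that is at most n bytes long and does not end in the
// middle of a UTF-8 sequence.
func limitPrefixStd(a string, n int) string {
	if n >= len(a) {
		return a // Condition 1b: nothing to truncate
	}
	if n <= 0 {
		return "" // Conditions 1a and 7
	}
	// Here 0 < n < len(a), so a[n] exists. A sequence has at most three
	// continuation bytes, so for valid input one of a[n-3..n] starts a rune.
	for i := n; i >= 0 && i > n-4; i-- {
		if utf8.RuneStart(a[i]) {
			return a[:i]
		}
	}
	// Assumption for invalid input (Condition 6): if no rune start is
	// nearby, fall back to a plain byte cut.
	return a[:n]
}

func main() {
	fmt.Printf("%q\n", limitPrefixStd("thirty", 5))  // "thirt"
	fmt.Printf("%q\n", limitPrefixStd("dreißig", 5)) // "drei"
	fmt.Printf("%q\n", limitPrefixStd("dreißig", 6)) // "dreiß"
	fmt.Printf("%q\n", limitPrefixStd("one", 5))     // "one"
	fmt.Printf("%q\n", limitPrefixStd("one", -1))    // ""
}
```

Looking at the byte after the cut instead of decoding the byte before it avoids the per-lead-byte length table, at the cost of treating any non-continuation byte as a rune start.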
1 comments

Thank you for your reply.

I was thinking about whether to return an error. If we can’t find a UTF-8 start byte in the nearby 4 bytes, it’s unclear what to return. I thought maybe we could ignore this problem.

I don’t return the string itself because I don’t know if users want the start or the end of the string. Also, I want to avoid copying large strings. It’s up to the users how they use this function.

Since no one is using this package yet, we might consider changing the interface.

> I was thinking about whether to return an error. If we can’t find a UTF-8 start byte in the nearby 4 bytes, it’s unclear what to return. I thought maybe we could ignore this problem.

This highly depends on the use case, and your stated use case doesn't seem to need any sort of error.

> Also, I want to avoid copying large strings.

Go strings are not copied in that way; they are implemented like immutable slices [1].

[1] https://research.swtch.com/godata
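A quick way to see this for yourself is to compare data pointers. This is a rough sketch: `sharesStorage` and `demo` are names made up for the example, and `unsafe.StringData` needs Go 1.20 or later.

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

// sharesStorage is a hypothetical helper: it reports whether sub begins
// inside the bytes that back s.
func sharesStorage(s, sub string) bool {
	if len(s) == 0 || len(sub) == 0 {
		return false
	}
	start := uintptr(unsafe.Pointer(unsafe.StringData(s)))
	p := uintptr(unsafe.Pointer(unsafe.StringData(sub)))
	return p >= start && p < start+uintptr(len(s))
}

// demo slices a 1 MiB string and, for contrast, makes an explicit copy
// of the same slice with strings.Clone.
func demo() (sliceShares, cloneShares bool) {
	big := strings.Repeat("x", 1<<20)
	prefix := big[:10]
	return sharesStorage(big, prefix), sharesStorage(big, strings.Clone(prefix))
}

func main() {
	sliceShares, cloneShares := demo()
	fmt.Println(sliceShares, cloneShares) // prints "true false"
}
```

The slice only gets a new pointer and length over the same bytes, so returning `a[:m]` from a truncation function costs the same no matter how large `a` is.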