Hacker News new | ask | show | jobs
by wvenable 100 days ago
I wouldn't consider it a defect in the optimizer; it's doing exactly what it's told to do. It cannot convert an nvarchar to varchar -- that's a narrowing conversion. All it can do is convert the other way and lose the ability to use the index. If you think that there is no danger converting an nvarchar that contains only ASCII to varchar then I have about 70+ different collations that say otherwise.
1 comments

Can you give an example whats dangerous about converting a nvarchar with only ascii (0-127) then using the index otherwise fallback to a scan?

If we simply went to UTF-8 collation using varchar then this wouldn't be an issue either, which is why you would use varchar in 2026, best of both worlds so to speak.

For a literal/parameter that happens to be ASCII, a person might know it would fit in varchar, but the optimizer has to choose a plan that stays correct in the general case, not just for that one runtime value. By telling SQL server the parameter is a nvarchar value, you're the one telling it that might not be ASCII.
Making a plan that works for the general case, but is also efficient, is rather trivial. Here's pseudocode from spending two minutes on the problem:

    # INPUT: lookfor: unicode
    var lower, upper: ascii
    lower = ascii_lower_bound(lookfor)
    upper = ascii_upper_bound(lookfor)
    for candidate:ascii in index_lookup(lower .. upper):
        if expensive_correct_compare_equal(candidate.field, lookfor):
            yield candidate
The magic is to have functions ascii_lower_bound and ascii_upper_bound, that compute an ASCII string such that all ASCII strings that compare smaller (greater) cannot be equal to the input. Those functions are not hard to write. Although you might have to implement versions for each supported locale-dependent text comparison algorithm, but still, not a big deal.

Worst case, 'lower' and 'upper' span the whole table - could happen if you have some really gnarly string comparison rules to deal with. But then you're no worse off than before. And most of the time you'll have lower==upper and excellent performance.

optimizer can't inspect the value? pretty dumb optimizer, then.
It's not "the value", it's "the values".
Running the optimizer for every execution of the same query is... not very optimal.
It can run it for a range of values: https://learn.microsoft.com/en-us/sql/relational-databases/p...

Also the simpler and maybe better approach is just make the decision every time as an operation in the plan, attempt the cast if it fails then scan and cast a many times the other way, if it succeeds then use the index, this isn't hard and adds one extra cast attempt on the slow path otherwise it does what everyone has to do manually in their code like this article but transparently.

The adaptive join operator does something much more complex: https://learn.microsoft.com/en-us/sql/relational-databases/p...

I'm not sure it makes sense to add more checks and another operation to every single query just for the case where the user explicitly mislabels the types. You're going to slow down everything everywhere (slightly) for a pretty obscure case. I suspect, in the long term, this would be a bad choice.