Hacker News new | ask | show | jobs
by chrismorgan 2723 days ago
Parsing SQL is made faster.

Imagine a language with five keywords, for simplicity.

  keywords = ["AS", "FROM", "INSERT", "SELECT", "WHERE"];
  tokens = [As, From, Insert, Select, Where];
Now you have a token that you need to interpret:

  let token_raw = "FROM";
  let index = keywords.binary_search(token_raw);
  let token = tokens[index];
As a binary search, that’s O(log N) string comparisons. In this worst case, it may try comparing with INSERT, then AS, then FROM before deciding index is 1, but for more keywords you have more comparisons.

With perfect hashing, you devise a single hash function that hashes all possible values uniquely. Now we have something like this:

  tokens = {914576: As, 73932: From, 5791456: Insert, 21596: Select, 86548: Where};
… with all the numbers calculated by hashing the strings at compile time. Then, instead of O(log N) string comparisons, you only need one hash and one hash table lookup:

  let token_raw = "FROM";
  let token = tokens[hash(token_raw)];
1 comments

Why is this necessary at all? The lexer is recognising tokens with a finite state machine, so it already knows which keyword it's got.
The lexer knows it has a token, but it still has to match the token identifier to the corresponding keyword. Whether this happens at the lexing or parsing stage doesn't matter much, it's still expensive when strcmp is called a bunch of times for every keyword token.
Lexers don't do strcmps, they use a finite state machine to determine tokens. For instance if we have the tokens ALTER, AND, ANY, SELECT, FROM it will compile into a state machine like this:

  switch(str[i++]){
    case 'A': 
      switch(str[i++]){
        case 'L': ...
        case 'N': 
          switch(str[i++]){
            case 'D': 
              found token!
            case 'Y:
              found token!
            default:
              syntax error!
          }
       ...
      }
    case 'S': ...
    case 'F': ...
    ...
  }
Lexers already do this, and in each place where a token is found, it can know which token it was based on the DFA state that it's in.
If you have a lot of states (postgresql has > 400 keywords to identify), the assembly encoding the switch will be several kb big, so you'll have cache misses.

Perfect hashing trades more computing for less cache misses.

[edit] Also, you shouldn't forget that sparse switch...case aren't O(1).

That's the advantage of the state machine, but in such a table the lengths are already known at compile-time, and therefore the final check will be optimized to use a word wise aligned memcmp, which beats the branchy lexer code by miles.
I don't understand what you mean, can you elaborate? Are you talking about perfect hashing? Or are you talking about doing multiple strcmps? Or about doing the state machine until you reach a unique alternative, and match the remainder using strcmp?

What I was trying to say is that the lexer is already doing a state machine, and the keyword can be determined from the state of the state machine, rather than re-determining it from the token substring itself.