by simonw 953 days ago
If you're curious how Git knows the syntax of different languages in order to support this kind of feature, take a look in https://github.com/git/git/blob/master/userdiff.c

Here's how support for Python and Ruby is defined:

    PATTERNS("python",
        "^[ \t]*((class|(async[ \t]+)?def)[ \t].*)$",
        /* -- */
        "[a-zA-Z_][a-zA-Z0-9_]*"
        "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
        "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"),
        /* -- */
    PATTERNS("ruby",
        "^[ \t]*((class|module|def)[ \t].*)$",
        /* -- */
        "(@|@@|\\$)?[a-zA-Z_][a-zA-Z0-9_]*"
        "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+|\\?(\\\\C-)?(\\\\M-)?."
        "|//=?|[-+*/<>%&^|=!]=|<<=?|>>=?|===|\\.{1,3}|::|[!=]~"),
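To get a feel for what the first regex in each entry does, here is a rough sketch that applies the Python one to a few lines. It assumes Git picks the nearest earlier line matching the pattern as the "function" a change belongs to, and it uses Python's `re` as a stand-in for the POSIX extended regexes Git actually compiles. `hunk_header` is a made-up helper name, not anything from Git.

```python
import re

# First regex from the "python" PATTERNS entry, with the C string escapes resolved.
# Git compiles this as a POSIX extended regex; Python's `re` is only a stand-in.
PYTHON_FUNCNAME = re.compile(r"^[ \t]*((class|(async[ \t]+)?def)[ \t].*)$")


def hunk_header(lines, changed_index):
    """Walk upward from a changed line and return the nearest matching line's capture.

    Hypothetical helper: a simplification of how a funcname line is chosen.
    """
    for i in range(changed_index, -1, -1):
        match = PYTHON_FUNCNAME.match(lines[i])
        if match:
            return match.group(1)
    return None


source = [
    "class Greeter:",
    "    async def greet(self, name):",
    "        message = 'hello ' + name",
    "        return message",
    "",
    "x = 1",
]

print(hunk_header(source, 3))  # a line inside greet()
print(hunk_header(source, 5))  # the module-level `x = 1`
```

Both calls return `async def greet(self, name):`. The pattern only looks at how a line starts and knows nothing about indentation or where a block ends, so the module-level `x = 1` gets attributed to the method above it.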

it's a fantastic feature in theory. in practice, it's imprecise and error-prone, and I believe these regular expressions are probably why. I hadn't looked at the implementation before, but I approached it from the other end: I set up a bunch of test cases, and I was pretty disappointed.

there were two disappointments. first, `git log -L` seems to prioritize tracking blocks of code over lines of code. that's just a design choice I disagree with, so it wasn't a big deal. but it also lost track of lines of code for me quite often, and produced a number of false positives to boot.

to be fair, I haven't tried using `diff=LANG` (per a comment below), and that might get more reliable results.
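For anyone who wants to try that, a minimal sketch of the setup, assuming the usual `.gitattributes` form `<pattern> diff=<driver>` and the `-L :<funcname>:<file>` form of `git log`. The two helper names are made up for illustration, and the script only writes the attribute line and builds the command; it does not run Git.

```python
from pathlib import Path


def enable_builtin_driver(repo_dir, glob="*.py", driver="python"):
    """Add a `<glob> diff=<driver>` line to .gitattributes unless it is already there.

    Hypothetical helper; `driver` should name one of the built-in userdiff entries.
    """
    path = Path(repo_dir) / ".gitattributes"
    line = f"{glob} diff={driver}"
    existing = path.read_text().splitlines() if path.exists() else []
    if line not in existing:
        path.write_text("\n".join(existing + [line]) + "\n")
    return line


def log_function_command(funcname, file_path):
    """Build the argument list for `git log -L :<funcname>:<file>` (hypothetical helper)."""
    return ["git", "log", "-L", f":{funcname}:{file_path}"]


if __name__ == "__main__":
    print(enable_builtin_driver("."))
    print(" ".join(log_function_command("greet", "greeter.py")))
```

The resulting argument list can be passed to `subprocess.run` from inside the repository. Whether this actually fixes the lost lines and false positives is untested here.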

Yeah, this has been my experience too. Easily confused by common constructions in some codebases, and that can make it almost completely useless. I would happily sacrifice a lot of speed to get a difftastic level of precision.

I've attempted something similar to your ast-search tool, but it instead iterates through git history, pulls out the relevant text and then provides the diff to the user.
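A rough sketch of that approach, not the actual tool: it assumes Python sources, finds the function by name with the stdlib `ast` module, and diffs consecutive versions with `difflib`. The `versions` list stands in for git history (in a real script each string might come from `git show <rev>:<path>`), and both function names are made up for illustration.

```python
import ast
import difflib


def function_source(source, name):
    """Return the source text of a function called `name`, or None if absent."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return ast.get_source_segment(source, node)
    return None


def function_history(versions, name):
    """Yield a unified diff for each step where the function's text changed.

    `versions` is a list of (label, source) pairs, oldest first.
    """
    previous = None
    for label, source in versions:
        current = function_source(source, name)
        if current is None:
            continue  # function missing (or renamed) in this version
        if previous is not None and current != previous[1]:
            yield "\n".join(difflib.unified_diff(
                previous[1].splitlines(), current.splitlines(),
                fromfile=previous[0], tofile=label, lineterm=""))
        previous = (label, current)


# Stand-in for three revisions of one file.
versions = [
    ("rev1", "def area(w, h):\n    return w * h\n"),
    ("rev2", "import math\n\ndef area(w, h):\n    return w * h\n"),
    ("rev3", "import math\n\ndef area(w, h):\n    return abs(w * h)\n"),
]

diffs = list(function_history(versions, "area"))
for diff in diffs:
    print(diff)
```

Only the rev2 to rev3 step produces a diff, since the added import in rev2 moves the function without changing its text. Because the lookup is by name, a rename makes the function disappear from the history.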

It's a tricky problem because it sits somewhere between text, where a renamed function is still obviously the same function because the text is similar, and an AST, where 'similarity' is a difficult concept.
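A toy illustration of that tension, using only the stdlib (this is not how pyastsim or any other tool does it). The text score is a plain `difflib` ratio over the source, and the "shape" score is the same ratio over `ast.dump` output after replacing every identifier with `_`. All names below are made up for the sketch.

```python
import ast
import difflib


class Anonymize(ast.NodeTransformer):
    """Replace identifiers with a placeholder so only the tree shape is compared."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node


def text_similarity(a, b):
    """Character-level similarity of the raw source, between 0 and 1."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def shape_similarity(a, b):
    """Similarity of the anonymized AST dumps, between 0 and 1."""
    dumps = [ast.dump(Anonymize().visit(ast.parse(src))) for src in (a, b)]
    return difflib.SequenceMatcher(None, dumps[0], dumps[1], autojunk=False).ratio()


old = "def total(items):\n    return sum(items)\n"
renamed = "def grand_total(values):\n    return sum(values)\n"
different = "def total(items):\n    return len(items)\n"

for label, other in (("renamed", renamed), ("different", different)):
    print(label, text_similarity(old, other), shape_similarity(old, other))
```

The rename scores lower on text than the one-word behaviour change does, while the anonymized shape score is 1.0 for both, so it cannot tell `sum` from `len` at all. Keeping some names and discarding others is where the judgment calls start.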

I struggled to make it usable, but of course there's a module to do half of it that I didn't find initially - https://pypi.org/project/pyastsim/

Interesting. Also I am surprised at how short the list is!