by eranation
15 days ago
Ran it on a subset of 10 of the 50 PRs in this benchmark: https://codereview.withmartian.com
- Very good recall (~74%, i.e. it found a lot of the golden issues)
- Not so good precision (~12%, i.e. lots of false positives)
- The low precision causes the F1 to tank (~20%; if this holds on the full 50-PR sample, it would put it almost last, even below Kilo+Grok)
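For anyone who wants to check the arithmetic: F1 is the harmonic mean of precision and recall, so a low precision drags it down even when recall is high. A minimal sketch, assuming the rounded figures above (0.12 and 0.74); the function name `f1_score` is just an illustrative helper here, not something from the benchmark's tooling:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures from the 10-PR subset (approximate, not exact counts).
precision = 0.12
recall = 0.74

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.1%}")  # prints "F1 = 20.7%"
```

With these rounded inputs the result lands at roughly 20%, which matches the number quoted; the harmonic mean stays close to the smaller of the two values, which is why precision dominates here.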