by EdwardRaff
2517 days ago
Thanks! Yeah, we "cheat" by restricting ourselves to the top-k and distributional assumptions. For our case, the low-frequency grams are never used, so it makes sense.

> So if I understand correctly... the application here is to take a large number of programs known to be infected with the same malware, and then run this to find the large chunks in common that will be more reliable as a malware signature in the future.

You basically got it! We find large chunks in benign files too. While just _one_ chunk is pretty indicative, performance is much better when using multiple. That's where Logistic Regression (+ Lasso) comes in, to tell us how important each chunk is to making a decision.

Near term I think there are some NLP applications for this (though not out to 1024-grams!), and I'm hopeful extensions to this work will be useful for bioinformatics problems. I'm also interested in seeing what craziness people come up with now that large n-grams are an option! Everything I had done/learned before put n>6 in an immediate "why even think about it, you can't compute them" bucket.
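For anyone who wants to play with the idea, here is a minimal sketch of the pipeline described above: pick the top-k n-grams shared across files, then turn each file into a presence/absence vector over those chunks. This is a naive exact counter that only works on toy inputs. It is not the approach described in the comment, and it will blow up in memory for large n on real corpora. The function names (`top_k_ngrams`, `presence_features`) and the sample byte strings are made up for illustration.

```python
from collections import Counter


def top_k_ngrams(files, n, k):
    """Return the k byte n-grams that occur in the most files (naive, exact)."""
    doc_freq = Counter()
    for data in files:
        # A set so each n-gram is counted at most once per file.
        grams = {data[i:i + n] for i in range(len(data) - n + 1)}
        doc_freq.update(grams)
    return [gram for gram, _ in doc_freq.most_common(k)]


def presence_features(data, vocab):
    """Binary vector: 1 if the n-gram appears anywhere in the file, else 0."""
    return [1 if gram in data else 0 for gram in vocab]


# Toy "files" (hypothetical data, not real samples).
files = [b"xxHELLOWORLDyy", b"zzHELLOWORLDqq", b"abcdefghijklmn"]
vocab = top_k_ngrams(files, n=10, k=1)
features = [presence_features(f, vocab) for f in files]
```

The resulting vectors are what you would hand to an L1-penalized logistic regression (for example scikit-learn's `LogisticRegression(penalty="l1", solver="liblinear")`), so the learned weights say how much each chunk matters and the Lasso penalty zeroes out the unhelpful ones.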
|
The reason is that grammar is intensely hierarchical, so the merely "linear" processing that n-grams do stops being useful beyond things like compound words or short sayings like "don't mind if I do!"