by EdwardRaff 2517 days ago
Thanks! Yeah, we "cheat" by restricting ourselves to the top-k n-grams and relying on distributional assumptions. For our case, the low-frequency grams are never used, so it makes sense.
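
To make "top-k" concrete, here is a rough sketch of the naive baseline: count every byte n-gram exactly and keep the k most frequent. This is not the method from our work (exact counting like this is what blows up in memory for large n, which is why we lean on the top-k restriction and distributional assumptions instead). The function name `top_k_ngrams` and the sample bytes are made up for illustration.

```python
from collections import Counter


def top_k_ngrams(data: bytes, n: int, k: int):
    """Naive exact baseline: count every length-n byte window, keep the k most frequent.

    Hypothetical helper for illustration only. Memory grows with the number of
    distinct n-grams, so this does not scale to large n or large corpora.
    """
    # Slide a window of width n over the bytes; an input shorter than n yields no windows.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return counts.most_common(k)


# Toy input, not real malware bytes.
sample = b"abababcabab"
print(top_k_ngrams(sample, n=2, k=2))
```

Everything outside the top k is simply dropped here, which is only acceptable because the low-frequency grams never get used downstream.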

>So if I understand correctly... the application here is to take a large number of programs known to be infected with the same malware, and then run this to find the large chunks in common that will be more reliable as a malware signature in the future.

You basically got it! We also find large chunks in benign files too. While just _one_ chunk is pretty indicative, performance is much better when using multiple. That's where Logistic Regression (+ Lasso) comes in to tell us how important each chunk is to making a decision.
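
For anyone who wants to see the "Logistic Regression (+ Lasso)" idea in miniature, here is a small stdlib-only sketch. It assumes each file is turned into a 0/1 vector saying which chunks it contains, and fits an L1-penalized logistic regression with plain proximal gradient descent. The data, the function names, and the hyperparameters (`lam`, `lr`, `epochs`) are all invented for the example, and this is not our actual training code.

```python
import math


def soft_threshold(w: float, t: float) -> float:
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.05, lr=0.5, epochs=500):
    """Sketch of L1-penalized logistic regression via proximal gradient descent.

    X holds 0/1 rows (chunk present or not); y holds 1 for malicious, 0 for benign.
    Hyperparameter defaults are arbitrary choices for this toy example.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # The intercept is not penalized; the weights get the L1 shrinkage.
        b -= lr * grad_b / n
        w = [soft_threshold(wj - lr * gj / n, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Toy data: chunk 0 appears only in malicious files, chunk 1 appears equally in both.
X = [[1, 1], [1, 1], [1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = fit_l1_logistic(X, y)
print("chunk weights:", w, "intercept:", b)
```

The weight on the chunk that only shows up in malicious files should come out clearly larger than the weight on the chunk shared by both classes, and the L1 penalty pushes uninformative weights toward zero. That per-chunk weight is the "how important is this chunk" signal.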

Near term I think there are some NLP applications for this (though not out to 1024 grams!), and I'm hopeful extensions to this work will be useful to bioinformatics problems.

I'm also interested in seeing what craziness people come up with now that large n-grams are an option! Everything I had done/learned before put n>6 as an immediate "why even think about it, you can't compute them" bucket.

I've done a lot of work with n-grams in NLP and in my experience it's only useful up to 4-6 words at a time, unless you're trying to index proverbs.

The reason being that grammar is intensely hierarchical, so that mere "linear" processing that n-grams do stops being useful beyond things like compound words or short sayings like "don't mind if I do!"

Oh, I totally agree with everything you've said! Some collaborators I'm chatting with have more niche NLP applications where larger n would be valuable though. I don't want to go into too much detail yet since it's their project idea, and I'm not sure how comfortable they are with blasting it out to the public yet :)