Have you developed open source tooling I can test out?
I deal with malware from an operational perspective. I do use ML based proprietaty solutions that are crazy good at picking up on bad things. As someone who doesn't understand a lot of the n-gram and ML specific language in your paper,am I correct in sayig your work would basically allow developing better or more accurate true positive detections?
I want to open source it, but the government owns the code and bureaucracy is slow. We've talked about it, but it will move at its own pace.
We think our tool will provide features that can improve AV systems to be better and more accurate, yes (see Figure 7)! We tested that in the last section of our paperLots of future work in that direction is planned.
What’s your top-k algorithm? I’ve used count sketches and heaps (I read heavy-keeper strictly outperforms count/count-min). I’ve also been told some variant of a hierarchical count sketch provides composability, but I’m not sure what that is.
(Actually, how do you compose? Is it the feature vector a union of all top-k features?)
Most of the work we actually don't use any fancy sketching. Just a GIANT table where we count hashes, and ignore collisions, making it super fast to count. Then we just use Quick-Select to find the top-k counts out of 2^31 buckets.
Second phase we technically use the Space-Saving algorithm, but its generally not needed - we could just use a hash-map and naively count (which we know is actually safe is the data is truly power-law distributed). The Space-Saving algo gives us some greater peace-of-mind though since real data doesn't have to obey our theories :)
I deal with malware from an operational perspective. I do use ML based proprietaty solutions that are crazy good at picking up on bad things. As someone who doesn't understand a lot of the n-gram and ML specific language in your paper,am I correct in sayig your work would basically allow developing better or more accurate true positive detections?