

by EdwardRaff 3118 days ago

Hi, paper author here!

The dataset is small by AV standards, but we aren't an AV company. We can only use as much as real AV companies are willing to share with us. If you'd like to share more, we would be happy to take it :)

The model is fairly robust to new data: we tested it with malware from a completely separate source from our training data, so there shouldn't be any shared items like icons between the training set and the second test set. However, we aren't arguing that it is of AV quality today. The main purpose of this research was to get a neural network to train on this kind of data at all, as that is nontrivial and common tools (like batch-norm) didn't translate to this problem space.

We are looking at the modification issue! I can't share any results yet since we have to go through pre-publication review, but the issue isn't unknown to us!

3 comments

tedivm 3118 days ago

Did you talk with VirusTotal? That's probably the largest dataset out there that isn't controlled by an AV company.

Fnoord 3118 days ago

VirusTotal has been owned by Google since 2012. https://virusscan.jotti.org/ is another one, even older. It says 2004, but I met that guy in 2002 or 2003 and he already had this up back then. The URL might have been different.

EdwardRaff 3118 days ago

I've tried. They are fairly unresponsive. Right now my advisor is trying to get an academic license to their system.

rbanffy 3118 days ago

What is the size of the Clamav dataset?

EdwardRaff 3118 days ago

I'm not sure I understand the question. We didn't use any data from ClamAV.

rbanffy 3117 days ago

It occurred to me it could be a useful source. I never actually looked into it, so I can't be sure it'd be useful, but since you are aware of it, it's a good indication it's not.

lqdc13 3118 days ago

Awesome. I hope I'm wrong and I'm looking forward to trying out your approach!

EdwardRaff 3118 days ago

Feel free to send an email if you have any questions when trying it! Since it is a static technique, we don't expect it to become quite as good as what you could get with a dynamic approach, but we've been happy with our results thus far.

Our work so far has found that the quality of the training data is the biggest factor in the performance you should expect. That isn't surprising, but it seems to be a bigger problem in this space. Some of our first work dealt with that issue and showed how critical it can be: http://www.readcube.com/articles/10.1007/s11416-016-0283-1?a...
