by michaelmior 854 days ago
> the current implementation can't be relied on IMO

What's your reasoning for not relying on this? (It seems to me that this would be application-dependent at the very least.)


I'm not the person you asked, but I'm not sure I understand your question and I'd like to. It whiffed multiple common softballs, to the point it brings into question the claims made about its performance. What reasoning is there to trust it?
It had 3 failures. How is that a sign it's untrustworthy? I'm sure all alternatives have more than 3 failures. You might be making assumptions about the distribution of successes and failures (GP didn't say how many files they tested to find those 3) or how "soft" they were. In an extreme case, they might even have been crafted adversarial examples. But even if not, they might have features that really do look more like some other file type from the point of view of the classifier even if it's not easily apparent to a human. Being strictly superior to a competent human is a pretty high bar to set.
> or how "soft" they were.

From the comment: It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".

That's pretty soft. Nothing "adversarial" claimed either.

> Being strictly superior to a competent human is a pretty high bar to set.

The bar is the file utility.

Those are only soft to a human. I looked at a couple and I picked them correctly but I don't know what details the classifier was seeing which I was blind to. Not to say it was correct, just that we can't call them soft just because they're short and easy for a human.

> The bar is the file utility.

It has higher accuracy than that. You would reject it just because the failures are different, even though there are fewer of them?

Yes. Unpredictable failures are significantly worse than predictable ones. If file messes up, it's because it decided a ZIP-based document was a generic ZIP file. If Magika messes up, it's entirely random. I can work around file's failure modes, especially if it's one I work with often. Magika's failure modes strike at random and are not possible to anticipate. File also bails out when it doesn't know, whereas a very common failure mode in Magika is that it confidently returns a random answer when it wasn't trained on a file type.
Your original statement was that having a couple of failures brings into question its claims about performance. It doesn't because it doesn't claim such high performance. 99.31% is lower than perhaps 997 out of 1000 or whatever the GP tested. Of course having unpredictable failures is a worry but it's a different worry.
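To put rough numbers on that point, here is a minimal sketch of the arithmetic. The 99.31% figure is the one quoted above; the sample size of 1000 files is purely an assumption (the GP never said how many files were tested), and the model treats failures as independent with a fixed rate, which real file collections may not satisfy.

```python
from math import comb

def prob_at_least(k: int, n: int, p_fail: float) -> float:
    """Probability of k or more failures in n independent trials (binomial model)."""
    p_fewer = sum(
        comb(n, i) * p_fail**i * (1 - p_fail) ** (n - i) for i in range(k)
    )
    return 1.0 - p_fewer

accuracy = 0.9931   # accuracy figure quoted in the thread
n_files = 1000      # ASSUMPTION: hypothetical sample size, not stated by the GP
p_fail = 1 - accuracy

expected_failures = n_files * p_fail
p_three_or_more = prob_at_least(3, n_files, p_fail)

print(f"expected failures in {n_files} files: {expected_failures:.1f}")
print(f"P(at least 3 failures): {p_three_or_more:.3f}")
```

Under those assumptions, seeing three misclassifications is unsurprising for the claimed accuracy. It says nothing about whether the failures are predictable, which is the separate worry raised above.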
> It whiffed multiple common softballs

I must have missed this in the article. Where was this?

...It's in the comment you were responding to. Directly above the section you quoted.
I understand that, but it wasn't clear to me where those examples came from.
It's pretty obvious from the whole comment that they're his own experience. Are you going anywhere with this or are you just saying things?
It provided the wrong file-types for some files, so I cannot rely on its output to be correct.

If you wanted to, for example, use this tool to route different files to different format-specific handlers it would sometimes send files to the wrong handlers.

Except a 100% correct implementation doesn't exist AFAIK. So if I want to do anything that makes a decision based on the type of a file, I have to pick some algorithm to do that. If I can do that correctly 99% of the time, that's better than not being able to make that decision at all, which is where I'm left if a perfect implementation doesn't exist.
Nobody's asking for perfection. But the AI makes inexplicable, obvious, nondeterministic mistakes that the traditional algorithms don't suffer from.

Magika goes wrong and your fonts become audio files and nobody knows why. Magic goes wrong and your ZIP-based documents get mistaken for generic ZIP files. If you work with that edge case a lot, you can anticipate it with traditional algorithms. You can't anticipate nondeterministic hallucination.

Seconding this.

Something like Magika is potentially useful as a second pass if conventional methods of detecting a file type fail or yield a low-confidence result. But, for the majority of binary files, those conventional methods are perfectly adequate. If the first few bytes of a file are "GIF89a", you don't need an AI to tell you that it's probably a GIF image.
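The "conventional first, model second" idea can be sketched roughly as follows. This is an illustration only: the signature table is a tiny hand-picked sample, and `fallback` is a hypothetical stand-in for whatever second-pass classifier you would plug in (it is not Magika's actual API).

```python
from typing import Callable, Optional

# A few well-known leading-byte signatures; real tools such as `file` know far more.
MAGIC_SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]

def detect_by_magic(data: bytes) -> Optional[str]:
    """Return a MIME type if the leading bytes match a known signature."""
    for signature, mime in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime
    return None  # no match: bail out instead of guessing

def detect(data: bytes, fallback: Callable[[bytes], Optional[str]]) -> Optional[str]:
    """First pass: magic numbers. Second pass: a hypothetical fallback classifier."""
    return detect_by_magic(data) or fallback(data)

def no_fallback(data: bytes) -> Optional[str]:
    """Placeholder second pass that never guesses."""
    return None

print(detect(b"GIF89a\x01\x00\x01\x00", no_fallback))
print(detect(b"<html><head></head></html>", no_fallback))
```

The second call returns `None` because nothing in the table matches; that is the point at which a model, or a text-oriented heuristic, would get its turn.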

Doesn't seem all that non-deterministic. I tested the vba.html example multiple times and it always said it was VBA. I added a space between </HEAD> and <BODY> and it correctly picked HTML as most likely but with a low confidence.

So I think we can say it's sensitive to mysterious features, not that it's non-deterministic. Still leads to your same conclusion that you can't anticipate the failures. But I don't think you can with traditional tools either. Some magic numbers are just plain text (like MZ) which could legitimately accidentally appear at the beginning of a plain text file, for example.
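The MZ case is easy to reproduce with a deliberately naive check. The function below is a hypothetical two-byte prefix test written for illustration, not how `file` actually decides; real tools look at more of the header before committing.

```python
def looks_like_dos_executable(data: bytes) -> bool:
    # Naive check: DOS/PE executables begin with the two ASCII bytes "MZ".
    return data.startswith(b"MZ")

# An ordinary plain-text note that happens to begin with the same two letters.
note = "MZ-series part numbers, revision 3\n".encode("ascii")

print(looks_like_dos_executable(note))  # True: a false positive
```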

Where are you getting the non-determinism part from? It would seem surprising for there to be anything non-deterministic about an ML model like this, and nothing in the original reports seems to suggest that either.
Large ML models tend to be uncorrectably non-deterministic simply from doing lots of floating point math in parallel. Addition and multiplication of floats is neither commutative nor associative - you may get different results depending on the order in which you add/multiply numbers.
Addition and multiplication of floats are commutative.
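A quick check of the distinction, using ordinary Python floats (IEEE 754 doubles): swapping operands gives the same result, while regrouping them may not.

```python
a, b, c = 0.1, 0.2, 0.3

# Commutativity: operand order does not matter.
add_commutes = (a + b) == (b + a)
mul_commutes = (a * b) == (b * a)

# Associativity: grouping does matter, because each step rounds separately.
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

print(add_commutes, mul_commutes)  # True True
print(left == right)               # False
```

That grouping sensitivity is what makes the order of a parallel reduction visible in the result.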
> It would seem surprising for there to be anything non-deterministic about an ML model like this

I think there may be some confusion of ideas going on here. Machine learning is fundamentally stochastic, so it is non-deterministic almost by definition.