Hacker News new | ask | show | jobs
by stevepike 857 days ago
Oh man, this brings me back! Almost 10 years ago I was working on a rails app trying to detect the file type of uploaded spreadsheets (xlsx files were being detected as application/zip, which is technically true but useless).

I found "magic" that could detect these and submitted a patch at https://bugs.freedesktop.org/show_bug.cgi?id=78797. My patch got rejected for needing to look at the first 3KB bytes of the file to figure out the type. They had a hard limit that they wouldn't see past the first 256 bytes. Now in 2024 we're doing this with deep learning! It'd be cool if google released some speed performance benchmarks here against the old-fashioned implementations. Obviously it'd be slower, but is it 1000x or 10^6x?

5 comments

Co-author of Magika here (Elie) so we didn't include the measurements in the blog post to avoid making it too long but we did those measurements.

Overall file takes about 6ms (single file) 2.26ms per files when scanning multiples. Magika is at 65ms single file and 5.3ms when scanning multiples.

So Magika is for the worst case scenario about 10x slower due to the time it takes to load the model and 2x slower on repeated detection. This is why we said it is not that much slower.

We will have more performance measurements in the upcoming research paper. Hope that answer the question

Is that single-threaded libmagic vs Magika using every core on the system? What are the numbers like if you run multiple libmagic instances in parallel for multiple files, or limit both libmagic and magika to a single core?

Testing it on my own system, magika seems to use a lot more CPU-time:

    file /usr/lib/*  0,34s user 0,54s system 43% cpu 2,010 total
    ./file-parallel.sh  0,85s user 1,91s system 580% cpu 0,477 total
    bin/magika /usr/lib/*  92,73s user 1,11s system 393% cpu 23,869 total
Looks about 50x slower to me. There's 5k files in my lib folder. It's definitely still impressively fast given how the identification is done, but the difference is far from negligible.
Do you have a sense of performance in terms of energy use? 2x slower is fine, but is that at the same wattage, or more?
That sounds like a nit / premature optimization.

Electricity is cheap. If this is sufficiently or actually important for your org, you should measure it yourself. There are too many variables and factors subject to your org’s hardware.

The hardware requirements of a massively parallel algorithm can't possibly be "a nit" in any universe inhabited by rational beings.
Totally disagree. Most end users are on laptops and mobile devices these days, not desktop towers. Thus power efficiency is important for battery life. Performance per watt would be an interesting comparison.
What end users are working with arbitrary files that they don’t know the identification of?

This entire use case seems to be one suited for servers handling user media.

File managers that render preview images. Even detecting which software to open the file with when you click it.

Of course on Windows the convention is to use the file extension, but on other platforms the convention is to look at the file contents

Browsers often need to guess a file type
Theoretically? Anyone running a virus scanner.

Of course, it's arguably unlikely a virus scanner would opt for an ML-based approach, as they specifically need to be robust against adversarial inputs.

I mean if you care about that you shouldn't be running anything that isn't highly optimized. Don't open webpages that might be CPU or GPU intensive. Don't run Electron apps, or really anything that isn't built in a compiled language.

Certainly you should do an audit of all the Android and iOS apps as well, to make sure they've been made in a efficient manner.

Block ads as well, they waste power.

This file identification is SUCH a small aspect of everything that is burning power in your laptop or phone as to be laughable.

Whilst energy usage is indeed a small aspect this early on when using bespoke models, we do have to consider that this is a model for simply identifying a file type.

What happens when we introduce more bespoke models for manipulating the data in that file?

This feels like it could slowly boil to the point of programs using magnitudes higher power, at which point it'll be hard to claw it back.

In general you're right, but I can't think of a single local use for identifying file types by a human on a laptop - at least, one with scale where this matters. It's all going to be SaaS services where people upload stuff.
We are building a data analysis tool with great UX, where users select data files, which are then parsed and uploaded to S3 directly, on their client machines. The server only takes over after this step.

Since the data files can be large, this approach bypasses having to trnasfer the file twice, first to the server, and then to S3 after parsing.

I've ended up implementing a layer on top of "magic" which, if magic detects application/zip, reads the zip file manifest and checks for telltale file names to reliably detect Office files.

The "magic" library does not seem to be equipped with the capabilities needed to be robust against the zip manifest being ordered in a different way than expected.

But this deep learning approach... I don't know. It might be hard to shoehorn in to many applications where the traditional methods have negligible memory and compute costs and the accuracy is basically 100% for cases that matter (detecting particular file types of interest). But when looking at a large random collection of unknown blobs, yeah, I can see how this could be great.

If you're curious, here's how I solved it for ruby back in the day. Still used magic bytes, but added an overlay on top of the freedesktop.org DB: https://github.com/mimemagicrb/mimemagic/pull/20
Many commenters seem to be using magic instead of file, any reasons?
magic is the core detection logic of file that was extracted out to be available as a library. So these days file is just a higher level wrapper around magic
thanks
From the first paragraph:

> enabling precise file identification within milliseconds, even when running on a CPU.

Maybe your old-fashioned implementations were detecting in microseconds?

Yeah I saw that, but that could cover a pretty wide range and it's not clear to me whether that relies on preloading a model.
> At inference time Magika uses Onnx as an inference engine to ensure files are identified in a matter of milliseconds, almost as fast as a non-AI tool even on CPU.
> They had a hard limit that they wouldn't see past the first 256 bytes.

Then they could never detect zip files with certainty, given that to do that you need to read up to 65KB (+ 22) at the END of the file. The reason is that the zip archive format allows "gargabe" bytes both in the beginning of the file and in between local file headers.... and it's actually not uncommon to prepend a program that self-extracts the archive, for example. The only way to know if a file is a valid zip archive is to look for the End of Central Directory Entry, which is always at the end of the file AND allows for a comment of unknown length at the end (and as the comment length field takes 2 bytes, the comment can be up to 65K long).

That's why the whole question is ill formed. A file does not have exactly one type. It may be a valid input in various contexts. A zip archive may also very well be something else.
FWIW, file can now distinguish many types of zip containers, including Oxml files.