|
This looks cool. I ran this on some web crawl data I have locally, so: all files you'd find on regular websites; HTML, CSS, JavaScript, fonts etc. It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text". Some woff and woff2 files it identified as "TrueType Font Data", others are "Unknown binary data (unknown)" with low confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format". I like the idea, but the current implementation can't be relied on IMO; especially not for automation. A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar. |
For crawling we have planned a head only model to avoid fetching the whole file but it is not ready yet -- we weren't sure what use-cases would emerge so that is good to know that such model might be useful.
We mostly use Magika internally to route files for AV scanning as we wrote in the blog post, so it is possible that despite our best effort to test Magika extensively on various file types it is not as good on fonts format as it should be. We will look into.
Thanks again for sharing your experience with Magika this is very useful.