	by hexxiiiz 2028 days ago
	Went down this road recently. I wrote some Python code to handle some big archives of music, PDFs, and other media, normalizing the names of everything. I decided to try to correct the extensions so they indicate the file type properly. This turned out to be a little complicated. In most cases the libmagic estimate (the library behind `file`, working from the magic number) was good, and it provides a MIME type as output. However, it sometimes overgeneralized the file type to something broader, or outright flattened it to "binary data". To make this more robust, I used Python's mimetypes module to infer the MIME type from the filename as a secondary source of information. I then needed a set of heuristics to reconcile the two MIME types: the one derived from libmagic and the one derived from the file name. This works pretty well for identifying the consistent cases and getting them out of the way. When the libmagic MIME type is precise, it is usually safe to fix the extension, particularly if the difference is just the audio or image format. Nonetheless, there are a lot of corner cases. If I were even more ambitious, the tough cases could probably be drilled into further with some media metadata utilities. I am curious whether someone has already worked this out in the form of a library.
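
	For anyone who wants to try the same thing, here is a rough sketch of the reconciliation step. It is not my actual code, just the general shape. It assumes you already have the content-sniffed MIME type as a string (from `file --mime-type` or a libmagic binding), and the names `reconcile`, `GENERIC`, and `SAFE_FAMILIES`, as well as the specific rules, are made up for illustration.

```python
import mimetypes
from typing import Optional

# Assumption: content-sniffed types this vague are not worth acting on.
GENERIC = {"application/octet-stream", "text/plain"}

# Assumption: within these families a subtype mismatch is usually just
# a wrong extension (e.g. a PNG named .jpg), so renaming is low risk.
SAFE_FAMILIES = {"audio", "image"}


def reconcile(content_mime: Optional[str], filename: str) -> str:
    """Compare a content-derived MIME type with the one implied by the name.

    Returns one of:
      "keep"   - the two sources agree, or the content type is too vague
                 to contradict the name
      "fix"    - the content type is precise and the extension looks wrong
      "review" - a corner case that needs a human or more metadata
    """
    name_mime, _encoding = mimetypes.guess_type(filename)
    content_is_precise = content_mime is not None and content_mime not in GENERIC

    if not content_is_precise:
        # Nothing useful from the content; fall back on the name if it has one.
        return "keep" if name_mime else "review"

    if name_mime is None:
        # Missing or unknown extension, but the content type is precise.
        return "fix"

    if name_mime == content_mime:
        return "keep"

    content_family = content_mime.split("/", 1)[0]
    name_family = name_mime.split("/", 1)[0]
    if content_family == name_family and content_family in SAFE_FAMILIES:
        # Same kind of media, different format: trust the content.
        return "fix"

    # Cross-family disagreement (e.g. a PDF named .mp3): do not guess.
    return "review"
```

	The point of returning "review" instead of guessing is that the cross-family disagreements are exactly the corner cases mentioned above, and those are where a metadata tool would have to take over.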