| It looks like you got lucky: this proprietary format appears to be nothing more than standard MIDI files concatenated together, perhaps with some additional data that you can ignore, give or take a header patch. Frankly, this barely qualifies as reverse engineering, or at least it represents a trivial case. I'm happy it was easy, but reverse engineering rarely works out so straightforwardly. I would expect someone competent in their scripting language of choice to pop out that script, which is a loop plus file IO, in a few minutes, not hours (assuming it is even correct). And anyone with basic experience working with binary files should know how to google the necessary info about MIDI in seconds. However, looking at the transcript I am also confused, because it says (correctly) that MIDI files typically start with the header "MThd" followed by the header length, format type, number of tracks, and division. It goes on: "Once a MIDI section is found, we'll extract it according to the MIDI file structure". OK.
But the script does NOT do that. It reads 4 bytes starting from offset 8 as a 32-bit big-endian "length", which is not "according to the MIDI file structure". In the standard format, those 4 bytes are 2 bytes for a format specifier (AKA type: 0, 1, or 2) followed by 2 bytes for the number of tracks. I.e., this is wrong in some way:
# Read the MIDI header and length (14 bytes in total: 'MThd' + 4 bytes for header length + 6 bytes header data)
midi_chunk = io.read(14)
# Extracting the length of the MIDI data from the header
midi_data_length = midi_chunk[8..11].unpack('N').first
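For comparison, here is a minimal Ruby sketch of how a standard 14-byte header breaks down, and what the quoted `unpack('N')` on bytes 8..11 actually returns. The helper name `parse_mthd` and the sample header values (format 1, 3 tracks, division 480) are made up for illustration and are not taken from the transcript.

```ruby
# Standard MIDI header chunk: "MThd", a 4-byte big-endian length (6),
# then format, track count and division as 16-bit big-endian values.
# parse_mthd is a hypothetical helper name used only in this sketch.
def parse_mthd(header)
  magic, length, format, ntrks, division = header.unpack('a4Nnnn')
  { magic: magic, length: length, format: format, ntrks: ntrks, division: division }
end

# Made-up sample header: format 1, 3 tracks, 480 ticks per quarter note.
sample = ['MThd', 6, 1, 3, 480].pack('a4Nnnn')

parsed = parse_mthd(sample)

# What the quoted script treats as the "MIDI data length":
# bytes 8..11 are really format (2 bytes) + track count (2 bytes).
misread = sample[8..11].unpack('N').first

puts parsed.inspect
puts misread # 65539, i.e. 65536 + 3 for a format 1 file with 3 tracks
```

The real header length lives at bytes 4..7. For a format 0 file with one track, the same misread yields 1 rather than anything near 65536.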
So either the proprietary format you're dealing with actually does have a variation on the header of the embedded MIDI file.
If that's the case, I would have to deduct points from ChatGPT, because I would expect a competent developer to comment/document this fact, and nowhere in the transcript is this stated. The other possibility I can see is that if your file is a bunch of standard Type 1 MIDI files, the unpack/parse is going to read that field as 65536 plus a small amount (the track count) and will extract files that are all around that size. Since the next step is to look for another MThd magic, it will just gleefully resync (I assume these are small segments), but you will end up skipping a whole bunch of files, and they will be unceremoniously tacked onto others (where they will just be ignored in many players). So what did it end up being?
If it was the second case, I would also be suspicious that a first-crack LLM follow-up "fix" is subtly wrong and prone to false splits. On further thought, how could it be the first case? If it were, the output files would not be standard MIDI. So something is fucked here. Either you have something totally broken, or you had further follow-up and we have to believe it is not subtly broken.
"There was a whole lot of reading the MIDI spec, searching for strings in a hex viewer and calculating values in a hex to decimal calculator." One pearl I would offer in relation to this: use your REPL, it is a productivity accelerator.
I am also sincerely interested in examples of LLMs reverse engineering something with compression or encryption or a checksum, or some actually complicated structure that has to be teased out (this is something humans do all the time), maybe something that is most easily solved by cracking open the compiled parser. I'm not saying they can't do it, but plainly put, this example is too trivial to be interesting and barely qualifies as reverse engineering, at least insofar as some sort of RE Turing Test analogue.
----
If the format works the way I think it does (and this is based on nothing more than general experience and this thread, so give me a break), there are only two robust ways to deal with it. One is to figure out where in the proprietary data some type of length field is; clearly ChatGPT was not going in that direction, nor do I believe it would be able to divine that information from a file upload. The other is to use this slightly wonkier method but actually read every MIDI chunk header, since standard MIDI has no total file size encoded in it. The loop should be: look for MThd, read the NEXT 4 bytes for the length, skip that many bytes, then read and write out chunks (i.e. a 4-byte magic followed by a 4-byte length), and split when no known chunk type follows (that's what makes this a bit fragile, but it's probably good enough).
If you just look for MThd, you'll split if the MIDI data has an 'MThd' in it. |
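The chunk-walking loop described above could be sketched in Ruby roughly as follows. This is an illustration under assumptions, not the script from the transcript: `split_midi` and `midi_file` are hypothetical names, the input is assumed to fit in memory as one binary string, and only `MTrk` is treated as a known chunk type after the header.

```ruby
# Returns an array of binary strings, one per embedded MIDI file.
# Hypothetical helper: walks chunk headers instead of scanning for the
# next 'MThd', so an 'MThd' byte sequence inside track data is skipped.
def split_midi(blob)
  blob = blob.b # work on raw bytes
  files = []
  pos = 0
  while (start = blob.index('MThd'.b, pos))
    header_len = blob[start + 4, 4].to_s.unpack1('N')
    break if header_len.nil? # truncated header at end of input
    pos = start + 8 + header_len
    # Copy chunks for as long as a known chunk type follows.
    while blob[pos, 4] == 'MTrk'
      track_len = blob[pos + 4, 4].to_s.unpack1('N')
      break if track_len.nil? # truncated chunk header
      pos += 8 + track_len
    end
    files << blob[start...pos]
  end
  files
end

# Builds a made-up format 0, single-track file for the demo.
# The payload is placeholder bytes, not valid MIDI events.
def midi_file(track_payload)
  ['MThd', 6, 0, 1, 96].pack('a4Nnnn') +
    ['MTrk', track_payload.bytesize].pack('a4N') +
    track_payload.b
end

junk = "\x00\x01meta".b # stands in for the proprietary metadata
blob = junk + midi_file('abcMThdxyz') + junk + midi_file('second') + junk

files = split_midi(blob)
puts files.length # 2, despite the 'MThd' inside the first track
```

The fragile part is the inner `while`: any chunk type other than `MTrk` ends the current file, so a file containing an unfamiliar chunk type would be cut short.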
It's distinctly possible that you're simply "better" at reverse engineering than I am, which really just means that you might do it frequently and I might do it a few times a decade. This isn't going to keep me up tonight, because my identity isn't tied to being someone who reverse engineers things.
That said, I am pretty thrilled with this solution. I launched a web-enabled version last night and so far about 1100 people have used it to convert 6800 files after I replied to some posts on relevant musician forums around the web.
In my defense, what you're not taking into consideration is that until 48 hours ago, I'd never looked at the MIDI spec or opened a MIDI file before. You clearly have a huge amount of domain knowledge that I don't pretend to have.
I also, shocking as it may seem, haven't worked with binary formats in over a decade. I'm a web developer. Binary formats aren't an alien mystery to me, but all of the tools for working with them had to be re-learned as I was working on this.
Anyhow, don't fall into the trap of equating typing speed with the time it takes to learn a domain and consider (design) an approach. If I could think at the speed I can type, John Carmack would have nothing on me.
In the end, I absolutely did get lucky. The proprietary format was, as you proposed, a bunch of 1 track/format 0 MIDI files, bounded by hierarchy metadata that was discarded.