Except that image formats and archive formats are composites too (data + metadata). We have Exif for images, and you might be surprised by how much metadata a USTar header carries (permissions, owner and group, modification time, link target, and more).
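To make the USTar point concrete, here is a minimal Python sketch that serializes a single USTar header with the standard library's tarfile module and slices it into its fields. The field names and offsets follow the POSIX ustar layout; the sample entry (hello.txt, user alice, uid 1000) is made up for illustration.

```python
import tarfile

# (field, offset, length) within the 512-byte POSIX ustar header block
USTAR_FIELDS = [
    ("name", 0, 100), ("mode", 100, 8), ("uid", 108, 8), ("gid", 116, 8),
    ("size", 124, 12), ("mtime", 136, 12), ("chksum", 148, 8),
    ("typeflag", 156, 1), ("linkname", 157, 100), ("magic", 257, 6),
    ("version", 263, 2), ("uname", 265, 32), ("gname", 297, 32),
    ("devmajor", 329, 8), ("devminor", 337, 8), ("prefix", 345, 155),
]

def ustar_header(info: tarfile.TarInfo) -> bytes:
    # Serialize only the header block (no file contents) in ustar format.
    return info.tobuf(format=tarfile.USTAR_FORMAT)

def parse_header(block: bytes) -> dict:
    # Raw field bytes with NUL/space padding stripped; numbers stay as octal text.
    return {name: block[off:off + length].rstrip(b"\x00 ")
            for name, off, length in USTAR_FIELDS}

# Hypothetical sample entry: every value here is illustrative.
info = tarfile.TarInfo("hello.txt")
info.size = 5
info.mode = 0o644
info.uid, info.gid = 1000, 1000
info.uname, info.gname = "alice", "users"
info.mtime = 1_700_000_000

fields = parse_header(ustar_header(info))
for name, value in fields.items():
    print(f"{name:>9}: {value!r}")
```

Only name and size describe the file itself in the usual sense; the other fields are the kind of metadata meant above.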
By that reasoning almost every format is a composite, which doesn't seem like a useful distinction. Such metadata should be fine as long as it is isolated and can be updated independently of the parent format.
My reasoning for Exif was that it is not only auxiliary but also post hoc: Exif was defined independently of image formats and was only adopted later, because those formats provided extension points (JPEG APPn markers, PNG chunks).
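The extension-point argument can be illustrated with PNG's chunk structure. The sketch below is a Python toy, not a PNG library: it assumes only the basic chunk layout (4-byte big-endian length, 4-byte type, data, CRC-32 over type and data) and the convention that a lowercase first letter in the chunk type marks an ancillary chunk. It rewrites an ancillary eXIf chunk while re-emitting every other chunk byte for byte. The IHDR payload is 13 placeholder zero bytes and there is no IDAT, so the stream is not a decodable image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(ctype: bytes, data: bytes) -> bytes:
    # length (u32, big-endian) + type + data + CRC-32 computed over type + data
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)

def iter_chunks(png: bytes):
    # Yield (type, data) pairs; CRCs are not verified in this sketch.
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # length field + type + data + CRC

def is_ancillary(ctype: bytes) -> bool:
    # Bit 5 of the first type byte (a lowercase letter) marks an ancillary chunk.
    return bool(ctype[0] & 0x20)

def replace_ancillary(png: bytes, ctype: bytes, new_data: bytes) -> bytes:
    # Rewrite one ancillary chunk; valid chunks of other types come out identical.
    if not is_ancillary(ctype):
        raise ValueError(f"{ctype!r} is a critical chunk, not an isolated extension")
    out = [PNG_SIGNATURE]
    for t, d in iter_chunks(png):
        out.append(make_chunk(t, new_data if t == ctype else d))
    return b"".join(out)

# Hypothetical toy stream: placeholder IHDR bytes, no pixel data.
toy = (PNG_SIGNATURE
       + make_chunk(b"IHDR", bytes(13))
       + make_chunk(b"eXIf", b"old Exif bytes")
       + make_chunk(b"IEND", b""))
updated = replace_ancillary(toy, b"eXIf", b"new Exif bytes")
print(list(iter_chunks(updated)))
```

Refusing to touch critical chunks is a choice made for this sketch: a chunk like IHDR carries information needed to decode the pixel data, so it cannot be swapped out without reinterpreting the image.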
You've got a good point that there are multiple kinds of metadata, and some metadata may be crucial for interpreting the data. I would say such "structural" metadata should be considered part of the data. I'm not saying it isn't metadata; it is metadata inside some data, so it doesn't count for our purpose of defining a composite.
I also don't think tar hardlinks count as metadata for our purpose. A hardlink entry technically consists of the linked path (instead of the file contents) plus the information that the entry is a hardlink. The former is clearly data, and the latter is metadata used to reconstruct the original file system, so it should be considered part of a larger piece of data (in this case, the logical notion of a "file").
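A minimal Python sketch of the hardlink case, using the standard library's tarfile module with made-up paths (data/original.txt, data/alias.txt): the link entry stores only a type flag and the target path, with no contents of its own, and reading it back recovers the logical file from those two pieces. Here tarfile.extractfile follows the link to the earlier member's contents; that resolution step is tarfile's behaviour when reading, not something stored in the archive.

```python
import io
import tarfile

def build_archive() -> bytes:
    # One regular file plus a hardlink entry pointing at it (illustrative names).
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        payload = b"hello\n"
        original = tarfile.TarInfo("data/original.txt")
        original.size = len(payload)
        tf.addfile(original, io.BytesIO(payload))

        alias = tarfile.TarInfo("data/alias.txt")
        alias.type = tarfile.LNKTYPE          # the "this entry is a hardlink" flag
        alias.linkname = "data/original.txt"  # the linked path; no contents stored
        tf.addfile(alias)                     # header only, size stays 0
    return buf.getvalue()

def inspect_archive(archive: bytes):
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tf:
        entries = [(m.name, m.islnk(), m.size, m.linkname) for m in tf.getmembers()]
        # tarfile resolves the hardlink to the earlier member's contents.
        alias_contents = tf.extractfile("data/alias.txt").read()
    return entries, alias_contents

entries, contents = inspect_archive(build_archive())
for entry in entries:
    print(entry)
print(contents)
```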
I believe these examples are enough to derive my definition of "composite". Please let me know if they aren't.