Won't they need some documents that combine whale speech noises with human words to bridge the gap? Otherwise they are making comments that are just word-like or sound-like fragments.
I don't think this is strictly needed. An English dictionary may seem pointless because it defines every word using only other English words. But the meaning is contained in the _relationship_ between the words.
I'm sure you've seen the word-vector example that captures some of this meaning:
king - man + woman ≈ queen
In Spanish,
rey - hombre + mujer ≈ reina
The _relationship_ between "king" and "queen" in English may look close enough to the _relationship_ between "rey" and "reina" in Spanish, allowing you to bridge the gap between the two languages, even if they are entirely disconnected and you've never seen a direct translation between them.
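To make that concrete, here is a toy sketch in Python with NumPy. Everything in it is made up for illustration: the 2-D vectors are hand-written (one axis for "royalty", one for "gender") and are not real embeddings, and the "Spanish" space is just the English one rotated by an arbitrary angle, which is a much stronger assumption than anything you'd get from real data. The helper names `nearest` and `similarity_matrix` are my own. The point is only that the coordinates can differ completely while the relationships stay identical.

```python
import numpy as np

def nearest(vocab, target, exclude=()):
    """Return the word whose vector has the highest cosine similarity to target."""
    best_word, best_sim = None, -np.inf
    for word, vec in vocab.items():
        if word in exclude:
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def similarity_matrix(vectors):
    """Pairwise cosine similarities between a list of vectors."""
    m = np.stack(vectors)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    return m @ m.T

# Hand-made toy vectors (assumption): axis 0 = "royalty", axis 1 = "gender".
english = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),  # unrelated distractor word
}

# Pretend the Spanish model learned the same structure in different coordinates:
# rotate every English vector by an arbitrary angle (assumption).
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
pairs = [("king", "rey"), ("queen", "reina"), ("man", "hombre"),
         ("woman", "mujer"), ("apple", "manzana")]
spanish = {es: rotation @ english[en] for en, es in pairs}

# The analogy works inside each space on its own.
en_answer = nearest(english,
                    english["king"] - english["man"] + english["woman"],
                    exclude=("king", "man", "woman"))
es_answer = nearest(spanish,
                    spanish["rey"] - spanish["hombre"] + spanish["mujer"],
                    exclude=("rey", "hombre", "mujer"))
print(en_answer, es_answer)  # queen reina

# The raw coordinates differ, but the word-to-word relationships match.
en_sims = similarity_matrix([english[en] for en, _ in pairs])
es_sims = similarity_matrix([spanish[es] for _, es in pairs])
print(np.allclose(english["king"], spanish["rey"]))  # False
print(np.allclose(en_sims, es_sims))                 # True
```

Note that I cheated by listing the words in matching order when comparing the two similarity matrices. With two truly disconnected languages you'd have to discover which word corresponds to which from the structure alone, and real embedding spaces are only roughly similar, not exact rotations of each other.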
If you had enough recordings, you could (I think) train a model _solely_ on whale speech. Humans wouldn't be able to interpret the weights, and the word vectors in that model wouldn't line up directly with the word vectors in an English model, but I suppose there's a chance that the relationships between the vectors might be similar? I don't know. I think you'd have to be very good at both linguistics and AI to know.