Some articles I found by googling [1] [2] from two years ago describe this capability as an "ultrasonic watermark", so it is not new. I think this is coming to light as Zoom has become popular with the pandemic. For a journalist wanting to sanitize audio, I would think they need to remove anything higher than 15 kHz.
Audio watermarking is old hat, and it’s FAR easier for Zoom than for, say, a music service, because people are used to imperfections/stuttering/blurring in their Zoom calls, which can just be encoded watermarks.
Pasting a comment I found interesting & funny from one of the commenters on that article:
"...It is a strange thing that the real quality audio is now reserved for the pirates. This industry really knows how to hit a target."
Listening to the samples (I've got a nice BeO over-ear headset that has very good performance), I also realized that Spotify gives me some noise. I had assumed it was a codec/digital thing... little did I know.
I'm a Spotify subscriber but I'd be the first to admit that Spotify's audio quality isn't great to begin with, even when set to high quality streaming. It's noticeably worse than uncompressed CD quality (ignoring CDs that were mastered from sources that were lossily compressed to begin with - what a great trend that was).
This isn't a complaint, more an observation: Spotify works really well if you're outside, in a car, or even in an office environment with plenty of low level background noise. It's not so great when you run it through a half-decent hifi in your own home. Still, good enough for casual listening. However, if you're paying attention, you'll notice the flaws easily.
So, some of that noise is probably just that: noise. But some of it will also no doubt be the watermark.
Fortunately I didn't pay that much for my setup. Amp and speakers are about 30 years old and were given to me by my stepdad about 20 years ago. Pretty much everything else is second hand from eBay and 25 - 40 years old (CD player, tuner, tape, EQ).
The biggest expense is the subwoofer, which I did buy new because used prices for a decent subwoofer are still pretty high, especially when you factor in the cost of petrol to go and collect the thing (most people don't want to post because they weigh a lot).
The only other new components are an inexpensive Bluetooth 5.0 receiver, the speaker cable (Bassface, which I want to say was about £2/metre - super-cheap by audiophile standards) and gold-plated banana plugs from RS Components. All the interconnects are, I think, Amazon Basics.
So my total expenditure for the whole system is less than £1,000. Fully half of that is the subwoofer. Admittedly, that's still probably a fair bit by most people's standards, especially when it's perfectly possible to get very good sound from a hifi separates system for £250 or so (see Techmoan's video series on the topic, for example: https://www.youtube.com/watch?v=lSY1iZqH118), but it's chickenfeed for most audiophiles. Still, I'm definitely not one of those guys: it sounds more than good enough to me and I've no desire to fall any further into that particular black hole.
Except for one thing... I don't have a turntable. So what I'm probably going to do is buy a pair of SL1210s and a mixer to plug in to the system. I'm lucky enough to have a fair number of 12" singles from a freecycle "barn find" type situation a few years ago, and another time-consuming hobby to get through the rest of this pandemic will be no bad thing.
There is both a danger and a satisfaction to mostly cobbling together a nice sounding system from lots of second-hand parts though. The temptation for me is to do the same again with one or two of the other rooms in the house.
At first I thought they meant Bang & Olufsen, a high-end brand that prefixes all their products with Beo [0]. But I guess the industry is making Beryllium drivers now [1].
This is very poor opsec advice. Robust audio watermarking has been standard technology for many years now, and can be licensed from multiple vendors. If Zoom (or any other actor) cares enough to watermark their audio, you must assume that it may be hard to detect and remove.
A vocoder is probably the best off-the-shelf technology. Of course, a challenge with making invasive changes to the audio (in order to defeat watermarking) is that people may claim that the audio is fake/misrepresented. Vocoded audio will not sound like the original speakers, and may have artifacts. Lipsync may also be slightly off. So one would have to be careful to communicate these limitations, which the general public may not have much interest in understanding... Adversarial opponents may latch on to these things and use them to discredit the recordings.
A more conservative approach would be to transcribe the audio into text, and only offer the audio to (more) trusted parties for verification.
Reasonably effective stream watermarking happens every day and is done in the human vocal range with almost no listener impact.
In radio, Arbitron has a system that works well within the lower audio range, even on AM radio. AM is typically 5 kHz bandwidth.
They use a spectral masking technique able to encode ID bits into streams that can be decoded with portable devices.
PPM: Portable People Meter
Frankly, this kind of thing would go unnoticed by pretty much all listeners.
From the PDF I linked:
[...]all watermarking technologies use the well-known perceptual principle of “masking,” which was first reported in the early 20th century and is a core technical basis for mp3, AAC, and a host of data-rate reduction schemes.
In simple language, a loud burst of energy at one frequency will deafen the human auditory system to certain other audio components at nearby frequencies for a period of time before, during, and after the loud signal.
Consider the following illustration: A tone burst at 1.1 kHz with an intensity of 0 dB will hide (make imperceptible) an added signal at 1.11 kHz with a level of -30 dB for a period of about 10 ms before the burst and as much as 50 ms after the burst. However, modern signal-processing techniques can still detect the existence of this added 1.11 kHz component even though the ear cannot.
This is the basis of PPM and other similar watermarking technologies that use masking for determining the frequencies and intensity of the data that can be added for the station-identifying watermark.
The PPM system constructs 10 spectral channels in the region from 1.0 kHz to 3.0 kHz. The original program audio energy in each channel is evaluated for its ability to mask an added component. If that masking energy is insufficient, nothing is added. Conversely, if the energy in a channel is large enough, a tone is injected, chosen from one of four possible frequencies within the channel. For example, the channel centered at 1058 Hz might have one of the following four frequencies injected: 1046, 1054, 1062, or 1070 Hz.
Each of the four frequencies represents 2 bits of information. If we assume that this process repeats at a 500 ms rate, using all channels provides 40 bits per second or 2400 bits per minute of watermark code. Let’s further assume that a radio station is credited for a listener if any code is correctly detected within a 3-minute interval. With the very large number of encoded bits generated in 3 minutes (2400 x 3 = 7200 bits) and a station’s identification data needing perhaps only 50 bits, there is massive excess capacity for redundancy, error correction, and for audio that does not have enough high-frequency content for masking.
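To make the quoted scheme a bit more concrete, here is a rough Python sketch of one channel: it hides one of the four tones from the quote (1046, 1054, 1062, 1070 Hz) at -30 dB under a louder signal, then recovers the 2-bit value by measuring each candidate frequency with the Goertzel algorithm. This is not Arbitron's actual encoder or decoder. The 8 kHz sample rate, the pure 1058 Hz tone standing in for programme audio, the -30 dB level, and the mapping of bit values to tones are all assumptions, and the sketch does no real masking analysis.

```python
import math

FS = 8000              # sample rate in Hz (assumed)
SYMBOL_SECONDS = 0.5   # one symbol per 500 ms, as in the quoted example
N = int(FS * SYMBOL_SECONDS)

# The four candidate tones for the channel centred at 1058 Hz (from the quote).
# Which 2-bit value maps to which tone is an assumption.
TONES_HZ = [1046, 1054, 1062, 1070]

MASKER_HZ = 1058       # pure tone standing in for programme audio (assumed)
WATERMARK_DB = -30.0   # injected tone level relative to the masker (assumed)


def encode_symbol(two_bits):
    """Return N samples: a full-scale masker plus one quiet tone carrying 2 bits."""
    tone_hz = TONES_HZ[two_bits]
    gain = 10 ** (WATERMARK_DB / 20)
    return [
        math.sin(2 * math.pi * MASKER_HZ * n / FS)
        + gain * math.sin(2 * math.pi * tone_hz * n / FS)
        for n in range(N)
    ]


def goertzel_amplitude(samples, freq_hz):
    """Estimate the amplitude of a single frequency component (Goertzel)."""
    coeff = 2 * math.cos(2 * math.pi * freq_hz / FS)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    power = s1 * s1 + s2 * s2 - coeff * s1 * s2
    return 2 * math.sqrt(max(power, 0.0)) / len(samples)


def decode_symbol(samples):
    """Pick the candidate tone with the largest amplitude; return its 2-bit value."""
    amplitudes = [goertzel_amplitude(samples, f) for f in TONES_HZ]
    return amplitudes.index(max(amplitudes))


if __name__ == "__main__":
    for value in range(4):
        print(value, decode_symbol(encode_symbol(value)))
```

The 500 ms window is what makes tones only 8 Hz apart separable here: at these settings every frequency involved completes a whole number of cycles per window, so the loud masker does not leak into the candidate bins. Real programme audio is not that tidy, which is presumably part of why the quoted design leaves so much spare capacity.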
So if masking is used, I assume compressing the audio with any modern compression scheme from mp3 up should defeat that, shouldn't it (because they drop masked signals to save bandwidth)?
Depends. The Arbitron system works through the HD Radio codec, which is a wavelet codec. It is basically a hybrid: an mp3-type codec coupled with high-frequency reconstruction on the receiver side.
Interestingly, that literally means fake signals on the receiving end above 8 to 10 kHz! The cutoff was as low as 5 kHz when used for AM, and may still be. I have not kept up.
I could tell early on. It has improved a lot since then.
The Arbitron system appears robust. Noise, low signal quality, etc... do not generally impact it much. The effective bitrate needed is very low.
Given a larger sample of audio, it is likely to work.
A robust watermarking system will include some sort of error correction, so the answer is that it might; it depends on how much error the compression introduces.
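As a rough illustration of why partial damage may not be enough, here is a Python sketch that uses the figures quoted earlier in the thread (40 bits per second, a 3-minute window, a 50-bit ID) with the simplest possible error correction: repeat the ID and take a majority vote per bit. The repetition code and the 20% bit-error rate are assumptions made for illustration. A real system would likely use a stronger code, and the damage done by a codec would not be independent random bit flips.

```python
import random

ID_BITS = 50            # station ID length suggested in the quoted PDF
BITS_PER_SECOND = 40    # 10 channels x 2 bits every 500 ms, per the quote
WINDOW_SECONDS = 180    # the 3-minute crediting interval from the quote

total_bits = BITS_PER_SECOND * WINDOW_SECONDS   # 7200 bits in the window
copies = total_bits // ID_BITS                  # 144 repeats of the ID


def encode(id_bits, n_copies):
    """Repeat the whole ID n_copies times (a plain repetition code)."""
    return id_bits * n_copies


def corrupt(bits, error_rate, rng):
    """Flip each bit independently with probability error_rate."""
    return [b ^ 1 if rng.random() < error_rate else b for b in bits]


def decode(received, id_len):
    """Majority vote across all received copies of each ID bit."""
    decoded = []
    for i in range(id_len):
        votes = received[i::id_len]   # every copy of bit i
        decoded.append(1 if 2 * sum(votes) > len(votes) else 0)
    return decoded


if __name__ == "__main__":
    rng = random.Random(0)
    station_id = [rng.randint(0, 1) for _ in range(ID_BITS)]
    received = corrupt(encode(station_id, copies), 0.2, rng)
    print("copies:", copies)
    print("recovered:", decode(received, ID_BITS) == station_id)
```

With 144 copies of each bit, a vote only fails if at least half of them are flipped, which is very unlikely at a 20% error rate. A lossy codec would have to wipe out the watermark far more thoroughly than that to stop the ID being recovered.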
A purpose-built algorithm designed to thwart watermarking, however, is far more likely to be successful than a compression algorithm that is designed to maintain the integrity of the audio.
The phenomenon described by the quoted comment is called "temporal masking". There is "pre-masking", where a sound is rendered imperceptible by a sound that _follows_ it (your "forgetting" case). And there is "post-masking", where a sound is imperceptible because of a masking sound that preceded it. And yes, this is due to inherent slowness / lack of temporal resolution in the auditory system.
Temporal masking is widely exploited in all kinds of lossy audio compression (MP3, AAC etc.) to remove data that cannot be perceived anyway.
We just don't resolve detail to that temporal degree. You can't really "listen between" the periods of a 100 Hz sound, so being unable to recognize a 10 ms event preceding a much louder one is expected.
This is an entirely fair comment. And it's typical of my experience as well, and I have a fair amount related to audio, though not as extensive as yours.
My mind works differently when it comes to language and the scope of possible meanings is something I always consider relevant.
What concerned me here was someone taking the colloquial definition of "ultrasound" literally, and making assumptions that are not valid in this context at all.
What the word actually conveys is both a matter of subtlety and frequency.
Turns out, having read the entire discussion, both are relevant in terms of threat assessment, and thinking about what is said more deeply can have a positive impact on a discussion of this nature.
All of which is why I chose to point out what "ultrasound" actually does mean linguistically.
Edit: In my experience, such uses can and do happen. I personally allow for it and use context to parse. Where there is ambiguity, I generally won't dismiss it out of hand.
Subsonic comes to mind here. As does the question why the word did not appear regarding these watermarks.
The answer may just be someone with far less domain expertise attempting to communicate.
No, it means "beyond." Like you point out, "across" means something else. The Latin for "above" is supra or super.
> ultra-, prefix:
> 1. Signifying ‘lying spatially beyond or on the other side of’
> 2a. With adjectives, signifying ‘going beyond, surpassing, or transcending the limits of’ (the specified concept).
> Etymology: Latin ultrā beyond, employed as a prefix in the post-classical ultrāmundānus ultramundane, and the later ultrāmarīnus ultramarine, and ultrāmontānus ultramontane.
Your quick trip through the etymology triggered an opsec thought:
It may be worth giving a talk on broadening how people interpret words.
A pop culture reference would be Daniel Jackson from the series "Stargate SG-1".
We may often be constrained in our ability to understand and assess by our own preconceptions relating to language.
"Ultrasonic" was interpreted very differently by any number of us having this watermarking discussion. How often do we make assumptions about the possible field of play based on language basics?
How often do those fail to be sufficiently inclusive?
I bet it happens more than we realize.
Seems like a good basis for a DEFCON talk. "Where is Daniel Jackson when your team needs him?"
I was focused on the etymology; the actual usage of "ultrasonic" is generally confined to high pitches, not low.
Worthwhile point still, though I wouldn't have responded had the commenter not stated a specific incorrect definition. How does this connect to OPSEC though?
The relation goes right to threat and solution scopes. In this case, someone working from an incomplete definition may well also work within an incomplete set of greater assumptions.
There is what it could mean, what we take it to mean, and what it does mean.
Where those overlap or not could have a significant impact on behavior.
I wonder if this is another marketing gimmick, similar to the end-to-end encryption controversy they got into. I hope that by "ultrasonic" they just mean beyond hearing, and not that the watermark really lives exclusively in the ultrasonic frequency range.
Do they also talk about the process for identifying the participant who leaked the content based on the leaked recording? Do they need to retain the original copy of the recording to be able to extract the watermark?