by cryptarch 3464 days ago
It would cost about $30m a year if you tailor the system to flag specific data for storage and don't naively store every moment (e.g. you discard silent moments and use VBR encoding).

Storing a year's worth of 96kbps audio takes about 380GB. If you don't record silence and you assume the people around an Alexa are only speaking for at most 4 hours a day on average, that goes down to 76GB a year.
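
For anyone who wants to check the per-device figure, here is a rough sketch in Python. It assumes constant-bitrate audio, decimal units (1 GB = 10^9 bytes) and a 365-day year; the function name is made up for illustration. With exactly 4 hours a day it gives roughly 63GB, so the 76GB figure corresponds to closer to 4.8 hours of speech a day.

```python
DAYS_PER_YEAR = 365  # assumption: non-leap year


def annual_storage_gb(bitrate_kbps: float, hours_per_day: float = 24.0) -> float:
    """Decimal gigabytes needed for one year of audio at a constant bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    recorded_seconds = hours_per_day * 3600 * DAYS_PER_YEAR
    return bytes_per_second * recorded_seconds / 1e9


# Continuous recording at 96kbps, then speech only (4 hours a day).
print(round(annual_storage_gb(96), 1))
print(round(annual_storage_gb(96, hours_per_day=4), 1))
```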

So if you then assume 5m Alexas are active at any given point in time, that works out to 380k PB. OK, that doesn't work yet.

However, if you then layer on a flagging system, where only certain users' full record is stored, or only "suspicious incidents" are stored, and you get this down to only flagging 0.1% of all data, you arrive at 380PB of storage.

Amazon Glacier costs about $88,000 a year per PB, but there's a profit margin included in that, so I'll assume it costs Amazon just $75k a year.

In conclusion, it would cost Amazon about $28.5m a year to run such a system. That's certainly within the realm of possibility and of what LE/SIGINT clients would pay; I assume the NSA would gladly pay that sum x100 for that capability. Sounds like it'd be booming business for Amazon.
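
The whole estimate is easy to re-run with other inputs. Below is a minimal sketch, assuming decimal units (1 PB = 10^6 GB) and the figures from the post; the function and parameter names are hypothetical. Worth noting: with these units, 76GB times 5m devices comes to about 380PB before any flagging, so flagging 0.1% leaves roughly 0.38PB and the yearly cost lands around three orders of magnitude below $28.5m. If anything, that makes the conclusion stronger.

```python
def flagged_storage_cost(gb_per_device: float, devices: int,
                         flagged_fraction: float, usd_per_pb_year: float):
    """Return (petabytes stored, yearly cost in USD) for the flagged subset."""
    total_pb = gb_per_device * devices / 1e6   # 1 PB = 10^6 GB (decimal units)
    stored_pb = total_pb * flagged_fraction    # only the flagged share is kept
    return stored_pb, stored_pb * usd_per_pb_year


# Inputs taken from the post: 76GB per device, 5m devices, 0.1% flagged, $75k per PB-year.
stored_pb, cost = flagged_storage_cost(76, 5_000_000, 0.001, 75_000)
print(f"{stored_pb:.2f} PB stored, ${cost:,.0f} per year")
```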

5 comments

I think you're orders of magnitude off in marginal cost estimates for Glacier users. Datacenters are being built out for a small number of commercial users (e.g. Amazon's core business), and the size of modern HDDs would lead me to estimate that storage is about free in a modern datacenter; the scarce resource is disk-time for read/write operations. That is, projects like Glacier let Amazon sell disk that would have otherwise been stranded.

It is also the case that a consumer-level service like Glacier presumably has more redundancy than what might be needed for best-effort storage of these recordings, where losing any fraction of them wouldn't really be a problem.

I'm not in the datacenter business, so I've been conservative for lack of experience with storage at PB scale.

I've chosen to err on the side of estimating it to be more expensive, because I think that makes the end result more convincing:

$30m is chump change for parties like Amazon, and in reality it'll cost significantly less. $1m might well do. Maybe it's less still. You could combine flagging users with flagging low-certainty or keyword-containing transcriptions.

Either way, you don't need collusion with intelligence parties, just an unscrupulous or naive exec at Amazon who thinks the data might be worth a lot for training future learning models. Of course the more sinister but legal reselling to government agencies is a financially attractive option as well.

I really like the math here, but isn't this a bit pointless? The system wants to parse meaning from audio; storing just the text it parsed is a lot smaller. Store just the text and a machine learning score for how likely it is that the text was parsed correctly, and that sounds like something prosecutors would love to bring into court: "Please read this line and let's see what score you get..."
For improvements they'd store the raw input so that when a mistake happens they can manually try to figure out why the machine got it wrong (e.g. a hi-hat was hit while they were saying "deuce" so it sounded like "douche").
The raw speech would still be very useful as training data for new and improved models.
It could also store compressed voice waveforms in such a way that any reproduction from the compressed data would sound horrible but would be at least somewhat intelligible to human listeners.

1200 bits per second is almost enough for toll-quality speech -- and I'm referring to the state of the art a few years ago. Speech codecs are probably better now. But let's stick with 1200 bps. That's enough to store continuous speech in the vicinity of the device for a year, using only about 5 GB.

My guess is that if you cared only about intelligibility and not fidelity, you could do the job with 10%-20% of that space.

So yes: Alexa could easily be collecting and storing a vast amount of data that isn't immediately transmitted or used.
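
A quick sanity check on those numbers, as a sketch only. It assumes a constant 1200 bps stream recorded around the clock, a 365-day year and decimal gigabytes; the helper name is invented for the example. It comes out a bit under 5 GB for the full stream and somewhere between roughly half a gigabyte and one gigabyte for the 10%-20% case.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def continuous_gb_per_year(bits_per_second: float) -> float:
    """Decimal gigabytes for one year of nonstop audio at the given bitrate."""
    return bits_per_second / 8 * SECONDS_PER_YEAR / 1e9


full = continuous_gb_per_year(1200)
print(f"1200 bps: {full:.2f} GB/year")

# The intelligibility-only guess: 10% to 20% of that space.
for fraction in (0.10, 0.20):
    print(f"{fraction:.0%} of that: {full * fraction:.2f} GB/year")
```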

96kbps is pretty high for voice. You could get away with 48 or less.
64kbps would be sufficient to hear background noise and whispers, so that would make more sense.
Way less with modern speech codecs. Even Opus at 32kbps would be overkill for the required quality.
Opus is quite usable for speech data down to 8kbit/s, even 6kbit/s is mostly understandable. At 10-12kbit/s you have good quality voice recordings.
Based on the costs of disks alone, their cost per PB is actually around $35k. They likely get a volume discount, so we can lower that estimate even more and say $25-30k. Bandwidth is essentially free even though they charge ridiculous amounts of money for it on their services. You can get a 10Gbps link for as low as $2,000 if you buy it on-net and in bulk; Amazon probably gets it even cheaper. So ~3PB/month for $1,000/month.
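
The ~3PB/month figure can be reproduced with a short sketch. It assumes a 30-day month, a link that is saturated the whole time (the `utilization` parameter is there to relax that) and decimal petabytes; both function names are hypothetical.

```python
def pb_per_month(link_gbps: float, days: int = 30, utilization: float = 1.0) -> float:
    """Decimal petabytes a link can move in a month at the given utilization."""
    bytes_per_second = link_gbps * 1e9 / 8 * utilization
    return bytes_per_second * days * 86_400 / 1e15


def usd_per_pb(monthly_price_usd: float, link_gbps: float) -> float:
    """Transfer cost per petabyte, assuming the link is kept full all month."""
    return monthly_price_usd / pb_per_month(link_gbps)


print(f"{pb_per_month(10):.2f} PB/month on a 10Gbps link")
print(f"${usd_per_pb(2000, 10):,.0f} per PB at $2,000 for the link")
```

Real links are rarely kept full, so passing something like `utilization=0.5` gives a more cautious number.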
Pointer? My understanding was that economy 10Gbps transit would be closer to $4-5k/month.
http://www.he.net/

"Get BGP+IPv6+IPv4 for $0.25/Mbps!"

I thought it was HE, but it must have been someone else that had a 10Gig deal for $2,000. Either way, that's for a single 10Gig link. If you're buying 100Gbps to 1Tbps like Amazon is, you're probably getting an even better deal.

We just signed a contract with Level 3 for slightly more than the price you mentioned, but they had to build into us, which cost them ~$120k out of pocket, thus the higher price.

Actually you really only need to store about 1 month of data. If the police request something, Amazon can lock the account from auto-deletion.
How so?

What about abduction cases, insider trading, tax fraud, drug and human smuggling? There it could help to have data from months ago, so any newly discovered targets instantly come with a bunch of evidence.