by twisteriffic 854 days ago
> The incident was caused by a third-party caching client library that was recently integrated into our system. This client library received unprecedented load conditions caused by devices coming back online all at once. As a result of increased demand, it mixed up device ID and user ID mapping and connected some data to incorrect accounts.

That seems like enough of a line of bullshit to steer me away from ever using wyze.

11 comments

Do you think the issue was something else? "People randomly see other people's content" is an issue that would immediately make me think some issue with caching is the culprit.

Given their openness in the rest of the communications, I don't see why they would make this part up.

Edit: Of course, I'm also curious what the actual bug was. A discussion below is suggesting several plausible ways (e.g. concurrency issues, insufficient entropy in some key) how a problem could happen under load (although many of these would also lead to the problem happening with less load, just much less often).

> Do you think the issue was something else?

No, I'm not questioning whether or not it was a caching issue. I'm taking exception to the lack of accountability. They chose the library. They (probably) chose to ignore a documented or common failure mode of caching systems through either poor choice of key or lack of synchronization. They've obviously designed their infrastructure in a way that isn't resilient to its current level of usage (cold start is a normal part of software's lifecycle).

They could have chosen to own that, but instead they blamed everyone else. That's not a sign of a trustworthy service provider.

It's not even that: the quoted language doesn't even blame the library - it appears to blame increased load.

"As a result of increased demand, it mixed up device ID" - no, it mixed up IDs as a result of some sort of a concurrency bug. I don't understand the point of deflecting this far.

Likely to be a multi-threading issue; my bet is the cache client wasn't thread-safe. I've seen this in some apps before and the solution was to turn off multi-threading while we debug the library that was causing the issue.
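
To make the thread-safety guess concrete, here is a minimal sketch of how a shared connection can hand one caller another caller's reply. Everything in it is hypothetical (`FakeConnection`, `LockedClient`, the `device:`/`user:` keys); nothing is known about the library Wyze actually used. The bad interleaving is written out by hand instead of relying on real threads, so it reproduces every time.

```python
import threading


class FakeConnection:
    """Hypothetical stand-in for one socket: replies arrive in request order."""

    def __init__(self, store):
        self.store = store
        self.pending = []  # FIFO of replies that have not been read yet

    def send_get(self, key):
        self.pending.append(self.store.get(key))

    def read_reply(self):
        return self.pending.pop(0)


class LockedClient:
    """Serializes each request/reply pair so replies cannot be swapped."""

    def __init__(self, conn):
        self.conn = conn
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            self.conn.send_get(key)
            return self.conn.read_reply()


store = {"device:1": "user:alice", "device:2": "user:bob"}

# Unsafe interleaving of two callers sharing one connection without a lock.
conn = FakeConnection(store)
conn.send_get("device:1")            # caller A sends its request
conn.send_get("device:2")            # caller B sends before A has read
reply_seen_by_b = conn.read_reply()  # B reads first and receives A's reply
reply_seen_by_a = conn.read_reply()  # A is left with B's reply

# With the lock, each lookup stays paired with its own key.
client = LockedClient(FakeConnection(store))
safe_a = client.get("device:1")
safe_b = client.get("device:2")
```

With one caller at a time the unsafe path never misbehaves, which would fit a bug that only shows up when many devices reconnect together.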
This is the answer!
> I'm taking exception to the lack of accountability.

I'll bet money that their statement was run through legal and stripped of all possible blamey statements.

> They could have chosen to own that, but instead they blamed everyone else. That's not a sign of a trustworthy service provider.

I agree. Companies need to own up to their fuckups, even when legal tells them that it can hurt. Because all companies will fuck up; how they handle it is the differentiator.

> I'm also curious what the actual bug was

Hardware. Rowhammer-type effects occurring accidentally under sudden load spikes. The hardware has just got too dense.

(I should clarify this is speculation, but reading the recent article included here on sudo using special maximum-distance bitfields to hold state internally (https://news.ycombinator.com/item?id=39165342)... it must be a problem that's being observed in the wild)

I can't imagine that happening with a sufficient frequency. A system making such mistakes so often would just be too unstable to keep an uptime >1h.
With the 'cattle not pets' mindset that pervades modern development, is the lifespan of ephemeral cache VMs that closely monitored? They get spun up and down on demand in most architectures. I can see this being an edge case failure when the system is trying to scale up, the existing VMs are getting absolutely hammered, the hypervisor is trying to start up new ones, memory pressure and iops on the existing ones are maxed out...

It just seems like the most obvious root cause to me, a single bit-flip in a hashed value is going to give you the wrong result data without any other error because the hash value is already essentially heavily compressed, meanwhile the hash table is almost certain to be 100% stored in memory and very heavily accessed from multiple directions in a read/write manner.

No. That's not how that works.
How so? I've seen caching clients exhibit some really weird behaviour under heavy load. It's not beyond the pale that, eg, the caching library doesn't do proper locking before writing, resulting in writes stomping all over each other.

Caching is normally read heavy, not write heavy, so it's plausible it wouldn't be something you'd see much under typical operation. After an outage, they'd be dealing with a thundering herd level of traffic as everything tries to reconnect, that'd be very different from normal write loads, even different than the write load they'd have seen when they first enabled caching.
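
A rough sketch of the "writes stomping all over each other" case, assuming a hypothetical client that stages every write in shared scratch fields without locking. The class and key names are made up for illustration. Generators stand in for threads so the interleaving is deterministic; each `yield` marks a point where another writer could be scheduled.

```python
import threading


class UnsafeCacheWriter:
    """Hypothetical client that stages each write in shared scratch fields."""

    def __init__(self):
        self.data = {}
        self.scratch_key = None
        self.scratch_value = None
        self.lock = threading.Lock()

    def set_steps(self, key, value):
        """Unlocked write, split into steps; each yield is a preemption point."""
        self.scratch_key = key
        yield
        self.scratch_value = value
        yield
        self.data[self.scratch_key] = self.scratch_value

    def set_locked(self, key, value):
        """Safe variant: no shared staging, and the commit happens under a lock."""
        with self.lock:
            self.data[key] = value


unsafe = UnsafeCacheWriter()
a = unsafe.set_steps("device:1", "user:alice")
b = unsafe.set_steps("device:2", "user:bob")

next(a)        # A stages its key
next(b)        # B overwrites the staged key
next(b)        # B stages its value
next(a)        # A overwrites the staged value
next(b, None)  # B commits device:2 -> user:alice
next(a, None)  # A commits the same mixed-up pair

safe = UnsafeCacheWriter()
safe.set_locked("device:1", "user:alice")
safe.set_locked("device:2", "user:bob")
```

After the unsafe run the cache holds a single entry mapping `device:2` to `user:alice`. With only an occasional writer the window between staging and committing is almost never hit, which matches the point about read-heavy caches looking fine until a reconnect storm.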

Yes, but either the library is seriously buggy (like, expecting writes to be ordered and screwing things up if it gets too many writes for different objects at the same time) or there was some serious bug in their implementation. Anyway, the attitude and the message in the communication seem like hand-washing to me. I might be too cynical, though.
How else would you say a 3rd party library had a bug under heavy load? 1. You don't want a defamation lawsuit coming your way. 2. If it was vendor code, you have a contract that may be under an NDA. 3. If it was a vendor, there were lawyers, lots and lots of lawyers, so they likely had to say the minimal amount. The fact they sent out communications for each type of incident in such a short time was great.
The problem is how much they're pointing fingers at the library in the first place.
I might be splitting hairs, but they say that the incident was "caused by a third party library" when in fact, the incident was caused by insufficient testing on their part.

It sounds like they're trying to shift blame for the incident but then they try to pat themselves on the back for all the effort they put into security. It comes across as dishonest.

Technical details are appreciated but they should've emphasized that this is their own fault. Bonus points if they commit to at least consider E2EE which would sidestep the issue.

Yeah, I would very much like to know what caching library has a failure mode of returning content for the wrong keys. That seems pretty bad, if not a highly suspect explanation.
Same thing happened to OpenAI. Will you steer clear of OpenAI forever as well?

https://news.ycombinator.com/item?id=35294082

... Yes?
The whole thing points at everyone but themselves... "Originated at AWS" then "caused by a caching library"

Very little ownership on Wyze's side.

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

And off-by-one errors
...and off-by-one errors!
I would at least want to know the client library so I can never use it for anything. Also, never trust the client; I hope they don't mean the actual client app, such that it can access another user's ID without server validation...
Coincidentally, I just cancelled my wyze service because the product and support are so terrible. I wanted a simple way to see if there was a package on my doorstep but instead I got something that alerts me when any dog, person, or vehicle goes down my street, and all I’ve gotten from support is robotic responses suggesting I update my firmware and ignoring my direct questions, running out the clock on my ability to return the thing. At this point I’m not surprised their engineering is bad and amused that it’s caused two different security incidents.
Device id and user ids are non unique?
hash collisions?
I'd bet it's this, plus something even stupider like hashing a connection timestamp millisecond as the "uniqueness" of the hash. I've seen a lot of terrible code implementations that assume that there will never be two clients connecting in the exact same millisecond
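
A small sketch of that failure mode, assuming (purely as a guess, like the comment above) that a cache key is derived only from the connection timestamp in milliseconds. The function names and key layout are hypothetical.

```python
import hashlib


def weak_key(connect_time_ms: int) -> str:
    """Key built only from the connect time: same millisecond, same key."""
    return hashlib.sha256(str(connect_time_ms).encode()).hexdigest()


def stronger_key(connect_time_ms: int, device_id: str) -> str:
    """Key that also includes the device identity, so timestamps may repeat."""
    material = f"{device_id}:{connect_time_ms}".encode()
    return hashlib.sha256(material).hexdigest()


t = 1_700_000_000_000  # two devices reconnect in the same millisecond

weak_cache = {}
weak_cache[weak_key(t)] = "user:alice"  # device:1 connects
weak_cache[weak_key(t)] = "user:bob"    # device:2 silently replaces the entry

strong_cache = {}
strong_cache[stronger_key(t, "device:1")] = "user:alice"
strong_cache[stronger_key(t, "device:2")] = "user:bob"
```

The hash function is not the weak point here; the input is. Hashing a value that two clients can share gives two clients the same key, and the odds of sharing a millisecond go up sharply when everything reconnects at once.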
This sounds like a pretty decent guess to me. I bet you are right.
Sounds like it's redis-py again...
sounds like a hashing function with insufficient entropy. "increased demand" would lead to a higher likelihood of hash collisions.
Not sure I follow. Hash functions don't require entropy, and a hash collision in a hash map shouldn't cause incorrect data to be returned (it just makes them less efficient).
I think that they are saying that the output space, i.e. the list of all possible hashes, is too small. Thus, IDs 1234 and 5678 lead to the same hash.

The collision is not in the insertion into the hash map but rather in the look up.
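
Both points can be true at once, and a sketch may help. A proper hash map compares the full key after hashing, so a collision only costs time. A cache that stores and looks up by a small hash alone will return another key's data. The 8-bit `tiny_hash` and both cache classes below are hypothetical, chosen so a collision is guaranteed to exist among 257 keys.

```python
import hashlib


def tiny_hash(key: str) -> int:
    """Deliberately small output space: the first byte of SHA-256 (256 values)."""
    return hashlib.sha256(key.encode()).digest()[0]


def find_collision():
    """Pigeonhole: 257 distinct keys into 256 hash values must collide."""
    seen = {}
    for i in range(257):
        key = f"device:{i}"
        h = tiny_hash(key)
        if h in seen:
            return seen[h], key
        seen[h] = key
    raise AssertionError("unreachable with a 256-value hash")


class HashOnlyCache:
    """Stores by hash alone, so colliding keys share one slot."""

    def __init__(self):
        self.slots = {}

    def put(self, key, value):
        self.slots[tiny_hash(key)] = value

    def get(self, key):
        return self.slots.get(tiny_hash(key))


class KeyCheckingCache:
    """Keeps the full key next to the value and verifies it on lookup."""

    def __init__(self):
        self.slots = {}

    def put(self, key, value):
        self.slots[tiny_hash(key)] = (key, value)

    def get(self, key):
        entry = self.slots.get(tiny_hash(key))
        if entry is None or entry[0] != key:
            return None  # report a miss instead of someone else's data
        return entry[1]


first, second = find_collision()

naive = HashOnlyCache()
naive.put(first, "user:alice")
wrong_hit = naive.get(second)    # returns data stored under a different key

careful = KeyCheckingCache()
careful.put(first, "user:alice")
clean_miss = careful.get(second)  # None: the stored key does not match
```

More traffic means more distinct keys live in the cache at once, so a too-small hash space would collide more often under a reconnect storm, but only the hash-only design turns that into wrong data.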