by spullara 1264 days ago
The basic idea of the system was to scan a reverse chronologically ordered list of "user id, tweet id", filtering out any tweet whose user wasn't in the follow set (or sets in the case of scan sharing) until you retrieved enough tweets for the timeline request. There are a bunch of variants in the patent, but that is the basic idea. At the time, I estimated that Twitter was spending 80% of its CPU time in the DC doing Thrift/JSON/HTML serialization/deserialization and mused about merging all the separate services into a single process. Lots of opportunity for optimization.
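
For readers who want something concrete, here is a minimal Python sketch of the scan described above. It is an illustration only, not the patented implementation: the names `scan_timeline` and `shared_scan`, the in-memory list standing in for the log, and the `limit` parameter are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical types: the comment only says the log holds "user id, tweet id".
Entry = Tuple[int, int]  # (user_id, tweet_id), newest first


def scan_timeline(log: Iterable[Entry], follow_set: Set[int], limit: int) -> List[int]:
    """Scan newest-first, keep tweets from followed users, stop at `limit`."""
    timeline: List[int] = []
    if limit <= 0:
        return timeline
    for user_id, tweet_id in log:
        if user_id in follow_set:
            timeline.append(tweet_id)
            if len(timeline) >= limit:
                break  # enough tweets for this timeline request
    return timeline


def shared_scan(
    log: Iterable[Entry], follow_sets: Dict[str, Set[int]], limit: int
) -> Dict[str, List[int]]:
    """One pass over the log serving several follow sets (scan sharing)."""
    timelines: Dict[str, List[int]] = {name: [] for name in follow_sets}
    pending = set(follow_sets) if limit > 0 else set()
    for user_id, tweet_id in log:
        if not pending:
            break  # every request has been satisfied
        for name in list(pending):
            if user_id in follow_sets[name]:
                timelines[name].append(tweet_id)
                if len(timelines[name]) >= limit:
                    pending.discard(name)
    return timelines


if __name__ == "__main__":
    # Made-up sample data, newest entry first.
    log = [(1, 105), (2, 104), (3, 103), (1, 102), (2, 101)]
    print(scan_timeline(log, {1, 3}, limit=2))
    print(shared_scan(log, {"a": {1}, "b": {2, 3}}, limit=2))
```

The cost of a request in this sketch depends on how far back the scan has to go before it finds enough matching tweets, which is why sharing a single pass across several follow sets is attractive.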
2 comments

Interesting, though 80% seems a bit on the high end nowadays? For example, Google quantified this as the "datacenter tax" and, through their cluster-wide profiling tooling, saw that it was 22-27% of all CPU cycles (still a huge amount). They go a different route and suggest hardware accelerators for common operations. Datacenter tax was defined as:

"The components that we included in the tax classification are: protocol buffer management, remote procedure calls (RPCs), hashing, compression, memory allocation and data movement."

https://static.googleusercontent.com/media/research.google.c...

This was back when there was zero encryption and zero compression, everything used Thrift, and there was little actual business logic.
Could you give some insight into why such a system never replaced the existing implementation?
It is extremely difficult to change out data formats.