by tytso
1412 days ago
Oh, you can certainly do big projects. My project[1] spanned 3 departments, involved dozens of engineers, and required that we work with multiple hard drive vendors (our first two partners for Hybrid SMR were Seagate and WDC) on an entirely new type of HDD, as well as with the T10/T13 standards committees[2] so we could standardize the commands that we need to send to these HDDs. So this was all a huge amount of "new shit" that was not only new to Google, it was new to the HDD industry. You just have to have a really strong business case that shows how you can save Google a large amount of money.

[1] https://blog.google/products/google-cloud/dynamic-hybrid-smr...
[2] https://www.t10.org/pipermail/t10/2018-September/018566.html

On the production kernel team, colleagues of mine worked on some really cool and new shit: ghOSt, which delegates scheduling decisions to userspace in a highly efficient manner[3]. It was published in SOSP 2021/SIGOPS[4][5], so peer reviewers thought it was a pretty big deal. I wasn't involved in it, but I'm in awe of this cool new work that my peers in the prodkernel team created, all of which was not only described in detail in peer-reviewed papers, but also published as Open Source.

[3] https://research.google/pubs/pub50833/
[4] https://www.youtube.com/watch?v=j4ABe4dsbIY
[5] https://dl.acm.org/doi/10.1145/3477132.3483542

We have some really top-notch engineers in our production kernel team, and I'm very proud to be part of an organization that has this kind of talent.
For example:
RePD is at just the wrong level altogether. It should have been at the CFS/chunk level, where it would benefit other teams as well.
The BigStore stack is beyond bizarre. For years there were no object-level SLOs (not sure if there are now), which meant that sometimes your object disappeared and the BigStore SREs were "la-la-la, we are fully within SLO for your project". Or you would delete something and not get your quota back, and they would say "oh, the Flume job got stuck in this cell, for a week...".
Not a single cloud (or internal, for that matter) customer asked for a "block device"; they all just want to store files. Which means that cloud posix/nfs/smb should have been worked on from day 1 (of cloud), and we all know how that went.