|
|
|
|
|
by pacala
2563 days ago
|
|
I wish I understood what problems people solve day in and day out such that they need to call IO in the middle of Dijkstra’s algorithm. Most "business logic" ought be pure functions over persistent data structures with no IO other than the occasional logging. At the last possible moment, 5% of the code wires in IO in some boilerplaty way. Is that sufficiently pervasive to worry about it? |
|
I worked on several problems of this nature at Twitter in 2012. Hopefully there’s a better way to solve them in 2019...prolly not, but maybe.
say you want to find the median of the number of followers a person on twitter has. so that should be easy - make 1 dataframe with follower count of each bloke and call median() - well, there’s some 300,000,000 blokes, so not that easy :) You have to make a dataframe via ETL - reading & writing to disk 100s of times, loading a few thousand users each time, distributed median computation. so a silly sub-second median query took 2 months to code up & debug & ran for a few hours due to so much IO.
another much harder problem - you want to find the median number of hops between one user & another. so now you have 300m x 300m tuples as your result - where & how to store them is in itself a monstrous challenge. but how the heck do you even compute the result ? you read in one tweet from john to steve, so that’s 1 hop from john to steve & viceversa. you then read a second tweet from steve to mary, so that’s 1 hop from steve to mary & viceversa, 2 hops from john to mary & viceversa. in this manner you read 100s of billions of tweets & keep updating hopcount. somewhere in there john sends mary a tweet - oh fuck now the hopcount is 1, not 2. this will then change lots of other hopcounts. in theory there are nice graph algos for this sort of thing. but in reality, your data is billions of tweets constantly increasing, stored in distributed compute clusters across the planet & just getting a handle on all this can be a 6 month project for some lucky scientist who got to work on this.