by dxbydt 2563 days ago

> what problems people solve that they need to call IO in the middle of Dijkstra’s algorithm.

I worked on several problems of this nature at Twitter in 2012. Hopefully there’s a better way to solve them in 2019...prolly not, but maybe.

Say you want to find the median number of followers a person on Twitter has. That should be easy: make one dataframe with the follower count of each bloke and call median(). Well, there are some 300,000,000 blokes, so not that easy :) You have to build the dataframe via ETL: reading & writing to disk 100s of times, loading a few thousand users each time, then running a distributed median computation. So a silly sub-second median query took 2 months to code up & debug, and ran for a few hours due to so much IO.
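For illustration only, here is a minimal sketch of a chunked median, assuming the follower counts are non-negative integers that arrive a few thousand at a time. It is not the pipeline described above; the function name `median_from_chunks` and the sample chunks are made up. The idea is that per-chunk histograms are additive, so no single step needs the whole dataset in memory at once.

```python
from collections import Counter
from typing import Iterable


def median_from_chunks(chunks: Iterable[Iterable[int]]) -> float:
    """Exact median of integer values that arrive chunk by chunk.

    Only a histogram (value -> occurrences) is kept in memory, so the
    cost depends on the number of distinct values, not on the number
    of users.
    """
    hist = Counter()
    total = 0
    for chunk in chunks:
        for value in chunk:
            hist[value] += 1
            total += 1
    if total == 0:
        raise ValueError("median of empty data")

    # 0-indexed ranks of the middle element(s) in sorted order.
    lo_rank = (total - 1) // 2
    hi_rank = total // 2

    seen = 0
    lo_val = None
    hi_val = None
    for value in sorted(hist):
        seen += hist[value]
        if lo_val is None and seen > lo_rank:
            lo_val = value
        if seen > hi_rank:
            hi_val = value
            break
    return (lo_val + hi_val) / 2


# Hypothetical chunks standing in for batches read from disk.
example_chunks = [[1, 5, 3], [2, 4], [100]]
example_median = median_from_chunks(example_chunks)
```

In a distributed setting each worker could build its own `Counter` and the partial histograms could be summed before the final walk, though the IO cost of reading the data is still there.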

Another, much harder problem: you want to find the median number of hops between one user & another. So now you have 300m x 300m tuples as your result; where & how to store them is in itself a monstrous challenge. But how the heck do you even compute the result? You read in one tweet from john to steve, so that’s 1 hop from john to steve & vice versa. You then read a second tweet from steve to mary, so that’s 1 hop from steve to mary & vice versa, and 2 hops from john to mary & vice versa. In this manner you read 100s of billions of tweets & keep updating the hopcount. Somewhere in there john sends mary a tweet - oh fuck, now the hopcount is 1, not 2. This will then change lots of other hopcounts. In theory there are nice graph algos for this sort of thing. But in reality, your data is billions of tweets, constantly increasing, stored in distributed compute clusters across the planet, & just getting a handle on all this can be a 6 month project for some lucky scientist who gets to work on it.
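To make the hopcount update concrete, here is a toy in-memory sketch, assuming an undirected, unweighted graph where each tweet adds one edge. The class `HopCounts` and its methods are hypothetical names, not anything from Twitter. When an edge (u, v) arrives, a shortest path can use it at most once, so every pair (a, b) only needs to be compared against the route that goes through the new edge.

```python
import math


class HopCounts:
    """All-pairs hop counts, updated one edge (tweet) at a time."""

    def __init__(self):
        self.nodes = set()
        self.dist = {}  # dist[a][b] = hops, stored only for reachable pairs

    def hops(self, a, b):
        if a == b:
            return 0
        return self.dist.get(a, {}).get(b, math.inf)

    def add_edge(self, u, v):
        self.nodes.update((u, v))
        if u == v:
            return
        # Snapshot the old distances to both endpoints before updating.
        to_u = {x: self.hops(x, u) for x in self.nodes}
        to_v = {x: self.hops(x, v) for x in self.nodes}
        for a in self.nodes:
            for b in self.nodes:
                if a == b:
                    continue
                # Best route from a to b that crosses the new edge once.
                via = min(to_u[a] + 1 + to_v[b], to_v[a] + 1 + to_u[b])
                if via < self.hops(a, b):
                    self.dist.setdefault(a, {})[b] = via


graph = HopCounts()
graph.add_edge("john", "steve")
graph.add_edge("steve", "mary")
hops_before = graph.hops("john", "mary")  # john -> steve -> mary
graph.add_edge("john", "mary")            # john tweets mary directly
hops_after = graph.hops("john", "mary")
```

Each insertion touches every pair of known users, which is fine for four people and hopeless for 300m: that quadratic blow-up is the storage and compute problem described above.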

1 comment

bsder 2563 days ago

> I worked on several problems of this nature at Twitter in 2012. Hopefully there’s a better way to solve them in 2019...prolly not, but maybe.

Okay, so Twitter has scale. To a first, second and third order engineering approximation--nobody else does.

If you are a mere mortal writing practically anything, pull it all into memory, operate on it to create another copy, destroy the original copy (or let GC kill it).
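A small sketch of that pattern, assuming the data fits comfortably in RAM. The helper names and the sample lines are invented for illustration: load everything, build a new copy with the transformation applied, and let the old copy be garbage collected.

```python
import statistics


def load_counts(lines):
    """Parse 'user,followers' lines into a list of (user, followers) tuples."""
    rows = []
    for line in lines:
        user, followers = line.strip().split(",")
        rows.append((user, int(followers)))
    return rows


def drop_inactive(rows):
    # Build a fresh list rather than mutating the input in place.
    return [row for row in rows if row[1] > 0]


# Hypothetical input standing in for a file read fully into memory.
raw_lines = ["john,120", "steve,0", "mary,45", "ann,300"]

rows = load_counts(raw_lines)
rows = drop_inactive(rows)  # rebinding drops the last reference to the old list
median_followers = statistics.median(count for _, count in rows)
```

No chunking, no intermediate files: the whole job is a couple of list passes and a standard-library call.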

Embedded programmers might get a pass on this given limited memory (32K RAM)--but that same kind of attitude is getting more and more essential as you start getting Big/Little core mixes on the same chip.

Computers are mind-bogglingly powerful.

I have been completely stunned at how many transactions Nginx+Django+PostgreSQL can actually handle before you need to start thinking about "scaling".
