Hacker News new | ask | show | jobs
by pacala 2563 days ago
I wish I understood what problems people solve day in and day out such that they need to call IO in the middle of Dijkstra’s algorithm. Most "business logic" ought be pure functions over persistent data structures with no IO other than the occasional logging. At the last possible moment, 5% of the code wires in IO in some boilerplaty way. Is that sufficiently pervasive to worry about it?
6 comments

> what problems people solve that they need to call IO in the middle of Dijkstra’s algorithm.

I worked on several problems of this nature at Twitter in 2012. Hopefully there’s a better way to solve them in 2019...prolly not, but maybe.

say you want to find the median of the number of followers a person on twitter has. so that should be easy - make 1 dataframe with follower count of each bloke and call median() - well, there’s some 300,000,000 blokes, so not that easy :) You have to make a dataframe via ETL - reading & writing to disk 100s of times, loading a few thousand users each time, distributed median computation. so a silly sub-second median query took 2 months to code up & debug & ran for a few hours due to so much IO.

another much harder problem - you want to find the median number of hops between one user & another. so now you have 300m x 300m tuples as your result - where & how to store them is in itself a monstrous challenge. but how the heck do you even compute the result ? you read in one tweet from john to steve, so that’s 1 hop from john to steve & viceversa. you then read a second tweet from steve to mary, so that’s 1 hop from steve to mary & viceversa, 2 hops from john to mary & viceversa. in this manner you read 100s of billions of tweets & keep updating hopcount. somewhere in there john sends mary a tweet - oh fuck now the hopcount is 1, not 2. this will then change lots of other hopcounts. in theory there are nice graph algos for this sort of thing. but in reality, your data is billions of tweets constantly increasing, stored in distributed compute clusters across the planet & just getting a handle on all this can be a 6 month project for some lucky scientist who got to work on this.

> I worked on several problems of this nature at Twitter in 2012. Hopefully there’s a better way to solve them in 2019...prolly not, but maybe.

Okay, so Twitter has scale. To a first, second and third order engineering approximation--nobody else does.

If you are a mere mortal writing practically anything, pull it all into memory, operate on it to create another copy, destroy the original copy (or let GC kill it).

Embedded programmers might get a pass on this given limited memory (32K RAM)--but that same kind of attitude is getting more and more essential as you start getting Big/Little core mixes on the same chip.

Computers are mind-bogglingly powerful.

I have been completely stunned at how many transactions Nginix+Django+PostgreSQL can actually handle before you need to start thinking about "scaling".

> I wish I understood what problems people solve day in and day out such that they need to call IO in the middle of Dijkstra’s algorithm.

For Dijkstra, imagine a very large graph that cannot fit into memory, where you'll need to go out to disk or network to compute the distance of two nodes (or fetch part of the graph, etc).

The business logic many (perhaps most) programmers work on is primarily occupied with gluing together multiple state stores (databases, caches, message queues), running some very simple computations, and writing the output to some IO sink (often a web response).

Sure, you could extract the computations part out. But that barely moves the needle on testability/cleanliness, because most of the business (or business-value-driving) logic is the data flow management--highly stateful IO coordination.

Yes. So much business logic requires IO. Pure business logic is a great dream, but it's a dream.
I/O, both disk and network, can happen almost anywhere because data doesn't always fit in memory, may be streamed from a remote source, and I/O handling cannot always be deferred indefinitely. A significant part of systems software design is accounting for this reality.
One of the primary issues I run into this is that some parts of the domain logic determine what data you need from the database and this logic can't be easily moved into sql(or however you're specifying what data you need back from the datasource).