by jonesnc 972 days ago
Why does the randomization have to happen in the database query? Assuming there aren't any large gaps in the distribution of IDs, and if you know the max_id, couldn't you pick a random number between min_id and max_id and get the record whose ID matches that random number?

Precisely because of the gaps. Tons of Wikipedia article IDs aren't valid for random selection because they've been deleted, or because they belong to a disambiguation page, a redirect, a talk page, a user page, or whatever else.

My comment covered your suggestion already -- that's why I wrote "you encounter the same problem where the id gaps in deleted articles make certain articles more likely to be chosen".
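
To make the bias concrete, here is a minimal sketch. It assumes a made-up four-row table and the common "take the next valid ID at or above the random number" fallback; neither comes from Wikipedia's actual schema. It enumerates every possible draw rather than sampling, so the skew is exact.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical table: IDs 4 through 9 were deleted, leaving a gap.
valid_ids = [1, 2, 3, 10]

def pick_next_valid(r, ids):
    """Return the smallest valid ID >= r (ids must be sorted)."""
    return ids[bisect_left(ids, r)]

# Enumerate every possible draw between min_id and max_id instead of
# sampling, so the resulting distribution is exact.
counts = Counter(pick_next_valid(r, valid_ids) for r in range(1, 11))
for article_id, hits in sorted(counts.items()):
    print(article_id, hits / 10)
```

In this toy case the article sitting right after the gap (ID 10) absorbs every draw that lands in the gap, so it is picked 7 times out of 10 while each of the others is picked once.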

Can't you just query the ID column and grab the entire list of valid IDs, put that into an array, store it, and pick random IDs from that?
That requires setting up an entirely different service and somehow keeping it in perfect sync with the database, along with all of the memory it requires.

And you've still got to decide how you're going to pick random IDs from an array of tens of millions of elements that are constantly having elements deleted from the middle. Once you've figured out how to do that efficiently, you might as well skip all the trouble and just use that algorithm on the database itself.
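
For reference, one well-known in-memory answer to the "deleted from the middle" part is swap-and-pop: move the last element into the hole so the array stays dense. The sketch below is only an illustration under that assumption; `RandomIdPool` and its method names are made up here, and it does nothing about the syncing or memory costs mentioned above.

```python
import random

class RandomIdPool:
    """Dense array of valid IDs with O(1) add, remove, and uniform pick."""

    def __init__(self, ids):
        self.ids = list(ids)
        # Map each ID to its current position in self.ids.
        self.pos = {id_: i for i, id_ in enumerate(self.ids)}

    def add(self, id_):
        if id_ not in self.pos:
            self.pos[id_] = len(self.ids)
            self.ids.append(id_)

    def remove(self, id_):
        # Raises KeyError if id_ is not in the pool.
        i = self.pos.pop(id_)
        last = self.ids.pop()
        if i < len(self.ids):
            # Fill the hole with the former last element.
            self.ids[i] = last
            self.pos[last] = i

    def pick(self, rng=random):
        # Uniform over the remaining valid IDs; order does not matter.
        return rng.choice(self.ids)

pool = RandomIdPool([1, 2, 3, 10])
pool.remove(2)      # a deletion from the middle
print(pool.pick())  # one of 1, 3, 10, each equally likely
```

The array order gets scrambled by removals, which is fine for random selection but is also why this does not map directly onto an index ordered by ID in a database.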

>And you've still got to decide how you're going to pick random IDs from an array of tens of millions of elements that are constantly having elements deleted from the middle. Once you've

How so? You have an array of only valid IDs, e.g. [1,2,3].

Oh I thought this was a one-time thing, like a research paper. Doing it continuously in real time is much harder.