I'm referring to "algorithmic" in its negative usage as short-hand for what you can see as of late with regard to Twitter, Youtube, and Facebook's suggestion engines where more complex sorting and filtering is happening than a simple ORDER_BY.
Sure, not the best short-hand, but it seems most people understood what I meant.
Only if you subscribe to everything - if you keep your follower list small, the individual experience doesn't change regardless of the overall size of the platform.
I'm not an PG user but there does not appear to be an index on tweet.time - and even if there was wouldn't it better to do ORDER BY tweet.id DESC? I assume ordering by primary key is going to be the fastest, and would (ideally) prevent maintaining an index on tweet.time.
Sure, but you're making some assumptions in doing so. For example, that the time and id columns will remain constant. It might be a decent assumption in the beginning, but once you start doing updates on the table, all bets are off.
Yup, as soon as you need to change ID schemes, you risk breaking query logic. Sure things like v1 UUIDs and Snowflakes (and ULIDs and so forth) try to maintain temporal ordering in their ID formats, but what if you need v4 UUIDs for better clustering in some hash-table or SHA-256 hashes for some sort of content addressing scheme?
Also, it's just a very premature optimization in a world where CLUSTERED INDEXes exist. The database engine doesn't have to cluster by primary key, it can cluster directly by time (or any other index) if you ask it to. The power of doing it that way is that you can flexibly change it based on real performance issues (how do your execution plans look?) and characteristics (are you read heavy or write heavy? which ones are your bottle necks?), whereas if your application makes assumptions about ID format it's a lot harder to on-the-fly tune queries that all need to be rewritten.
I am not sure IDs are exactly incremental in Twitter at this point. I think they are issued by blocks or something of that nature. Someone here could answer that far better.
Twitter calls them "snowflakes". It's a 64-bit ID of that consists of a tuple of sorts of a timestamp, machine code, sequence ID. They start with timestamps to mostly give them monoticity in the direction of time (ie, sort order them), but they definitely aren't simple sequence IDs.
With the query given, the optimiser can immediately figure out to get the latest record from the index and scan back through the index. If the index has included columns, it could scan the data straight off the index. Without the index you need to scan the entire table, sort it in memory, and then read off the top columns. If you were doing a top X query it would be more markedly faster by fetching less data from disk. But I think that query is getting all the records, but still it will be quicker by avoiding the in memory sort.
interesting, thanks for the insight. I haven't touched DB setup in many years, and even then was novice. The best person I knew told me to index if a column wasn't unique, but also wasn't something with only two or three choices. Sounds like I have more reading to do...
I'm certainly not an expert - I did a great DB course about 15 years ago and then used the skills every now and then since. I might not be up on the latest. And I am more of a SQL Server person. BUT... the main thing I learn is view actual execution plans, and see what is actually happening before adding indexes (unless it's an 'obvious' index).
> The best person I knew told me to index if a column wasn't unique, but also wasn't something with only two or three choices.
Yeah I think this is too broad advice, and you need to understand what you want to achieve. Mentally, choosing indexes is like choosing whether to use a hashtable vs. for loop vs. binary tree etc in an algorithm in code. There is not golden rule or "always use a hashtable if there are 100 entries" type thing. You just need to figure it out on a case by case basis. And usually there are only 2-3 tables in your DB worth a lot of effort in figuring it out!
So, in fairness I didn't dig through the code to see the column, but assumed some granularity like milliseconds? It was a genuine question on whether it would speed up the query at all if indexed since it is generally grabbing everything then sorting.
There are good faith arguments for applying smarter sorting / filtering to the stream.
For example, if someone follows two people and one person simply tweets at 1000x the rate of the other, I would argue that it would be better UX if the rare tweet from the other person gets more weight.
Another thing Twitter does is show you popular things that have been liked by the people you follow. It might be way too much to show you every thing that every person you follow has liked every time. But it can introduce some interesting content to your stream to see such second-tier content.
I don't see how promoted content plays into it since you can interpolate it anywhere.
If only they weren't inane. I think Twitter is probably the worst in terms of 'promoted content' because it always seems to serve up the same ad several days in a row and it's always something surprisingly off-topic. I'd be browsing my feed full of political news, movie trailers, and animal photos... and get an ad for ketchup. How on Earth could I confuse that for a real tweet or something my friends would retweet? A ketchup ad? Mental.
This doesn't make much sense to me. You don't need complicated algorithmic sorting and filtering to inject an ad space after every fifth tweet. Even the client can do that.
You see many more outlier tweets with thousands of likes when your feed is interspersed with "most liked by follows and follows of follows", as opposed to just "tweeted by follows". I think that improves the signal to noise ratio of the feed.
But yeah, I wish a simple timeline was the default, and twitter stopped switching away from it.
For anyone interested in this, google “hacker news gravity algorithm” to see an interesting example of how sorting was done here at one time (as I understand it the algorithm has evolved).
I was working on creating an rss reader with a reddit-like interface that allowed follows, voting, comments, etc. The computational expense of a fancy sort algorithm based on factors other than just one non-computed column is immense once you get to a decent number of users. I ended up removing the algo just to lower my hosting bills. Of course, there were other potential solutions, but for a side project, this was easiest.