Hacker News new | ask | show | jobs
by jamroom 2399 days ago
I'm not an PG user but there does not appear to be an index on tweet.time - and even if there was wouldn't it better to do ORDER BY tweet.id DESC? I assume ordering by primary key is going to be the fastest, and would (ideally) prevent maintaining an index on tweet.time.
3 comments

Sure, but you're making some assumptions in doing so. For example, that the time and id columns will remain constant. It might be a decent assumption in the beginning, but once you start doing updates on the table, all bets are off.
Yup, as soon as you need to change ID schemes, you risk breaking query logic. Sure things like v1 UUIDs and Snowflakes (and ULIDs and so forth) try to maintain temporal ordering in their ID formats, but what if you need v4 UUIDs for better clustering in some hash-table or SHA-256 hashes for some sort of content addressing scheme?

Also, it's just a very premature optimization in a world where CLUSTERED INDEXes exist. The database engine doesn't have to cluster by primary key, it can cluster directly by time (or any other index) if you ask it to. The power of doing it that way is that you can flexibly change it based on real performance issues (how do your execution plans look?) and characteristics (are you read heavy or write heavy? which ones are your bottle necks?), whereas if your application makes assumptions about ID format it's a lot harder to on-the-fly tune queries that all need to be rewritten.

I am not sure IDs are exactly incremental in Twitter at this point. I think they are issued by blocks or something of that nature. Someone here could answer that far better.
Twitter calls them "snowflakes". It's a 64-bit ID of that consists of a tuple of sorts of a timestamp, machine code, sequence ID. They start with timestamps to mostly give them monoticity in the direction of time (ie, sort order them), but they definitely aren't simple sequence IDs.
Assuming that every tweet.time is unique, what benefit would you gain from indexing it?
With the query given, the optimiser can immediately figure out to get the latest record from the index and scan back through the index. If the index has included columns, it could scan the data straight off the index. Without the index you need to scan the entire table, sort it in memory, and then read off the top columns. If you were doing a top X query it would be more markedly faster by fetching less data from disk. But I think that query is getting all the records, but still it will be quicker by avoiding the in memory sort.
interesting, thanks for the insight. I haven't touched DB setup in many years, and even then was novice. The best person I knew told me to index if a column wasn't unique, but also wasn't something with only two or three choices. Sounds like I have more reading to do...
I'm certainly not an expert - I did a great DB course about 15 years ago and then used the skills every now and then since. I might not be up on the latest. And I am more of a SQL Server person. BUT... the main thing I learn is view actual execution plans, and see what is actually happening before adding indexes (unless it's an 'obvious' index).

> The best person I knew told me to index if a column wasn't unique, but also wasn't something with only two or three choices.

Yeah I think this is too broad advice, and you need to understand what you want to achieve. Mentally, choosing indexes is like choosing whether to use a hashtable vs. for loop vs. binary tree etc in an algorithm in code. There is not golden rule or "always use a hashtable if there are 100 entries" type thing. You just need to figure it out on a case by case basis. And usually there are only 2-3 tables in your DB worth a lot of effort in figuring it out!

Why would it be unique?
So, in fairness I didn't dig through the code to see the column, but assumed some granularity like milliseconds? It was a genuine question on whether it would speed up the query at all if indexed since it is generally grabbing everything then sorting.
Indexes generally give you sorting for free so you don't have to look at everything.
In general if it is something you will query in a where clause it should be indexed.