Hacker News new | ask | show | jobs
by dicroce 913 days ago
I always think discussions like this should start with the following: A database with no indexes is slow. Finding a particular row will require a linear search. Adding an index on a column means that your optimizing finding rows by the values of that column. Hence, an index is really a mapping of a particular column's value to the position in the db OF that row (very likely an 8 byte sized integer that is the offset into the file of the row in question).

This all means we can implement indexes as b-trees where the keys are the values of a particular column and the value is the file offset of the row with that value. You could envision a simple db format where indexes and the main row file are stored in separate files. In such a database you could drop an index simply by deleting the indexes file (or add one by creating it). The main row file actually has all of the data and so indexes can be recreated if necessary (at expense of course).

4 comments

A database with no indexes is slow. Finding a particular row will require a linear search.

The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern. "Index = fast" is a painfully pernicious untrue meme. It's absolutely true for application tables with queries only touching a few rows. On the other hand, analytics queries touching a high proportion of rows with joins on equality conditions (ie. hash joinable) isn't going to go any faster with an index.

I've seen devs shotgun indexes at tables to fix performance (done it myself too) but the real test of index understanding is when that doesn't work.

> On the other hand, analytics queries touching a high proportion of rows with joins on equality conditions (ie. hash joinable) isn't going to go any faster with an index.

That's when you bring a BRIN to the table. :-)

Or change the layout entirely to clustered columnstore from row store.

All databases are datastructures on disk and memory optimized for specific access and write patterns.

> "Index = fast" is a painfully pernicious untrue meme.

I believe that lack of internalization of that meme (regardless of how true it is) can be a cause of real trouble.

I was working in a team where Java devs simply didn't bother to put indexes on tables because they were small (like 100 rows or so). When I (JS dev) pestered them long enough to finally do it suddenly the whole app got super snappy and they were very thankful as it happened just as degrading performance was causing a lot of gloomy mood.

If a table has 100 rows and those rows aren't huuuuge, there is no reason to put in index. The rows probably fit into one page or so anyways and are all read together anyways.

Why are you mentioning that it were Java devs ad that you are a JS (javascript?) dev? Does that give you any kind of... expertise in the matter?

In theory, with pages, caching and all, yes. In practice it made collosal difference.

With correct indexes the queries were able to be found only with index lookup without touching data at all. It was many times faster than sequential scan of even this few rows.

I'm mentioning our respective roles to show that people with nominally no expertise make such decisions often in practice and such "memes", even tough they are not strictly true in all cases, may bring a lot of value in practical setting.

"No inedex = slow" is a good heuristic for nearly all devs.

> With correct indexes the queries were able to be found only with index lookup without touching data at all

That doesn't make things faster, unless the rows contain so much data, that they force many page reads. Otherwise, reading the whole index or reading all rows takes the same time.

I think there is some information that you are not providing that might make a difference - or it is actually a misunderstanding on your side and if the queries really got faster, it's because of something else, such as a join.

If it didn't matter noone would ever use binary search for the amount of data that fits single page. Linear search would always be preferred as it is simpler.

The only thing I'm not providing are the exact sizes of the tables involved other than that they were considered small. Some having 100 rows but I can't rule out that some were 1000. I didn't say each of those tables fit a single page which I agree could be significant. I am completely sure that the difference was caused by adding indexes to small tables, nothing else. The improvement was achieved in a single day by a sigle person adding indexes where there were none. He observed improvement as he went from table to table. I was specifically thanked by him for pestering him about it. He was informal tech lead for this project. The application was fairly large at that time and performance issues were everywhere. What was also used everywhere were those small tables containing stuff like list of countries, languages, currencies, types of operations and such.

> isn’t going to go any faster with an index

Depends on the data, and the index. A covering index would almost certainly be faster. But you said analytics, which implies large results from huge datasets, so it’s unlikely to fit into memory in the first place.

> The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern

We have a winner. But when looking at SQL tables/Views/Stored Procedures, the data is also stored in order in memory, in effect have a master database and files on disk, with sorted databases and files in memory for faster access.

> A database with no indexes is slow.

No it’s not… if all you do is write to it. In fact, it’s the fastest possible database for such case.

Indexes are pure redundancy - they contain the data already in the base table which must be maintained during writes.

But they can make reads so much faster, if the data access pattern can utilize them. The key is to identify access patterns which justify the price of the index.

In some cases they can make things worse, it's worth remembering that query optimizer looks not only on indexes, but also on statistics and estimated operation costs. If your statistics are out of date and your criteria are not specific enough (e.g. they match 80% rows), then an index is going to slow the query down. It needs to traverse the index to get the row IDs, fetch all the blocks containing them, read those, filter out irrelevant rows. It's probably going to be faster with a pure full table scan (due to linear reads).
If you only need to write and never read you don't need database.

> /dev/null will suffice.

And if you ever need to read anything even once database without indexes is slow

I can write to one database, replicate it (for example, by log shipping), and add an index only on the replica. This is not just being pedantic, this is a real-world pattern for some analytics solutions. You have a very high number of writes, and then build a reporting database at the end of every day, with all reads going to that database.
That is a very valid scenario. But when you do those things you already know costs and benefits of the indexes so you are not going to be harmed by "no indexes = database slow" heuristc.

"database = fast" is way worse heuristic to believe in for the people that need heuristics to move on with what they are doing.

> And if you ever need to read anything even once database without indexes is slow

This may be true in most cases, but not all. It just depends on your access pattern. If all you do are full table scans (possibly feeding to hash-joins), you won't benefit from indexes at all!

The whole point is that indexes do not auto-magically improve performance with no downsides. If they did, we could just index all columns and call it a day!

> If all you do are full table scans (possibly feeding to hash-joins), you won't benefit from indexes at all!

In most cases it means that you shouldn't be doing what youa re trying to do. Or best case, that maybe database is not the right tool for your job.

True, but it's still relevant in the early stages of iterating on a project. I'm familiar with one team that started investing in database-level optimization months before even beginning to deprecate their `loadAllRowsFromTheLargestTable` RPC.
To be pedantic: writes may also make use of indices, if you have constraints (like foreign keys) the db needs to check for every write.
True, but FK (in child table) must reference a key (in parent), and most databases won't let you create a key without the underlying index.

The other direction, however, is not a given: most DBs will let you create a FK on fields not covered by an index, so deleting or modifying a parent can benefit if you create such index explicitly, because it can check for the existence of children much faster (and avoid potentially locking the entire table). Again, the access pattern governs what indexes are needed: if you never delete/modify parent, you may not need an index on FK (unless you also have some queries which can use it, of course).

I've implemented a high performance btree this way in the past, where each table and each index were separate files (with append-only writes for concurrency). It worked pretty well and wasn't hard to get right, but it had some downsides (in particular, the kernel seemed to struggle with all the paging.)
> a high performance btree

then …

> the kernel seemed to struggle …

What was the struggle? If it’s performance doesn’t that contradict your earlier statement?

Genuinely interested in what the issue was, not trying to be a pedant

> What was the struggle?

Performance is always great until you have to hit disk. Not uncommon to rely on mmap at which point your disk access is sub-optimal vs. a hand-tailored buffer manager with strategies to improve disk reads.

This requires the obligatory https://db.cs.cmu.edu/mmap-cidr2022/
The linux kernel lets you trigger asynchronous writes of the pages as well as synchronous barriers to ensure they've been written.

You don't need to use direct I/O to have fine control.

the purpose of a btree is to optimize when you are hitting the disk, you can't call that the struggle, that's when the btree sings (tho you could consider extensible hashing)
My throughput was significantly higher than sqlite (4x or so, if memory serves), but the kernel spent so much time swapping pages that the mouse cursor stuttered.

A custom page manager would have probably done the trick, but I don't have the technical chops to write one.

You could envision a simple db format where indexes and the main row file are stored in separate files

They already envisioned it in DBF and CDX.

This is how systems handled it before relational was widely adopted, for example the IBM System 36.