| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Scaevolus 3143 days ago
	For context, SQLite4 explored reimplementing SQLite using a key-value store on log-structured merge trees, like RocksDB and Cassandra. I'd be interested to hear why they stopped. Presumably reimplementing SQL on a KV store was seen as not worth it, when applications that are satisfied with an embedded KV store backend (which is much faster and simpler to write!) already have many options.

6 comments

makmanalp 3143 days ago

I've had the chance to hear Richard Hipp talk about SQLite yesterday! He mentioned that the LSM tree storage engine is available as an extension to sqlite3. More specifically, he mentioned that he didn't really get the performance improvements he had hoped for, for insertion-heavy use cases.

I think part of this is because of a fundamental limitation of sqlite that it's an embedded database that has to persist data on disk at all times: The design of LSM trees works well with databases with a resident in-memory component because it's an approximation of just dumping every new thing you see at the end of an unordered in-memory array. This is as opposed to a data structure like a b-tree where you have to /find/ exactly where to put the data first, and then put it there. This finding bit means you're doing a lot of random access in memory, which is thrashing all of your caches (CPU / disk etc). LSM trees avoid this thrashing by just dumping stuff at the end of an array. However this means you have to scan that array to do lookups (as opposed to something easier like binary search). Then as your array gets big, you merge and flush it down to a lower "layer" of the lsm tree which is slightly bigger and sorted. And when that one fills, you flush further. And these merge-flushes are nice big sequential writes so that's nice too.

Anyway, with SQLite, the highest layer of your LSM tree would probably (this is conjecture) have to be on disk because of the way that there is no server component, versus in an in-memory system it'd probably be in your L2/L3 cache or at least your main memory. So this could be one reason why that model didn't work out as well for them.

coleifer 3143 days ago

I'm jealous you got to hear Dr. Hipp, that sounds cool. Would love to hear more about the circumstances :)

Regarding the LSM engine, you can find all the relevant implementation details here: https://sqlite.org/src4/doc/trunk/www/lsm.wiki#summary

> The in-memory tree is an append-only red-black tree structure used to stage user data that has not yet flushed into the database file by the system. Under normal circumstances, the in-memory tree is not allowed to grow very large.

8ig8 3143 days ago

I stumbled on this video presentation by Dr Hipp recently (may have been from another HN comment). I really enjoyed it. Probably because he seems so enthusiastic and passionate.

SQLite: The Database at the Edge of the Network with Dr. Richard Hipp

https://m.youtube.com/watch?v=Jib2AmRb_rk

makmanalp 3143 days ago

I got lucky! It was a class event, and he's known to speak at databases / data systems classes. He's a very fun (and opinionated!) speaker.

> The in-memory tree is an append-only red-black tree structure used to stage user data that has not yet flushed into the database file by the system.

Hmm, ok, so this contradicts my assumption. Actually, now that I think about it, other LSMs like rocksdb / leveldb work like this too (library-like model with some in-memory component when you "open" the database).

Anyway, without diving into the details of the code, there's other technical decisions that would affect this stuff.

One thing is how big your in-memory structure is (in relation to available memory and insertion workload) and how often you flush to disk is a key thing. Another thing is what your LSM tree looks like - aside from the data structure, how many tiers/levels you have is a big thing. I assume some of these are configurable parameters. E.g. rocksdb has an enormous set of parameters that handles this stuff. It's also annoying to tune.

I found this benchmark here that is illustrative: https://sqlite.org/src4/doc/trunk/www/lsmperf.wiki

The first graph is underwhelming, but when you adjust the buffers (look at the last graph) ~250k writes / second constant regardless of database size (this is why you want an LSM tree) is darned good! And this is on a spinny drive, not an SSD. Their "large" buffer sizes aren't even that large IMHO.

So maybe his mention that the LSM storage was underwhelming was overblown :-) I don't know.

Another difference is with other LSM-based systems that aren't just key-value, it's usually in the context of column stores: you keep a separate LSM for each column family (could be 1-n columns). But I can't think off the top of my head how this would cause a difference. Perhaps in how reads happen - the query engines work quite differently.

Anyway, my talk is cheap, I'm just guessing here, actually doing the analysis is the hard work :-) Also, I'm something of an amateur currently, so take my words with a grain of salt. Anyone else have any ideas re: this?

rurban 3143 days ago

This is the other perf graph: https://sqlite.org/src4/wiki?name=LsmPerformance

rattray 3143 days ago

What would be performance for SQLite3 in comparable scenarios? I don't see anything comparative on that benchmarks page.

makmanalp 3143 days ago

I've gotten it to do around 40-50k inserts / sec but that's a different scenario - nfs drive, different table and indexes, different queries, different configuration, etc etc. Also I didn't know if he meant that the inserts were disappointing or the overall results were (e.g. a suite of tests including reads / writes of all sorts).

X86BSD 3143 days ago

I just want to hijack this thread to say hipps other creation Fossil SCM is a great SCM. Better than git imo. Everyone should check it out.

petre 3143 days ago

We've been using it for the last 3/4 years. It's great and really user friendly, with integrated help and a web interface - everything in a single binary. The trouble with it is that it gets slower once you have a lot of history and many files. I don't know if you can use it for huge things like the Linux kernel or the FreeBSD ports tree. I once tried to import the ports tree into fossil and gave up after 2G and an hour. It will import anything that can do git fast-export. Now it also imports svn dumps as well. Fossil is a very good replacement for SVN. You can set up a central repo where everyone syncs on commit and update.

RainaRelanah 3143 days ago

To be fair, Fossil's intended use case is the exact opposite of the Linux kernel. See 3.3 and 3.4 of the fossil vs git page[0].

[0] https://www.fossil-scm.org/xfer/doc/trunk/www/fossil-v-git.w...

interfixus 3143 days ago

Why does everybody persist in calling the great man Dr.?

He refers to himself as D. Richard Hipp.

ptrott2017 3143 days ago

D is an initial but Dr. D. Richard Hipp has a PhD from Duke - graduated in 1992 (his thesis can be found here - its worth a read: http://bit.ly/2ygiDWx ) hence the Dr.

interfixus 3143 days ago

Thank you. Always thought it was a weird comprehension error from the slightly unusual character-combo.

The weird error may have been mine.

makmanalp 3143 days ago

Ooops!

thesz 3137 days ago

I did something like that - LSM backend for OO-relational DBMS (inhouse).

I also hoped for big win on the insertion-heavy loads, and I also haven't succeed in that. The problem is that every insert statement must read back something from DB to verify DB state against schema for correctness. As reads in LSM are slower, the net win is either absent or negligible. I have to say I wrote "must" in sentence above because you sometimes can get away without reading back, but not always. In the end, worst case scenario is always "read and write", not just "write".

But!

I devised a scheme to lay out layers' data so that they are as contiguous as they can be. Or get a very good approximation to that contiguousness, basically (O(1) "pages" per level). Thus contiguous reads got very high performance and beat old storage on read scheme, despite the need of level merging, etc.

katastic 3143 days ago

Perhaps I'm misunderstanding you, but sqlite supports ramdisks/"In-memory databases"

https://www.sqlite.org/inmemorydb.html

Perhaps the key (ha!, pun) is that you're talking about using RAM _and_ disk with the RAM being for caching/fast access that eventually hits the disk. Whereas, I think, in this case sqlite is either on the disk, or in RAM. There is no multiple tiers.

Correct me if I'm off here. Thanks.

makmanalp 3143 days ago

> Perhaps the key (ha!, pun) is that you're talking about using RAM _and_ disk with the RAM being for caching/fast access that eventually hits the disk.

This is closer to what I mean but not quite. Specifically, in-memory (or main-memory) is a technical term that talks about a specific type of database that focuses on doing operations /mostly/ in memory, with some spillover to disk as necessary, but only as an edge case. You usually then handle persistence with a sequential transaction log, perhaps in NVDIMM storage if you have some. This is in contrast to other systems where the memory is a buffer, but the "actual stuff" happens on disk. There is of course many hybrid schemes - it's really more of a spectrum. Some examples: SAP HANA, ArangoDB, MapD, MemSQL, MS SQL Server Hekaton.

There's also a lot of techniques that come with this style of thinking - often these are column stores, and often since these databases don't really hit disk, the bottleneck is moving data in and out of memory to the CPU caches, and sometimes just CPU speed. Often these use lightweight compression and SIMD instructions to tackle those problems.

So SQLite's "in-memory" database scheme doesn't quite count as something like this, it's more of a disposable database. But that's ok - it's not bad, it's just a different thing.

> Whereas, I think, in this case sqlite is either on the disk, or in RAM. There is no multiple tiers.

See the thread above - I kind of assumed this was the case, but like you mentioned, it looks like with the LSM implementation it does kind of sort of have a memory component - it lasts until you close the database "connection", and gets flushed often.

infogulch 3143 days ago

Actually in WAL mode with `PRAGMA synchronous = NORMAL`, it only flushes to disk periodically, not after every transaction.

robotbikes 3143 days ago

According to a presentation Richard Hipp did at Ohio State, they just didn't get the same level of performance from a log-structured merge vs. the b-tree in SQLite 3. If I remember correctly, while log structured merges where able to accumulate in memory actual to consolidate writes there was still a number of reads required to support joins etc. So they just determined based upon their testing that the b-tree based code was just faster. I don't recall the exact benchmarks, it is probably on the web somewhere.

sliverstorm 3143 days ago

Everything I hear about SQLite3 always suggests that, essentially, it is considered "done". It does what it is supposed to do with great performance. There is nothing major left to do. If it doesn't meet your needs, pick a different SQL database.

Which, while a totally alien concept in the modern software world, is actually a pretty cool thought.

(I'm sure under the hood bugs are getting fixed and all)

Someone 3143 days ago

Done? Almost every .0 version I checked in https://www.sqlite.org/chronology.html saw some new features.

ianamartin 3143 days ago

Well, obviously they should rewrite it in Rust.

dotancohen 3143 days ago

Actually having an alternative implemention would be terrific.

Look how Open Office development charged after the introduction of Go OO, or MySQL after MariaDB, or more generally the entire smartphone market after the introduction of the iPhone. Competition is good.

That said, Dr. Hipp is known to be passionate about SQLite. In my previous examples the software was pretty much neglected before the competition came, which is certainly not the case with SQLite.

greenhouse_gas 3143 days ago

From the canonical source (as in, [Richard Hipp](https://en.wikipedia.org/wiki/D._Richard_Hipp)):

Rewriting SQLite in Rust, or some other trendy “safe” language, would not help. In fact it might hurt.

Prof. Regehr did not find problems with SQLite. He found constructs in the SQLite source code which under a strict reading of the C standards have “undefined behaviour”, which means that the compiler can generate whatever machine code it wants without it being called a compiler bug. That’s an important finding. But as it happens, no modern compilers that we know of actually interpret any of the SQLite source code in an unexpected or harmful way. We know this, because we have tested the SQLite machine code – every single instruction – using many different compilers, on many different CPU architectures and operating systems and with many different compile-time options. So there is nothing wrong with the sqlite3.so or sqlite3.dylib or winsqlite3.dll library that is happily running on your computer. Those files contain no source code, and hence no UB.

The point of Prof. Regehr’s post (as I understand it) is the the C programming language as evolved to contain such byzantine rules that even experts find it difficult to write complex programs that do not contain UB.

The rules of rust are less byzantine (so far – give it time :-)) and so in theory it should be easier to write programs in rust that do not contain UB. That’s all well and good. But it does not relieve the programmer of the responsibility of testing the machine code to make sure it really does work as intended. The rust compiler contains bugs. (I don’t know what they are but I feel sure there must be some.) Some well-formed rust programs will generate machine code that behaves differently from what the programmer expected. In the case of rust we get to call these “compiler bugs” whereas in the C-language world such occurrences are more often labeled “undefined behavior”. But whatever you call it, the outcome is the same: the program does not work. And the only way to find these problems is to thoroughly test the actual machine code.

And that is where rust falls down. Because it is a newer language, it does not have (afaik) tools like gcov that are so helpful for doing machine-code testing. Nor are there multiple independently-developed rust compilers for diversity testing. Perhaps that situation will change as rust becomes more popular, but that is the situation for now.

https://blog.regehr.org/archives/1292#comment-18452

Santosh83 3143 days ago

Even if Rust is at parity with C in terms of tooling and ecosystem (which it probably will be in a few short years), SQLite is probably not high on the list of software to rewrite in Rust, given its fairly high quality. Let's rewrite the vulnerability-ridden ones first.

steveklabnik 3143 days ago

> it does not have (afaik) tools like gcov

You can just use C tooling with Rust, generally; I know kcov has worked for a long time, but it looks like easy gcov support does exist: https://github.com/kennytm/cov

rurban 3143 days ago

Rewriting in Rust would not make any sense, sure. But rewriting in a faster and safer language, like Pony or ATS might, esp. for lots of CPU's.

sametmax 3143 days ago

Although i get the joke, it would be a fun and interesting experiment wouldn't it ? Espacially since the sqlite test suite is so exhaustive.

malkia 3143 days ago

There is a version rewritten in C#, back in the day - not sure how up to date is:

https://www.infoq.com/news/2009/08/Porting-SQLite-to-.NET-in...

Annatar 3143 days ago

Because we’ve missed the Rust evangelism strike force so very much.

I honestly hope against hope that you’re being sarcastic.

coleifer 3143 days ago

Check out the changelog...they've added a ton of new stuff. Native json support, completely new full-text search extension, lsm key/value extension, performance improvements. I think they're looking at some changes to locking in the near future as well. Lots of stuff to find if you look.

Oxitendwe 3143 days ago

SQLite is one of my favorite pieces of software for this exact reason, that it is actually a more or less "finished" program instead of a mire of shifting design requirements and constant security updates, a comfortably static and unchanging object against the chaotic backdrop of modern software development.

manigandham 3143 days ago

There is also a project called UnQLite for UnQL (unstructured query language) designed to be an embedded document store, announced in 2011 [1]. Looks like it's now maintained by another company. [2]

1. https://www.infoq.com/news/2011/08/UnQL

2. https://unqlite.org/

tyingq 3143 days ago

Would have been a neat way to experiment around with putting a SQL front end on various KV interfaces. Redis or etcd, for example.

michaelmior 3143 days ago

CockroachDB has a good blog post[0] that describes how they implemented SQL. (CockroachDB is a key-value store.)

[0] https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mappin...

zaarn 3143 days ago

>(CockroachDB is a key-value store.)

Correction: CockroachDB is based on a KV store, it's a full SQL RDBMS on top of this KV store.

michaelmior 3143 days ago

Yes, that note was misleading. Thanks for clarifying :)

electrum 3143 days ago

Presto has connectors for various types of non-SQL databases including Redis: https://prestodb.io/docs/current/connector/redis.html

Presto is a distributed SQL query engine for big data, so basically the complete opposite of SQLite, though it often gets used in federation scenarios, as does SQLite.

An interesting anecdote is that the team working on what would become osquery (https://osquery.io/) asked if they could reuse the SQL parser from Presto. We get that question a lot, and after explaining that the parser is the easy part (semantic analysis and execution is the real work), I determined that what they really wanted were SQLite virtual tables: https://sqlite.org/vtab.html (and those worked out great for them)

nerfhammer 3143 days ago

Fairly straightforward to make a mysql storage engine for it, e.g. https://github.com/AALEKH/ReEngine

alphaalpha101 3143 days ago

Scaevolus, I recognise that name. #0x10c-dev on freenode? rmmh? skybot?

I created a logsql.py plugin for skybot, it just logs to the DB instead of to text files. Is that something you'd be interested in merging back in?

striking 3143 days ago

Developer of some optimization mods for Minecraft, one of which became OptiFine.