| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by rjn945 5216 days ago

I think this has a lot to do with it. After an hour of reading, watching and thinking, I can't come up with any way to put it into one paragraph.

Here's the shortest what and why I could come up with:

Questioning Assumptions

Many relational databases today operate based on assumptions that were true in the 1970s but are no longer true. Newer solutions such as key-value stores ("NoSQL") make unnecessary compromises in the ability to perform queries or make consistency guarantees. Datomic reconsiders the database in light of current computer set-ups: millions of times larger and faster disks and RAM, and distributed architectures connected over the internet.

Data Model

Instead of using table-based storage with explicit schemas, Datomic uses a simpler model wherein the database is made up of a large collection of "datoms" or facts. Each datom has 4 parts: an entity, an attribute, a value, and a time (denoted by the transaction number that added it the database). Example:

  John, :street, "23 Swift St.", T27

This simple data model has two main benefits. It makes your data less rigid and hence more agile and easier to change. Additionally, it makes it easy to handle data in non-traditional structures, such as hierarchies, sets or sparse tables. It also enables Datomic's time model...

Time

Like Clojure, Datomic incorporates an explicit model of time. All data is associated with a time and new data does not replace old data, but is added to it. Returning to our previous example, if John later changes his address, a new datom would be added to the database, e.g.

  John, :street, "17 Maple St.", T43

This mirrors the real world where the fact that John has moved does not erase the fact that John once lived on Swift St. This has multiple benefits: the ability to view the database at a point in time other than the present; no data is lost; the immutability of each datom allows for easy and pervasive caching.

Move Data and Data Processing to Peers

Traditionally databases use a client-server model where clients send queries and commands to a central database. This database holds all the data, performs all data processing, and manages the data storage and synchronization. Clients may only to access the data through the interface the server provides - typically SQL strings which may include a (relatively small) set of functions provided by the database.

Datomic breaks this system apart. The only centralized component is data storage. Peers access the data storage through a new distributed component called a transactor. Finally, the most important part, data processing, now happens in the clients, which, considering their importance, have been renamed "peers".

Queries are made in a declarative language called Datalog which is similar to but better than SQL. It's better because it more closely matches the model of the data itself (rather than thinking in terms of the implementation of tables in a database). Additionally, it's not restricted like SQL. It allows you to use your full programming language. You can write reusable rules that can then be composed in queries. Additionally, you can call any of your own functions. This is a big step up in power and it's made practical because of the distribution. If ran your query on central server, you'd have to be concerned about tying up a scare resource with a long-running query. When processing locally, that's not a concern.

When a query is performed that data is loaded from central storage and placed into RAM (if it will fit). Later queries can use this locally cached data for fast queries.

----

That's definitely not all it does or all the benefits, but hopefully that's a good start.

2 comments

programnature 5216 days ago

I would add the following

*Transactions as first-class entities

Transactions are just data like everything else, and can add facts about them like anything else. For example, who created the transaction. What did the database look like before and after transaction.

Additionally, you can subscribe to the queue of transactions, if you wanted to watch for and react to events of a certain nature. This very difficult in most other systems.

link

anamax 5216 days ago

> a time (denoted by the transaction number that added it the database).

Do transaction numbers have total order or just partial order? Total order is serializing. (And no, using real time as the transaction number doesn't help because it's impossible to keep an interesting number of servers time-synched.) Partial order is "interesting".

link

programnature 5216 days ago

It is totally ordered.

The transactor is a single point of failure.

However, since its only job is doing the transactions, the idea is it can be faster than a database server that does both the transactions and the queries.

link

snprbob86 5215 days ago

Hmm... presumably an application can act in read-only mode in the absence of a transactor. That's an interesting thought :-)

link

weavejester 5215 days ago

The transactor is only used for writes, so if the transactor went down, you could still run queries.

link

alkby 5214 days ago

I think their statement about ACID is too bold.

How does somebody do read-"modify" style of transactions ?

Say I want to bump some counter. So I delete old fact and I establish new fact. But new fact needs to be exactly 1 + old value of counter. With transactions as simple "add this and remove that" you seemingly cannot do that. So it's not ACID. Right?

link

richhickey 5214 days ago

Transactions are not limited to add/retract. There are also things we call data functions, which are arbitrary, user-written, expansion functions that are passed the current value of the db (within the transaction) and any arbitrary args (passed in the call), and that emit a list of adds/retracts and/or other data function calls. This result gets spliced in place of the data function call. This expansion continues until the resulting transaction is only asserts/retracts, then gets applied. With this, increments, CAS and much more are possible.

We are still finalizing the API for installing your own data functions. The :db.fn/retractEntity call in the tutorial is an example of a data function. (retractEntity is built-in).

This call:

    [:db.fn/retractEntity entity-id]

must find all the in- and out-bound attributes relating to that entity-id (and does so via a query) and emit retracts for them. You will be able to write data functions of similar power. Sorry for the confusion, more and better docs are coming.

link

danieljomphe 5214 days ago

From what I remember, compare-and-swap semantics are in place for that kind of case.

If that was not the case, you could still model such an order-dependent update as the fact that the counter has seen one more hit. Let the final query reduce that to the final count, and let the local cache implementation optimize that cost away for all but the first query, and then incrementally optimize the further queries when they are to see an increased count.

That said, I'm pretty sure I've seen the simpler CAS semantics support. (The CAS-successful update, if CAS is really supported, is still implemented as an "upsert", which means old counter values remain accessible if you query the past of the DB.)

link

danieljomphe 5213 days ago

Forget my last paragraph. Anyways, richhickey answered. :)

link

anamax 5215 days ago

> However, since [the transactor's] only job is doing the transactions

Huh? How is that consistent with:

> access the data storage through a new distributed component called a transactor.

If "doing the transactions" consists of more than passing out incrementing transaction tokens, won't the transactor be a bottleneck?

link

rjn945 5215 days ago

Yeah, it looks like I got that part wrong. (I intentionally skimmed over the transactor, because I was avoiding "how" issues and because my understanding of it wasn't that clear.)

The transactor is involved in just writes, not reads. (So that helps.) It's not distributed and cannot be distributed, in this system, because it ensures consistency, so yes, it is potentially a bottleneck. In blog comments by Rich Hickey[1], he states:

"Writes can’t be distributed, and that is one of the many tradeoffs that preclude the possibility of any universal data solution. The idea is that, by stripping out all the other work normally done by the server (queries, reads, locking, disk sync), many workloads will be supported by this configuration. We don’t target the highest write volumes, as those workloads require different tradeoffs."

Presumably, 1) the creators of Datomic think that performance can be good enough to be useful, 2) this is a new model that probably requires testing to prove is practical.

[1] Multiple people have linked to it, but for convenience: http://blog.fogus.me/2012/03/05/datomic/comment-page-1/#comm...

link

sausagefeet 5215 days ago

Isn't this the same compromise we would already have had to make if we just used postgres?

link

weavejester 5215 days ago

It's actually slightly better than a SQL database. If your master SQL database gets fried, there's a chance you could lose some data. Datomic's transactor only handles atomicity, not writes, so if the transactor dies, nothing written to the database will be lost.

link