This actually looks pretty interesting. I appreciate their FAQ has a great answer to "What is ArangoDB and for what kind of applications is it designed for?" -- more projects need to offer this kind of statement. https://www.arangodb.com/faq
I like this project and am keeping an eye on it, but tbh that answer doesn't really answer the question in a way that seems objective. It just says it's a "general purpose database offering all the features you typically need for modern web applications".
Does ArangoDB use the same storage strategy MongoDB does? From the FAQ:
"So how much RAM do you need? This depends on the size and structure of your data: Your application will access one or many collections (think of collections as denormalized tables for the time being). Once you open a collection the indexes for this collection are created in the RAM and the data is loaded into the RAM using memory-mapped files. If your collections are bigger than your RAM, the operation system will be forced to swap data in and out of the swap space."
I'm not an expert, but a lot of people seem to harp on MongoDB for this very reason. Does ArangoDB use the same strategy? If not, how is it similar/different?
In principle, ArangoDB behaves similarly to MongoDB here. Both are essentially "mostly-in-memory" databases in the sense that they hold the data in memory and persist it at the same time to disk via memory-mapped files. This approach is good for performance, and if you run out of RAM you ought to shard your data.
However, MongoDB often uses a lot of memory for the actual data, since its BSON binary format stores the names of the attributes with every single document. ArangoDB detects similar shapes of documents (see https://www.arangodb.com/faq#how-do-shapes-work-in-arangodb) and thus avoids this particular problem.
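To make the shape idea concrete, here is a minimal Python sketch. It is not ArangoDB's actual storage format; `store_with_shapes` and `load` are hypothetical names, and a "shape" is simplified to the sorted tuple of attribute names. The point is only that the names are stored once per distinct shape rather than once per document.

```python
def store_with_shapes(docs):
    """Store attribute names once per distinct shape (assumed, simplified model)."""
    shapes = {}  # tuple of attribute names -> shape id
    rows = []    # (shape id, values in shape order), no names repeated here
    for doc in docs:
        shape = tuple(sorted(doc))
        shape_id = shapes.setdefault(shape, len(shapes))
        rows.append((shape_id, [doc[name] for name in shape]))
    return shapes, rows


def load(shapes, rows):
    """Rebuild the original documents from the shape table and the value rows."""
    names_by_id = {shape_id: names for names, shape_id in shapes.items()}
    return [dict(zip(names_by_id[shape_id], values)) for shape_id, values in rows]
```

With many documents of the same structure, the shape table stays tiny while the rows hold only values; a format that repeats the names per document pays for them every time.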
I have been bitten by this using MongoDB as well. The shape recognition of ArangoDB sounds very useful. If this works well, it would alleviate a problem that NoSQL solutions so far have in comparison to classical relational databases.
Interesting article. An obvious reaction is to say: "In a document store, not all joins will be efficient in a sharding situation!". This is true, but certain queries involving joins backed by the right secondary indexes will indeed scale well; therefore one should not use this argument as a reason not to implement joins at all.
Say you have one collection for your people (sharded over 100 servers, say) and another one for conferences (also sharded over 100 machines). Then you could hold the primary keys of all conferences a person attended in a JSON list stored in an attribute with the user. A query finding all people with last name "Jones" that have attended a given conference can now be executed efficiently by using a secondary index on the last name of people and performing a key lookup in the conferences collection. The latter only has to talk to one shard, if the conferences collection is sharded by key and can thus be done efficiently. Obviously, one needs a query optimizer that is aware of the distribution of the shards and the shard keys, but this is certainly doable.
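The people/conferences query above can be sketched in a few lines of Python. Everything here is a toy model under stated assumptions: `ShardedByKey`, `People`, and `attendees_named` are hypothetical names, the conferences collection is hash-sharded on its primary key, and the people collection is kept unsharded for brevity (in a real cluster the last-name index lookup would be distributed as well).

```python
from collections import defaultdict
import zlib


class ShardedByKey:
    """Toy collection that is hash-sharded on its primary key (assumption)."""

    def __init__(self, num_shards=100):
        self.shards = [{} for _ in range(num_shards)]
        self.shards_contacted = 0  # counts how many shard lookups were made

    def _shard(self, key):
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def insert(self, key, doc):
        self.shards[self._shard(key)][key] = doc

    def get(self, key):
        self.shards_contacted += 1
        return self.shards[self._shard(key)].get(key)


class People:
    """Toy people collection with a secondary index on last_name."""

    def __init__(self):
        self.docs = {}
        self.by_last_name = defaultdict(list)

    def insert(self, key, doc):
        self.docs[key] = doc
        self.by_last_name[doc["last_name"]].append(key)


def attendees_named(people, conferences, last_name, conference_key):
    # Key lookup in the conferences collection: exactly one shard is contacted.
    conference = conferences.get(conference_key)
    if conference is None:
        return []
    matches = []
    # The secondary index narrows the candidates before the join condition is checked.
    for person_key in people.by_last_name.get(last_name, []):
        person = people.docs[person_key]
        if conference_key in person["conferences"]:
            matches.append((person["first_name"], conference["title"]))
    return matches
```

The join never scans the conferences collection; the list of conference keys stored with each person turns it into an index lookup plus a single-shard key lookup.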
just because the dataset is sharded doesn't mean that one query has to hit every shard. for example, suppose you're looking for documents with `parent_id = foo` and your sharding key is `parent_id`, then an intelligent query planner would only query one shard (the one that "foo" hashes to), and then this looks a lot like a join in an RDBMS. indeed, if you wanted to do (in RDBMS terms) a self-join to load the whole tree of documents rooted at parent_id = foo, and your sharding key were the root for each document, that query would only hit one shard with a. the trick is deciding which keys to shard on (and, in many cases, what other keys to shard on in redundant datastores that serve different types of queries).
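a rough sketch of that planner decision in Python, assuming hash sharding; `shard_for` and `shards_to_query` are made-up names, and a real planner would handle far more than a single equality predicate:

```python
import zlib


def shard_for(value, num_shards):
    """Map a shard-key value to a shard id with a stable hash (assumed scheme)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_shards


def shards_to_query(predicate_field, predicate_value, shard_key, num_shards):
    """Return the shard ids an equality query has to touch."""
    if predicate_field == shard_key:
        # Equality on the shard key: only the shard the value hashes to.
        return [shard_for(predicate_value, num_shards)]
    # Any other field: the planner cannot rule out any shard.
    return list(range(num_shards))
```

so `parent_id = foo` on a collection sharded by `parent_id` goes to one shard, while the same filter on a collection sharded by some other key fans out to all of them.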
Thanks for both your answers, this is really interesting indeed. I always thought that joins are a "no, no, no" in the NoSQL world. This opens up a whole lot of new possibilities. I will have to have a look at this ArangoDB thing...
Is there a rule of thumb for when you would model a connection as a foreign key and when you would model it as a graph? Or do you always use graphs?
Another good rule I tend to use is that if your queries will involve variable lengths of paths in your graph, it is probably a good idea to model using a graph. This is because another model would almost certainly need multiple joins, which can kill performance quite quickly.
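As an illustration of the variable-path-length case, here is a small Python sketch of a breadth-first traversal with a depth range. The function name `reachable` and the edge-list representation are assumptions for this example, not any database's API. In a foreign-key model, each extra hop would typically be one more join, and the number of joins would have to be known in advance.

```python
from collections import deque


def reachable(edges, start, min_depth, max_depth):
    """Return nodes whose shortest distance from start lies in [min_depth, max_depth]."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth >= min_depth:
            found.append(node)
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return found
```

The depth bounds are plain parameters here, so the same code serves "direct contacts" and "anyone within five hops" without changing the query structure.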
I think if you are connecting the same type of objects (e.g. users) you should use graphs. If you have a 1:n relation between different types, you could as well use foreign keys. For n:m you again need graphs.
If you have a 1:n relation that you might want to annotate with, for instance, a "type of relation", it is also feasible to use the graph model, as edges can carry attributes.
This is an argument one often hears. However, V8 is encapsulated quite well, since Chrome has the same issue.
Furthermore, these micro services can actually improve security: You can implement your own scheme for authentication and authorisation on the document level and deploy it to the database. Then, if your application has various clients for different devices, they are all authorized in the same way by the same code. This leads to a simplification in app development and thus to more security, because there are fewer places to get right and the whole approach is less error prone.
ArangoDB was only started in 2012 but many years of experience in developing special-purpose database solutions went into it. This is how the rapid evolution into a market-ready product was possible at all.
Foxx is designed as the extension framework for ArangoDB, so it does not really make sense to rip it out of the DB kernel. Furthermore, a lot of its advantages would vanish if it no longer had immediate and rapid access to the data.
It's hard to know for sure, of course, but according to the data we look at, some of these comments appear promotional rather than organic discussion.
That's not to say this isn't a great database, and we admire anyone who's undertaking a hard project. But there are proper and improper ways to get attention on HN. This one appeared to cross a line, hence my comment.