Hacker News
RC1 ArangoDB 3.4 – What’s new? (arangodb.com)
60 points by rubbercasing 2848 days ago
3 comments

I'm new to ArangoDB but this release looks impressive just by the sheer number of new features. Congrats!

I was wondering if this release allows FULLTEXT indexes when the backend is RocksDB (now that it is the default storage engine)? The new ArangoSearch features look cool, but honestly a bit daunting vs the simple setup of a FULLTEXT index.

By the way, the ArangoSearch tutorial casually talks about "ArangoDB views of type 'arangosearch'", but I haven't come across the concept of views before in the documentation. Are there other types of views?

Currently there are no other types of views. But they are planned and will follow.
Love each of the new releases. Would appreciate hearing any stories of performance implications with the 'Distributed COLLECT' improvements.
Thanks! We have a very brief description of the "distributed COLLECT" feature here: https://github.com/arangodb/arangodb/blob/3.4/Documentation/...

More detail will be added to it before the GA release.

The benefits of distributed COLLECT will come into play for queries that can push the aggregate operations onto the shards. Previous versions of ArangoDB shipped all documents from the database servers to the coordinator, so the coordinator would do the central aggregation of the results from all shards to produce the result.

With distributed COLLECT we now create an additional shard-local COLLECT operation that performs part of the aggregation on the shards already. This allows sending just the aggregated per-shard results to the coordinator, so the coordinator can finally perform an aggregation of the per-shard aggregates.

This will be beneficial in many cases when the per-shard aggregated result is much smaller than the non-aggregated per-shard result.

Following is a very simple example. Let's say you have a collection "test" with 5 shards and 500k simple documents that have just one numeric attribute (plus the three system attributes "_key", "_id" and "_rev"):

    db._create("test", { numberOfShards: 5 });
    for (let i = 0; i < 500000; ++i) {
      db.test.insert({ value: i });
    }

A query that calculates the minimum and maximum of the "value" attribute can make use of distributed COLLECT:

    FOR doc IN test
      COLLECT AGGREGATE min = MIN(doc.value), max = MAX(doc.value)
      RETURN { min, max }

The database servers can compute the per-shard minimum and maximum values, so they will each only send two numeric values back to the coordinator.

Without the optimization, the database servers will send either the entire documents or a projection of each document (containing just the document's "value" attribute) back to the coordinator. Even with the projection, each shard would still have to send 100k values on average.

With a local cluster that has 2 database servers and runs them on the same host as the coordinator, this simple query is sped up by a factor of 2 to 3 when the optimization is applied. In a "real" setup the speedup will be even higher, because then there will be additional network roundtrips between the cluster nodes. And in reality documents tend to contain more data and collections tend to have more documents, which increases the speedup further.
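
For anyone who wants to play with the idea outside a cluster, here is a minimal plain JavaScript sketch of the two-stage aggregation described above. It is only a simulation: the round-robin sharding and the function names `aggregateShard` and `aggregateOnCoordinator` are assumptions made up for illustration and are not ArangoDB APIs.

```javascript
// Simulate 5 shards holding 500k documents.
// Assumption: round-robin placement, purely to keep the sketch short.
const NUM_SHARDS = 5;
const shards = Array.from({ length: NUM_SHARDS }, () => []);
for (let i = 0; i < 500000; ++i) {
  shards[i % NUM_SHARDS].push({ value: i });
}

// Stage 1 (shard-local): reduce a shard's documents to one partial aggregate.
function aggregateShard(docs) {
  let min = Infinity;
  let max = -Infinity;
  for (const doc of docs) {
    if (doc.value < min) min = doc.value;
    if (doc.value > max) max = doc.value;
  }
  return { min, max };
}

// Stage 2 (coordinator): combine the per-shard partial aggregates.
function aggregateOnCoordinator(partials) {
  return partials.reduce(
    (acc, p) => ({
      min: Math.min(acc.min, p.min),
      max: Math.max(acc.max, p.max),
    }),
    { min: Infinity, max: -Infinity }
  );
}

const partials = shards.map(aggregateShard);
const result = aggregateOnCoordinator(partials);

console.log(partials.length * 2); // 10 numbers cross the "network" instead of 500k
console.log(result);              // { min: 0, max: 499999 }
```

The two-stage split works here because MIN and MAX can be computed from partial results. The sketch does not handle empty shards, which would need special treatment.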

Graph database aficionados, I have a question: What's the deal with most graph databases having a very limited number of types? It would be nice if we had more robust numeric types (integers and decimals rather than just doubles) and timestamps for example.
Maybe AgensGraph would work for you? It's based on PostgreSQL, so you can use any of the PostgreSQL types. Fairly new though.

https://www.postgresql.org/about/news/1848/

This is on the list of more than one teammate here at ArangoDB. But also not trivial to implement.
Can you elaborate why that is? Of all types I could come up with (beyond bool), integers and unsigned integers seem the most basic.
The simplest explanation is that ArangoDB uses JSON as its data format to the outside world. JSON doesn't support types like arbitrary-precision exact decimals or timestamps. Although ArangoDB uses VelocyPack internally, which is capable of much more than JSON, a user will import JSON and get JSON back.

You can of course use datetime (https://docs.arangodb.com/3.3/AQL/Functions/Date.html) and decimals with a precision of 10E38 in ArangoDB, but it is not as precise as in a relational database. If we wanted to be as precise as a relational DB, we would have to say goodbye to JSON.
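
To make the precision point concrete, here is a small plain JavaScript illustration. It shows how one common JSON consumer (JavaScript's own `JSON.parse`, which maps every number to a double) behaves. It says nothing about ArangoDB's internals.

```javascript
// JSON.parse turns every JSON number into an IEEE 754 double.
const big = JSON.parse('{"n": 9007199254740993}').n; // 2^53 + 1 in the text
console.log(big === 9007199254740992); // true: the value was rounded to 2^53

const parts = JSON.parse('{"a": 0.1, "b": 0.2}');
console.log(parts.a + parts.b === 0.3); // false: the sum is 0.30000000000000004

// Carrying the digits as a string keeps them exact, at the cost of needing
// a convention that tells "number strings" apart from ordinary strings.
const exact = BigInt(JSON.parse('{"n": "9007199254740993"}').n);
console.log(exact.toString()); // 9007199254740993
```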

MongoDB uses this to express decimals as JSON: { "$numberDecimal": "<number>" }
I suspect creating a specification based on what MongoDB does/did might be the better approach, but a quick search for "typed json" turned up:

https://www.tjson.org/

Not sure if I'm a fan of the syntax - but some kind of open, sane, standard would be nice.
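
As a rough sketch of how a client could consume the `{ "$numberDecimal": "<number>" }` convention mentioned above, here is a plain JavaScript reviver that turns tagged values into exact decimals backed by BigInt. Everything here (`parseDecimal`, `reviveTagged`, `addDecimals`, `formatDecimal`, and the `{ unscaled, scale }` shape) is hypothetical and made up for illustration. It is not MongoDB's or ArangoDB's API, and it ignores exponents, NaN, and Infinity.

```javascript
// Hypothetical exact-decimal holder: value = unscaled / 10^scale.
function parseDecimal(text) {
  const match = /^(-?)(\d+)(?:\.(\d+))?$/.exec(text);
  if (!match) throw new Error(`not a plain decimal: ${text}`);
  const [, sign, intPart, fracPart = ""] = match;
  return { unscaled: BigInt(sign + intPart + fracPart), scale: fracPart.length };
}

// JSON.parse reviver: replace { "$numberDecimal": "<number>" } wrappers.
function reviveTagged(key, value) {
  if (
    value !== null &&
    typeof value === "object" &&
    Object.keys(value).length === 1 &&
    typeof value.$numberDecimal === "string"
  ) {
    return parseDecimal(value.$numberDecimal);
  }
  return value;
}

// Exact addition: bring both operands to the larger scale, then add.
function addDecimals(a, b) {
  const scale = Math.max(a.scale, b.scale);
  const lift = (d) => d.unscaled * 10n ** BigInt(scale - d.scale);
  return { unscaled: lift(a) + lift(b), scale };
}

// Render the decimal back to text without going through a double.
function formatDecimal(d) {
  const negative = d.unscaled < 0n;
  const digits = (negative ? -d.unscaled : d.unscaled)
    .toString()
    .padStart(d.scale + 1, "0");
  const cut = digits.length - d.scale;
  const text = d.scale === 0 ? digits : `${digits.slice(0, cut)}.${digits.slice(cut)}`;
  return (negative ? "-" : "") + text;
}

const doc = JSON.parse(
  '{"a": {"$numberDecimal": "0.1"}, "b": {"$numberDecimal": "0.2"}}',
  reviveTagged
);
console.log(formatDecimal(addDecimals(doc.a, doc.b))); // 0.3
```

The wire format stays valid JSON, but every client has to know the tag, which is why an open standard would help.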

I can't speak for all graph stores, but in addition to ArangoDB being JSON-native as described elsewhere in this thread, many graph users describe the type of any value, not just floats or ints, as a relation or property attached to the value itself.

So you'd never have a value (an object, a key, or a string, int, float, or reference) without associated metadata typing it elsewhere in the graph, and you'd be unlikely to operate on that data without referring to those properties.

Neo4j has datetime types, spatial point type, int, float, etc.

See https://neo4j.com/docs/developer-manual/current/drivers/cyph...