| HN Mirror



	by spanktar 4135 days ago
	That's part of it, but not the whole picture. For example, we mostly bypass the ES query engine and go directly to Lucene. Queries are not simply translated to ES query syntax. Also, we've done a lot more work than simply pasting an SQL layer over the top. We've built streaming BLOB support, a distributed SQL layer with real-time MapReduce, and a distributed aggregation engine that gives accurate results for aggregations rather than HLL estimates. If you'd like, we're happy to answer any questions on IRC or in our Google Group: https://groups.google.com/forum/#!forum/crateio IRC: Freenode #crate, irc://irc.freenode.net/crate (@mention anyone with Voice)

2 comments

naiv 4135 days ago

do you have a field type that indexes in real time? or are you bound to the (default 1s) index delay from es?

this is one thing that bothers me with elasticsearch: I cannot define e.g. "type": "cart", "index": "realtime", "not-analyzed", so that if an item gets added to a cart, the subsequent count would directly return the correct number of items in the cart.


jodok 4135 days ago

not yet, but we have some "tweaks" for exactly your use-case on our backlog. using the client libraries should make it much easier (e.g. https://crate.io/docs/projects/crate-python/stable/sqlalchem...). so right now you would need to do a refresh. on a side note: it's not an index delay. it's the readers that "sit" on the lucene index. they are reused for performance reasons (and meanwhile other writes are appended). as with the client libraries, you can force reopening them (https://crate.io/docs/en/0.47.8/sql/reference/refresh.html), of course at the cost of performance.
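
to illustrate the reader behaviour described above, here is a minimal toy sketch in Python. it is not Crate or Lucene code: `ToyIndex`, its methods and the cart rows are hypothetical names made up for this example. it only models the idea that writes are appended while an already-open reader keeps serving its old snapshot until a refresh reopens it.

```python
class ToyIndex:
    """Toy append-only index with one cached reader (hypothetical model)."""

    def __init__(self):
        self._rows = []      # everything written so far
        self._visible = 0    # number of rows the currently open reader can see

    def insert(self, row):
        # Writes are appended; the open reader's snapshot is not touched.
        self._rows.append(row)

    def refresh(self):
        # Reopen the reader so it sees every row written up to now.
        self._visible = len(self._rows)

    def count(self, cart_id):
        # Counts only rows inside the reader's snapshot.
        snapshot = self._rows[:self._visible]
        return sum(1 for row in snapshot if row["cart"] == cart_id)


index = ToyIndex()
index.insert({"cart": 1, "item": "book"})
stale = index.count(1)   # reader was opened before the write
index.refresh()          # force the reader to reopen
fresh = index.count(1)   # the new row is now visible
```

in this model the count right after the insert misses the new item, and the count after the refresh includes it. that is the trade-off mentioned above: refreshing after every write gives read-your-own-write behaviour but gives up the benefit of reusing readers.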


ddorian43 4135 days ago

Why can't you aggregate on non-indexed fields? I know Lucene doesn't allow that, but why? It seems to work in a normal RDBMS.


jodok 4135 days ago

We run aggregations fully distributed, and when iterating over the values we rely heavily on the field caches. They hold the values of the most recently used fields in memory and therefore allow in-memory performance on them. For example, they don't grow linearly with the number of rows stored, but depend on the cardinality of the fields. Running aggregations over non-indexed data is not supported.
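
As a rough illustration of the distributed part, here is a small Python sketch of an exact grouped count computed per shard and then merged. It is an assumption-laden toy, not Crate's aggregation engine: `partial_counts`, `merge_counts` and the sample rows are hypothetical. It shows that the per-group state has one entry per distinct value, so its size follows the field's cardinality rather than the row count.

```python
from collections import Counter


def partial_counts(shard_rows, field):
    """Exact per-shard count of each distinct value of `field`."""
    return Counter(row[field] for row in shard_rows)


def merge_counts(partials):
    """Merge per-shard counters into one exact global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


# Two hypothetical shards holding rows of the same table.
shards = [
    [{"color": "red"}, {"color": "blue"}, {"color": "red"}],
    [{"color": "blue"}, {"color": "red"}],
]

partials = [partial_counts(rows, "color") for rows in shards]
result = merge_counts(partials)
```

The merged result is exact because each shard counts every row it holds. Nothing is estimated, unlike a sketch structure such as HLL.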
