by stickperson 3380 days ago
I really enjoy Discord's blog. Their Cassandra write-up was excellent as well. A couple of thoughts and questions:

- Having many clusters and assigning messages to a specific cluster seems like an interesting solution.

- I'm curious how they managed to lazily index messages.

- Since only message, channel and server ids are stored in ES, have there been any problems reindexing data after an index fails?


The first time you run a search in a server (or the first time you run one after the index fails), it triggers a full re-index of that server. Ctrl-F "Historical Index" in the blog post for more details! If you've never used search in a server, its messages are not indexed in real time until you search there for the first time. Both of these things make the system "lazy".

The worst case for an index failure is that the search query is delayed while the index rebuilds itself. We throttle the rate of historical indexing into ES to a safe level so that we don't degrade the performance of other components of the system.
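
To make the "lazy" behavior above concrete, here is a minimal in-memory sketch. It is not Discord's code: the class, the method names, and the `batch_size` throttle are all hypothetical, and plain dicts stand in for the message store and for ES. Unlike the real system, which delays the query while the index rebuilds, this toy version simply returns whatever has been indexed so far.

```python
from collections import deque


class LazySearchIndex:
    """Toy in-memory model of lazy indexing; all names here are hypothetical."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size  # assumed throttle: messages indexed per tick
        self.store = {}     # server_id -> every message (stands in for the message store)
        self.index = {}     # server_id -> indexed messages (stands in for ES)
        self.backfill = {}  # server_id -> history still waiting to be indexed

    def on_message(self, server_id, message):
        self.store.setdefault(server_id, []).append(message)
        # Real-time indexing only happens once the server has been searched.
        if server_id in self.index:
            self.index[server_id].append(message)

    def search(self, server_id, term):
        if server_id not in self.index:
            # First search: turn on real-time indexing and queue the history.
            self.index[server_id] = []
            self.backfill[server_id] = deque(self.store.get(server_id, []))
        return [m for m in self.index[server_id] if term in m]

    def backfill_tick(self, server_id):
        """Index at most batch_size historical messages; return how many."""
        pending = self.backfill.get(server_id, deque())
        count = min(self.batch_size, len(pending))
        for _ in range(count):
            self.index[server_id].append(pending.popleft())
        return count


idx = LazySearchIndex(batch_size=2)
for text in ["hello a", "hello b", "bye"]:
    idx.on_message("s1", text)          # stored, but not indexed yet

first = idx.search("s1", "hello")       # first search triggers the backfill
idx.on_message("s1", "hello c")         # now indexed in real time
while idx.backfill_tick("s1"):          # throttled historical indexing
    pass
final = idx.search("s1", "hello")
```

Dropping a server's entry from `index` in this sketch would model an index failure: the next search would queue the full history again.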

Oh, I think I get it now. Is it that the _initial_ indexing is lazy, but all indexing after that is done automatically by the historical index workers? Basically, when a user searches for something, do you check whether ES has anything for that user and, if it doesn't, kick off the initial indexing process, after which the workers do their thing?
The historical index workers index the history of the server, whereas we have a real-time index worker that is consuming and bulk inserting messages in real time. Searching for the first time in a server turns on real-time indexing and triggers a historical index of all previous messages in that server.
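
As a rough illustration of a real-time worker that consumes messages and bulk inserts them, the sketch below buffers messages and flushes once a count threshold is reached. The count-based trigger, the `BulkIndexer` name, and the `bulk_insert` callable are assumptions made for illustration; this thread does not say how the actual worker decides when to flush.

```python
class BulkIndexer:
    """Buffers messages and hands them to a bulk insert in batches (hypothetical)."""

    def __init__(self, bulk_insert, batch_size=3):
        self.bulk_insert = bulk_insert  # any callable that accepts a list of messages
        self.batch_size = batch_size    # assumed trigger: flush every N messages
        self.buffer = []

    def consume(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Swap the buffer out before inserting so each message is sent once.
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.bulk_insert(batch)


batches = []                            # records each bulk insert call
worker = BulkIndexer(batches.append, batch_size=3)
for message_id in range(7):
    worker.consume(message_id)
worker.flush()                          # push the leftover partial batch
```

A production worker would likely also flush on a timer so that a quiet channel does not leave messages sitting in the buffer; that part is omitted here.
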
Can you talk about the bulk insert for real-time messages? How does it get triggered: does it run every X minutes or every X messages? (Your code looks like it is every X messages.)

Are you using DB triggers to fire the job?

Got it. I was under the impression that all messages were lazily indexed. After reading the article again it's pretty clear that's not the case. Thanks for the clarification.
I agree with you, and I've got a question as well.

I'm wondering how long it takes to execute the ES refresh on a search query when the shard has been marked as dirty.

If search requests come in frequently, I suspect the refresh is really short, but if the shard ingests new messages for a while (let's say 50 minutes) and is marked as dirty, a search query would ask ES to refresh 50 minutes' worth of documents before running the actual query.

Has it shown to be a problem? Does the refresh time grow with the number of documents inserted since the last refresh?

Good question. So far we've noticed the refresh time to be negligible (worst case in the tens of milliseconds). It's worth noting that most of the cost of doing a search on Discord is in pulling the message context from Cassandra to provide enough data to render the results in the client.
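
For readers following this exchange, the dirty-shard idea can be sketched as below: writes mark the shard dirty, and a search refreshes only when that flag is set. The `Shard` class and its fields are hypothetical stand-ins, not the ES API or Discord's implementation.

```python
class Shard:
    """Toy shard: inserted documents become searchable only after a refresh."""

    def __init__(self):
        self.pending = []       # inserted but not yet searchable
        self.visible = []       # searchable documents
        self.dirty = False      # set when there are writes since the last refresh
        self.refresh_count = 0

    def bulk_insert(self, docs):
        self.pending.extend(docs)
        self.dirty = True

    def refresh(self):
        self.visible.extend(self.pending)
        self.pending.clear()
        self.dirty = False
        self.refresh_count += 1

    def search(self, term):
        # Pay the refresh cost only when something changed since the last one.
        if self.dirty:
            self.refresh()
        return [d for d in self.visible if term in d]


shard = Shard()
shard.bulk_insert(["deploy done", "lunch?"])
shard.bulk_insert(["deploy failed"])
hits = shard.search("deploy")    # dirty, so this search refreshes first
again = shard.search("deploy")   # clean, so no second refresh
```

Two inserts followed by two searches cost a single refresh here, which is the point of tracking the flag instead of refreshing on every write or every query.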
I'm impressed, thanks for answering.