by tmikaeld 1524 days ago
Interesting. I wonder what they used for the search database. That enormous amount of text can't fit in RAM, so it would have to be partitioned and sharded in something like ScyllaDB.

At one point, their search used LucidWorks Fusion [1], a commercial product built around Solr (which uses Lucene indexes under the hood) that also integrates a vector database and the like for semantic search. The linked wiki page still has Lucene-style queries.

[1]: https://lucidworks.com/customers/reddit/

That's kind of my point: Solr and Lucene put the index in memory.
Can't the index be shared across a cluster? Possibly with multiple replicas per shard, so that servers can easily go up and down?
It can, but that would be slow and the cost would be prohibitive.
Really? It's just a bit of text. RAM is fairly cheap these days.
Not the entire index, only what is needed for each query.
IIRC they have a pretty standard PostGres setup. I'd bet they just set up another PostGres shard, replicated just for search, using an extension for the index. That doesn't require the index or working set to be in RAM.
Why do some people write Postgres as PostGres?
Because it's the database that came after Ingres.
https://en.wikipedia.org/wiki/Ingres_(database)

https://en.wikipedia.org/wiki/PostgreSQL

Ingres and Postgres/PostgreSQL.

I still don't know /why/ some people say PostGres. I've seen it a few times, on HN and elsewhere, but it's wrong, isn't it?

Which isn't spelled InGres, is it?
Why do some people call Redis "read-is"? The world may never know...
I doubt that a succinct full text index (like an FM-index) of this data would require more than a modest server to keep in memory. Why aren't these used in this context?
But we're talking petabytes of text comments and an index of that would be a lot larger. How do you access that data fast enough to enable search?
Succinct full-text indexes can be substantially smaller than the source text. It depends on the zero-order entropy of the text. If things are highly repetitive, a very small index might be feasible. Usually lookup times are linearly proportional to query size, with logarithmic factors in database size.
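To make the "lookup time proportional to query size" point concrete, here is a toy sketch of FM-index backward search in Python. The function names (build_fm_index, count_occurrences) are made up for illustration. It sorts suffixes naively and stores the occurrence table uncompressed, so this version is not succinct at all; a real implementation would use compressed rank structures (e.g. wavelet trees), which is where the logarithmic factors come from. Only the query loop reflects the real technique.

```python
def build_fm_index(text):
    """Build a toy, uncompressed FM-index over `text` (illustrative only)."""
    text += "\0"  # sentinel that sorts before every other character
    # Naive suffix array: fine for a demo, far too slow for real corpora.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = [text[i - 1] for i in suffix_array]

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # first[c] = number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)

    return first, occ, len(text)


def count_occurrences(index, pattern):
    """Count matches of `pattern` with one table lookup pair per character."""
    first, occ, n = index
    lo, hi = 0, n  # current range of matching rows in the sorted suffixes
    for ch in reversed(pattern):  # backward search: last character first
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, count_occurrences(build_fm_index("abracadabra"), "abra") walks four characters regardless of how long the indexed text is.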
I've yet to see such a system (in production) except for Sonic, but Sonic doesn't allow full-text search, only search on a key-by-key basis.
You can rent many VMs with 2TB of RAM each from various cloud providers.

AWS also offers some high-memory machines with up to 24TB of RAM in some locations.

Sure, you can do that, but that's a hefty price per user.
Their job postings mention Solr:

> Experience with Solr or similar search technologies.

So I'm guessing that's most of it. Which makes sense given Full Text Search is the point of Solr/Lucene!

Inverted indexes do not require that much overhead.
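For readers who haven't seen one, a minimal sketch of an inverted index in Python follows. The names (build_inverted_index, search) and the whitespace tokenizer are assumptions for illustration; this is not how Lucene or Solr lay things out on disk, just the core idea of mapping each term to the documents that contain it.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Only the posting sets for the query terms are touched, not the whole index.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


docs = {
    1: "Solr uses Lucene indexes",
    2: "Lucene builds inverted indexes",
    3: "Postgres full text search",
}
index = build_inverted_index(docs)
matches = search(index, "lucene indexes")
```

Each term stores only document ids, so the overhead on top of the text is roughly one small entry per distinct term per document.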
You can't use ScyllaDB as a search engine.