	by tmikaeld 1570 days ago
	Interesting. I wonder what they used for the search database. That enormous amount of text can't fit in RAM, so it would have to be partitioned and sharded in something like ScyllaDB.

7 comments

lovelearning 1570 days ago

At one point, their search used LucidWorks Fusion [1], a commercial product built around Solr (which uses Lucene indexes under the hood) that also integrates a vector database and the like for semantic search. The linked wiki page still has Lucene-style queries.

[1]: https://lucidworks.com/customers/reddit/

tmikaeld 1570 days ago

That's kind of my point: Solr and Lucene put the index in memory.

omegalulw 1569 days ago

Can't the index be sharded across a cluster? Possibly with multiple replicas per shard, so that servers can easily go up and down?
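To make the idea concrete, here is a minimal in-process sketch of an index split across shards with several replicas each. It assumes documents are routed by a hash of their ID and that a query is sent to one live replica per shard; the `ShardedIndex` class and its `down` parameter are hypothetical names for illustration, not how Solr or any real cluster is implemented.

```python
import zlib


class ShardedIndex:
    """Toy term index split into shards, each held by several replicas."""

    def __init__(self, num_shards, replicas_per_shard):
        # shards[s][r] is replica r of shard s: a dict of term -> set of doc IDs.
        self.shards = [
            [{} for _ in range(replicas_per_shard)] for _ in range(num_shards)
        ]

    def _shard_for(self, doc_id):
        # Stable hash so a document always lands on the same shard.
        return zlib.crc32(str(doc_id).encode()) % len(self.shards)

    def add(self, doc_id, text):
        # Write the document's terms to every replica of its shard.
        for replica in self.shards[self._shard_for(doc_id)]:
            for term in text.lower().split():
                replica.setdefault(term, set()).add(doc_id)

    def search(self, term, down=()):
        """Scatter the query to one live replica per shard and merge the hits.

        `down` is a collection of (shard, replica) pairs treated as offline.
        """
        hits = set()
        for s, replicas in enumerate(self.shards):
            live = [r for i, r in enumerate(replicas) if (s, i) not in down]
            if not live:
                raise RuntimeError(f"shard {s} has no live replica")
            hits |= live[0].get(term.lower(), set())
        return hits
```

With two replicas per shard, `search` keeps returning the same results when one replica of every shard is marked as down, which is the "servers can go up and down" property. Each node only holds its own slice of the index.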

tmikaeld 1569 days ago

It can, but that would be slow and the cost would be prohibitive.

kevincox 1569 days ago

Really? It's just a bit of text. RAM is fairly cheap these days.

liveoneggs 1569 days ago

Not the entire index, only what is needed for each query.

winrid 1570 days ago

IIRC they have a pretty standard PostGres setup. I'd bet they just set up another PostGres shard, replicated just for search, using an extension for the index. That doesn't require the index or working set to be in RAM.

philliphaydon 1569 days ago

Why do some people write Postgres as PostGres?

__alexs 1569 days ago

Because it's the database that came after Ingres.

philliphaydon 1569 days ago

https://en.wikipedia.org/wiki/Ingres_(database)

https://en.wikipedia.org/wiki/PostgreSQL

Ingres and Postgres/PostgreSQL.

I still don't know /why/ some people say PostGres. I've seen it a few times, on HN and elsewhere, but it's wrong, isn't it?

stavros 1569 days ago

Which isn't spelled InGres, is it?

winrid 1569 days ago

Why do some people call Redis "read-is"? The world may never know...

inciampati 1570 days ago

I doubt that a succinct full text index (like an FM-index) of this data would require more than a modest server to keep in memory. Why aren't these used in this context?

tmikaeld 1570 days ago

But we're talking petabytes of text comments and an index of that would be a lot larger. How do you access that data fast enough to enable search?

inciampati 1569 days ago

Succinct full-text indexes can be substantially smaller than the source text. It depends on the zero-order entropy of the text. If things are highly repetitive, a very small index might be feasible. Usually lookup times are linearly proportional to query size, with logarithmic factors in database size.
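As a rough illustration of why lookup cost scales with query length, here is a toy FM-index with backward search. It is a sketch under simplifying assumptions: the suffix array is built naively, the occurrence table is stored uncompressed (so this version is not actually succinct; real implementations use compressed rank structures), and the function names are made up for this example.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index: the C table and occurrence counts over the BWT."""
    text += "\0"  # sentinel assumed absent from the text; sorts before all else
    # Naive suffix array construction: fine for a demo, far too slow for real data.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[c] = number of characters in the text strictly smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count matches of pattern with one backward-search step per character."""
    c_table, occ, n = index
    lo, hi = 0, n  # current half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences(build_fm_index("abracadabra"), "abra")` returns 2. The loop in `count_occurrences` runs once per pattern character regardless of how long the indexed text is; in a compressed index each `occ` lookup is where the logarithmic factors come in.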

tmikaeld 1569 days ago

I've yet to see such a system in production, except for Sonic, but Sonic doesn't allow for full-text search, only search on a key-by-key basis.

speedgoose 1570 days ago

You can rent many VMs with 2TB of RAM each from various cloud providers.

AWS also offers some high-memory machines with up to 24TB of RAM in some locations.

tmikaeld 1570 days ago

Sure you can do that, but that's a hefty price per user.

gilbetron 1569 days ago

Their job postings mention Solr:

> Experience with Solr or similar search technologies.

So I'm guessing that's most of it, which makes sense given that full-text search is the point of Solr/Lucene!

datalopers 1570 days ago

Inverted indexes do not require that much overhead.
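For reference, a minimal sketch of an inverted index: each term maps to the set of documents containing it, and a multi-term query intersects those sets. The tokenizer and function names are assumptions for illustration; engines like Lucene add compression, positions, and ranking on top of this basic structure.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Intersect the smallest posting sets first to keep intermediate results small.
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result
```

The index stores one entry per distinct term plus one document ID per (term, document) pair, which is why its size tracks vocabulary and document count rather than raw text volume.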

ddorian43 1570 days ago

You can't use ScyllaDB as a search engine.