| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by josefcullhed 1553 days ago

The index we are running right now are all URLs in commoncrawl from 2021 but only URLs with direct links to them. This is mostly because we would need more servers to index more URLs and that would increase the cost.

It takes us a couple of days to build the index but we have been coding this for about 1 year.

All the indexes are on disk.

1 comments

kreeben 1553 days ago

>> All the indexes are on disk.

Love it. Makes for a cheaper infrastructure, since SSD is cheaper than RAM.

>> It takes us a couple of days to build the index

It's hard for me to see how that could be done much faster unless you find a way to parallelize the process, which in itself is a terrifyingly hard problem.

I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing? According to you, what kind of data structure allows for the fastest indexing and how do you represent it on disk so that you can read your on-disk index in a forward-only mode or "as fast as possible"?

link

josefcullhed 1553 days ago

Yes it would be impossible to keep the index in RAM.

>> It's hard for me to see how that could be done much faster unless you find a way to parallelize the process

We actually parallelize the process. We do it by separating the URLs to three different servers and indexing them separately. Then we just make the searches on all three servers and merges the result URLs.

>> I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing?

It is not very complicated, we use hashes a lot to simplify things. The index is basically a really large hash table with the word_hash -> [list of url hashes] Then if you search for "The lazy fox" we just take the intersection between the three lists of url hashes to get all the urls which have all words in them. This is the basic idea that is implemented right now but we will of course try to improve.

details are here: https://github.com/alexandria-org/alexandria/blob/main/src/i...

link

kreeben 1553 days ago

I realize I'm asking for a free ride here, but could you explain what happens after the index scan? In a phrase search you'd need to intersect, union or remove from the results. Are you using roaring bitmaps or something similar?

link

josefcullhed 1553 days ago

We are currently just doing an intersection and then we make a lookup in a forward index to get the urls, titles and snippets.

I actually don't know what roaring bitmaps are, please enlighten me :)

link

kreeben 1553 days ago

If you are solely supporting union or solely supporting intersection then roaring bitmaps is probably not a perfect solution to any of your problems.

There are some algorithms that have been optimized for intersect, union, remove (OR, AND, NOT) that work extremely well for sorted lists but the problem is usually: how to efficiently sort the lists that you wish to perform boolean operations on, so that you can then apply the roaring bitmap algorithms on them.

https://roaringbitmap.org/

link

marginalia_nu 1552 days ago

Roaring Bitmaps are awesome. I use them when merging indices. I need to know which items to keep from the old index, so I'm calculating the intersection between two sets of a cardinality around 500,000,000. Without breaking a sweat.

link