by josefcullhed
1553 days ago
The index we are running right now covers all URLs in Common Crawl from 2021, but only URLs that have direct links pointing to them. This is mostly because we would need more servers to index more URLs, and that would increase the cost. It takes us a couple of days to build the index, but we have been coding this for about a year. All the indexes are on disk.
Love it. Makes for a cheaper infrastructure, since SSD is cheaper than RAM.
>> It takes us a couple of days to build the index
It's hard for me to see how that could be done much faster unless you find a way to parallelize the process, which in itself is a terrifyingly hard problem.
I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing? In your view, what kind of data structure allows for the fastest indexing, and how do you represent it on disk so that you can read the on-disk index in forward-only mode, i.e. "as fast as possible"?
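
To make the question concrete, here is a minimal sketch of what I mean by a forward-only on-disk index. It is purely an illustration and not a guess at your actual format: the layout (terms in sorted order, each followed by a delta-encoded, varint-compressed list of document IDs) and the names `write_index` and `scan_index` are my own assumptions.

```python
import io


def write_varint(out, n):
    """Write a non-negative integer as a base-128 varint (7 bits per byte)."""
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.write(bytes([byte | 0x80]))  # high bit set: more bytes follow
        else:
            out.write(bytes([byte]))
            return


def read_varint(inp):
    """Read one varint; return None on a clean end of stream."""
    shift = 0
    result = 0
    while True:
        b = inp.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        byte = b[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_index(out, postings):
    """Hypothetical layout: for each term in sorted order, write
    len(term), term bytes, number of doc IDs, then the gaps between
    consecutive sorted doc IDs."""
    for term in sorted(postings):
        doc_ids = sorted(set(postings[term]))
        data = term.encode("utf-8")
        write_varint(out, len(data))
        out.write(data)
        write_varint(out, len(doc_ids))
        prev = 0
        for doc_id in doc_ids:
            write_varint(out, doc_id - prev)  # small gaps compress well
            prev = doc_id


def scan_index(inp):
    """Yield (term, doc_ids) in a single forward pass, with no seeking."""
    while True:
        length = read_varint(inp)
        if length is None:
            return
        term = inp.read(length).decode("utf-8")
        count = read_varint(inp)
        doc_ids = []
        prev = 0
        for _ in range(count):
            prev += read_varint(inp)  # undo the delta encoding
            doc_ids.append(prev)
        yield term, doc_ids


buf = io.BytesIO()
write_index(buf, {"web": [7, 3, 300], "crawl": [3]})
buf.seek(0)
for term, doc_ids in scan_index(buf):
    print(term, doc_ids)
```

Because every record carries its own length prefix, the reader never has to seek backwards, which is the property I am asking about. A real index would presumably also need a term dictionary with offsets so a query can jump straight to one posting list.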