by q3k
1736 days ago
Good stuff. I've also been toying with doing some homegrown search engine indexing (as an exercise in scalable systems), and this is a fantastic result and great inspiration. Definitely want to see more people doing that kind of low-level work instead of falling back to either 'use elasticsearch' or 'you can't, you're not google'.
For the moment I have just south of 20 million URLs indexed.
1 x 20 million bytes = 20 MB.
10 x 20 million bytes = 200 MB.
100 x 20 million bytes = 2 GB.
1,000 x 20 million bytes = 20 GB.
10,000 x 20 million bytes = 200 GB.
100,000 x 20 million bytes = 2 TB.
1,000,000 x 20 million bytes = 20 TB.
This is still within what consumer hardware can deal with. It's getting expensive, but you don't need a datacenter to store 20 TB worth of data.
How many bytes do you need, per document, for an index? Do you need 1 MB of data to store index information about a page that, in terms of text alone, is perhaps 10 KB?
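The back-of-the-envelope table above can be reproduced with a few lines of Python. This is only a rough sketch: the function names are made up for illustration, the document count is rounded up to 20 million, and the units are assumed to be decimal (1 MB = 10^6 bytes), which is what the figures above imply.

```python
# Rough sizing sketch: total index size = documents x bytes per document.
# Assumes decimal units (1 KB = 10**3 bytes, 1 MB = 10**6 bytes, ...).

NUM_DOCS = 20_000_000  # "just south of 20 million", rounded up

UNITS = (("TB", 10**12), ("GB", 10**9), ("MB", 10**6), ("KB", 10**3))


def index_size_bytes(num_docs: int, bytes_per_doc: int) -> int:
    """Total index size in bytes for a given per-document budget."""
    return num_docs * bytes_per_doc


def human(n_bytes: int) -> str:
    """Format a byte count using the largest decimal unit that fits."""
    for unit, factor in UNITS:
        if n_bytes >= factor:
            return f"{n_bytes / factor:g} {unit}"
    return f"{n_bytes} B"


# Per-document budgets from 1 byte up to 1 MB, in powers of ten.
for bytes_per_doc in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    total = index_size_bytes(NUM_DOCS, bytes_per_doc)
    print(f"{bytes_per_doc:>9,} bytes/doc -> {human(total)}")
```

Even the most generous row, 1 MB of index data per document, comes out at 20 TB for 20 million documents, so the per-document budget is the number that actually decides whether this fits on consumer hardware.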