


	by santamex 1736 days ago
	Which software do you use to index the sites?


marginalia_nu 1736 days ago

I wrote it myself from scratch. I have some metadata in mariadb, but the index is bespoke.

A design sketch of the index: it uses one file with sorted URL IDs, and one with IDs of N-grams (i.e. words and word-pairs) referring to ranges in the URL file, as well as a dictionary for relating words to word-IDs. The dictionary is a GNU Trove hash map I modified to use memory-mapped data instead of directly allocated arrays.

So when you search for two words, it translates them into IDs using the special hash map, goes to the words file and finds the least common of the words; starts with that.

Then it goes to the words file and looks up the URL range of the first word.

Then it goes to the words file and looks up the URL range of the second word.

Then it goes through the less common word's range and does a binary search for each of those in the range of the more common word.

Then it grabs the first N results, and translates them into URLs (through mariadb); and that's your search result.

I'm skipping over a few steps, but that's the very crudest of outlines.
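The outline above can be sketched roughly as follows. This is only an illustration and not the actual implementation: the names `word_ids`, `word_ranges`, `url_ids`, `url_table` and `search` are made up for the example, plain Python dicts and lists stand in for the memory-mapped hash map, the two index files and the mariadb lookup, and the sample data is invented.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-ins for the structures described above.
word_ids = {"search": 0, "engine": 1, "index": 2}   # dictionary: word -> word ID
word_ranges = {0: (0, 3), 1: (3, 8), 2: (8, 10)}     # words file: word ID -> [start, end) in url_ids
url_ids = [2, 4, 9,  1, 2, 4, 6, 9,  4, 9]           # URLs file: sorted within each range
url_table = {i: f"https://example.com/page{i}" for i in (1, 2, 4, 6, 9)}  # stands in for mariadb


def contains(start, end, url_id):
    """Binary search for url_id inside url_ids[start:end]."""
    i = bisect_left(url_ids, url_id, start, end)
    return i < end and url_ids[i] == url_id


def search(query, limit=10):
    ranges = []
    for word in query.split():
        wid = word_ids.get(word)
        if wid is None:          # unknown word: nothing can match
            return []
        ranges.append(word_ranges[wid])
    if not ranges:
        return []
    # Start with the least common word, i.e. the shortest range.
    ranges.sort(key=lambda r: r[1] - r[0])
    (start, end), rest = ranges[0], ranges[1:]
    hits = []
    for url_id in url_ids[start:end]:
        if all(contains(s, e, url_id) for s, e in rest):
            hits.append(url_table[url_id])   # translate ID -> URL
            if len(hits) == limit:           # grab only the first N results
                break
    return hits


print(search("search engine index"))
```

Sorting the ranges by length before intersecting is what keeps the number of binary searches down to the size of the rarest word's range.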

q3k 1736 days ago

Good stuff. I've also been toying with doing some homegrown search engine indexing (as an exercise in scalable systems), and this is a fantastic result and great inspiration.

Definitely want to see more people doing that kind of low-level work instead of falling back to either 'use elasticsearch' or 'you can't, you're not google'.

marginalia_nu 1736 days ago

Well just crunching the numbers should indicate what is possible and what isn't.

For the moment I have just south of 20 million URLs indexed.

1 x 20 million bytes = 20 MB.

10 x 20 million bytes = 200 MB.

100 x 20 million bytes = 2 GB.

1,000 x 20 million bytes = 20 GB.

10,000 x 20 million bytes = 200 GB.

100,000 x 20 million bytes = 2 TB.

1,000,000 x 20 million bytes = 20 TB.

This is still within what consumer hardware can deal with. It's getting expensive, but you don't need a datacenter to store 20 TB worth of data.

How many bytes do you need, per document, for an index? Do you need 1 MB of data to store index information about a page that, in terms of text alone, is perhaps 10 KB?
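The same back-of-the-envelope arithmetic as a small script, in case anyone wants to plug in their own numbers. The function names are made up for the example, the 20 million figure is rounded from "just south of 20 million", and units are decimal (1 KB = 1000 bytes).

```python
URLS = 20_000_000  # rounded from "just south of 20 million"


def index_size_bytes(bytes_per_doc, urls=URLS):
    """Total index size if every document costs bytes_per_doc bytes."""
    return bytes_per_doc * urls


def human(n):
    """Format a byte count using decimal units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000 or unit == "TB":
            return f"{n:g} {unit}"
        n /= 1000


for per_doc in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{per_doc:>9,} bytes/doc -> {human(index_size_bytes(per_doc))}")
```

At 10,000 bytes per document, roughly the size of the page text itself, the total comes out at 200 GB.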

freediver 1733 days ago

What crawler are you using and what kind of crawling speeds are you achieving?

How do you rank the results (is it based on content only) or you have external factors too?

What is your personal preferred search option of the 7 and why?

Thanks for making something unique and sorry that despite all the hype this got, you got only $39/month on Patreon. It is telling in a way.

marginalia_nu 1733 days ago

> What crawler are you using and what kind of crawling speeds are you achieving?

Custom crawler, and I seem to get around 100 documents per second at best, maybe closer to 50 on average. Depends a bit on how many crawl-worthy websites it finds, and there are definitely diminishing returns as it goes deeper.

> How do you rank the results (is it based on content only) or you have external factors too?

I rank based on a pretty large number of factors, including incoming links weighted by the "textiness" of the source domain, and similarity to the query.

> What is your personal preferred search option of the 7 and why?

I honestly use Google for a lot. My search engine isn't meant as a replacement, but a complement.

> Thanks for making something unique and sorry that despite all the hype this got, you got only $39/month on Patreon. It is telling in a way.

Are you kidding? I think the Patreon is a resounding success! I'm still a bit stunned. I've gotten more support and praise, not just in terms of money but also emails and comments here than I could have ever dreamed possible.

And this is just the start, too. I only recently got the search engine working this well. I have no doubt it can get much better. The fact that I have 11 people with me on that journey, even if they "just" pay my power bill, that's amazing.

I'm honestly a bit at a loss for words.

freediver 1733 days ago

You have a great attitude!

And I am not kidding. I think for something that got so much attention on HN, where realistically this kind of product can only exist for now, the 'conversion' rate was very low. Billion-dollar companies were made from HN threads with a lot less engagement. Makes me wonder: do we really want a search engine like this, or do we just like the idea of it?

And what are the barriers to use something like this? You say yourself that you are using Google most of the time. Is jumping to check results on this engine going to be too much friction for most uses?

Can something like this exist in isolation? What kind of value would it need to provide for users to remember using it en masse as an additional/primary vertical search like they do for Amazon?

Just thinking out loud, as I am also interested in the space (through http://teclis.com).

fuzzfactor 1733 days ago

The day Google first appeared on the full internet it was excellent of course because it had no ads.

Plus another excellent feature was you would get the same search results no matter who or where you were for quite some period of calendar time.

If something new did appear it was likely to be one of the new sites that was popping up all the time and it was likely to be as worthwhile as its established associates on the front page.

You shouldn't need to crawl nearly as fast if you can compensate by treading more suitably where those have gone before.

bullen 1733 days ago

Interesting, in my database (http://root.rupy.se) I have one file per word that contains the ids (long) of the nodes (URLs), so to search many words together I have to go through the first file and one by one see if I find matches in the second.

How does the range binary search work? Does it just prune out the overlaps, how efficient is it, and how much data do you have in there for, say, "hello" and "world"?

Aeolun 1735 days ago

I’m not sure how you go from word to url range? Range implies contiguous, but how can you make that happen for a bunch of words without keeping track of a list of urls for each word (or URL ids, the idea is the same)?

marginalia_nu 1735 days ago

The trick is that the list of URLs for each word already is in the URLs file.

The URLs in a range are sorted. A sorted list (or list-range) forms an implicit set-like data structure, where you can do binary searches to test for existence.

Consider a words file with two words, "hello" and "world", corresponding to the ranges (0,3), (3,6). The URLs file contains URLs 1, 5, 7, 2, 5, 8.

The first range corresponds to the URLs 1, 5, 7; and the second 2, 5, 8.

If you search for hello world, it will first pick a range, the range for "hello", let's say (1,5,7); and then do binary searches in the second range -- the range corresponding to "world" -- (2,5,8) to find the overlap.

This seems like it would be very slow, but since you can trivially find the size of the ranges, it's possible to always do them in an order of increasing range-sizes. 10 x log(100000) is a lot smaller than 100000 x log(10)
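To put rough numbers on that comparison, here is a tiny sketch. It assumes a base-2 logarithm (the comment doesn't say which base) and treats one binary search as about log2(n) comparisons; `probe_cost` is a name made up for the example.

```python
import math


def probe_cost(outer, inner):
    """Rough comparison count: one binary search in `inner` per item in `outer`."""
    return outer * math.log2(inner)


small_first = probe_cost(10, 100_000)   # iterate the short range, search the long one
large_first = probe_cost(100_000, 10)   # the other way around

print(f"short range first: ~{small_first:.0f} comparisons")
print(f"long range first:  ~{large_first:.0f} comparisons")
```

Under these assumptions the two orderings differ by a factor of about 2000.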

bullen 1733 days ago

Hm, ok I understand more but how do you perform the "binary search", just loop over the URL ids?

Funny I also selected "hello" and "world" above! Xo

My system is also written in Java btw!

Here are example results of my word search:

http://root.rupy.se/node/data/word/four

http://root.rupy.se/node/data/word/only

etc.

marginalia_nu 1733 days ago

I'll get back to your email in a while, I've got a ton of messages I'm working through.

But yeah, in short pseudocode:

  for url in range-for-"hello":
    if binary-search (range-for-"world", url):
      yield url

I do use streams, but that is the bare essence of it.
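A runnable take on that pseudocode, using the "hello"/"world" layout from earlier in the thread. It is a sketch only: a plain list and a generator stand in for the files and streams, and the names `urls`, `ranges`, `in_range` and `intersect` are made up for the example.

```python
from bisect import bisect_left

# One flat "URLs file"; each word owns a sorted [start, end) slice of it.
urls = [1, 5, 7, 2, 5, 8]
ranges = {"hello": (0, 3), "world": (3, 6)}


def in_range(word, url):
    """Binary search for url within the slice of `urls` belonging to `word`."""
    start, end = ranges[word]
    i = bisect_left(urls, url, start, end)
    return i < end and urls[i] == url


def intersect(first, second):
    """Yield every URL ID in `first`'s range that also occurs in `second`'s."""
    start, end = ranges[first]
    for url in urls[start:end]:
        if in_range(second, url):
            yield url


print(list(intersect("hello", "world")))  # prints [5]
```

So the "binary search" never loops over the second word's IDs; it halves the search window within that word's range until it either lands on the ID or runs out of range.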

soheil 1731 days ago

So every time you insert a new URL for a word you have to update the range for every other single word since the URL file will be shifted?

soheil 1731 days ago

Are the n-grams always at most n=2 bigrams?

marginalia_nu 1731 days ago

No, I actually count the n-grams as distinct words (up to 4-grams). The main limiter for that is space, so I only extract "canned" n-grams from some tags.

I would first search for the bigram hello_world, which is an O(1) array lookup, and then for documents merely containing the words hello and world (usually not a good search result); the latter is the algorithm I'm describing in the parent comment.
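As a rough illustration of that ordering (bigram hits first, then documents that merely contain both words), here is a simplified sketch. The `postings` dict, its contents and the helper names are invented for the example; a real index would use the word-ID dictionary and range files rather than Python lists.

```python
from bisect import bisect_left

# Hypothetical postings: term -> sorted URL IDs. The bigram is stored as its own "word".
postings = {
    "hello_world": [5],
    "hello": [1, 5, 7, 9],
    "world": [2, 5, 8, 9],
}


def has(sorted_ids, url_id):
    """Binary search membership test on a sorted list of IDs."""
    i = bisect_left(sorted_ids, url_id)
    return i < len(sorted_ids) and sorted_ids[i] == url_id


def search(first, second):
    # 1. Direct lookup of the bigram, treated as a single term.
    phrase_hits = postings.get(f"{first}_{second}", [])
    # 2. Fallback: intersect the two single-word lists, shortest first.
    a, b = sorted((postings.get(first, []), postings.get(second, [])), key=len)
    loose_hits = [u for u in a if has(b, u) and not has(phrase_hits, u)]
    return phrase_hits + loose_hits


print(search("hello", "world"))  # prints [5, 9]
```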

soheil 1731 days ago

Makes sense. Every time you insert a new URL for a word you have to update the ranges for every other word since the URL file will be shifted?

rvnx 1736 days ago

It's a great project!