| I wrote it myself from scratch. I have some metadata in MariaDB, but the index is bespoke. As a design sketch, the index uses one file of sorted URL IDs and one file of N-gram IDs (i.e. words and word pairs) that refer to ranges in the URL file, plus a dictionary that maps words to word IDs; that's a GNU Trove hash map I modified to use memory-mapped data instead of directly allocated arrays. So when you search for two words, it translates them into IDs using the special hash map, then goes to the words file and looks up the URL range for each word. It starts with the less common of the two: it goes through that word's range and does a binary search for each of its URL IDs in the range of the more common word. Then it grabs the first N results and translates them into URLs (through MariaDB), and that's your search result. I'm skipping over a few steps, but that's the very crudest of outlines. |
Definitely want to see more people doing that kind of low-level work instead of falling back to either 'use elasticsearch' or 'you can't, you're not google'.
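
For anyone who wants to play with the idea, here is a rough Python sketch of the two-word lookup described in the quote. It is not the author's code: the in-memory list and dicts stand in for the URL file, the words file and the memory-mapped hash map, and every name (`url_ids`, `word_ranges`, `dictionary`, `search`) is made up for illustration. The final step of translating IDs into URLs through the database is left out.

```python
from bisect import bisect_left

# Hypothetical stand-ins for the on-disk structures in the quote.
# url_ids plays the role of the URL file: each word owns a sorted slice of it.
url_ids = [2, 5, 9, 14, 1, 2, 3, 5, 8, 9, 13, 21]
# word ID -> half-open (start, end) range into url_ids (the "words file").
word_ranges = {1: (0, 4), 2: (4, 12)}
# word -> word ID (the dictionary / hash map).
dictionary = {"rare": 1, "common": 2}


def search(word_a, word_b, limit=10):
    """Return up to `limit` URL IDs that appear in both words' ranges."""
    try:
        range_a = word_ranges[dictionary[word_a]]
        range_b = word_ranges[dictionary[word_b]]
    except KeyError:
        return []  # an unknown word cannot match anything

    # Walk the shorter (less common) range, probe the longer one.
    short, long_ = sorted((range_a, range_b), key=lambda r: r[1] - r[0])
    lo, hi = long_

    results = []
    for i in range(short[0], short[1]):
        target = url_ids[i]
        # Binary search restricted to the more common word's slice.
        j = bisect_left(url_ids, target, lo, hi)
        if j < hi and url_ids[j] == target:
            results.append(target)
            if len(results) == limit:
                break
    return results


print(search("rare", "common"))  # [2, 5, 9]
```

Walking the shorter range keeps the cost at roughly (size of the rare range) × log(size of the common range), which is presumably why the less common word is picked first.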