| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by atrettel 161 days ago

I recently wrote a command-line full-text search engine [1]. I needed to implement an inverted index. I choose what seems like the "dumb" solution at first glance: a trie (prefix tree).

There are "smarter" solutions like radix tries, hash tables, or even skip lists, but for any design choice, you also have to examine the tradeoffs. A goal of my project is to make the code simpler to understand and less of a black box, so a simpler data structure made sense, especially since other design choices would not have been all that much faster or use that much less memory for this application.

I guess the moral of the story is to just examine all your options during the design stage. Machine learning solutions are just that, another tool in the toolbox. If another simpler and often cheaper solution gets the job done without all of that fuss, you should consider using it, especially if it ends up being more reliable.

[1] https://github.com/atrettel/wosp

3 comments

zahlman 154 days ago

> I choose what seems like the "dumb" solution at first glance: a trie (prefix tree).

> There are "smarter" solutions like... hash tables.... A goal of my project is to make the code simpler to understand and less of a black box, so a simpler data structure made sense, especially since other design choices would not have been all that much faster or use that much less memory for this application.

Strangely, my own software-related answer is the opposite for the same reason.

I was implementing something for which I wanted to approximate a https://en.wikipedia.org/wiki/Shortest_common_supersequence , and my research at the time led me to a trie-based approach. But I was working in Python, and didn't want to actually define a node class and all the logic to build the trie, so I bodged it together with a dict (i.e., a hash table).

link

bawis 161 days ago

What body of knowledge (books, tutorials etc) did you use while developing it?

link

atrettel 161 days ago

Before I started the project, I was already vaguely familiar with the notion of an inverted index [1]. That small bit of knowledge meant that I knew where to start looking for more information and saved me a ton of time. Inverted indices form the bulk of many search engines, with the big unknown being how you implement it. I just had to find an adequate data structure for my application.

To figure that out, I remember searching for articles on how to implement inverted indices. Once I had a list of candidate strategies and data structures, I used Wikipedia supplemented by some textbooks like Skiena's [2] and occasionally some (somewhat outdated) information from NIST [3]. I found Wikipedia quite detailed for all of the data structures for this problem, so it was pretty easy to compare the tradeoffs between different design choices here. I originally wanted to implement the inverted index as a hash table but decided to use a trie because it makes wildcard search easier to implement.

After I developed most of the backend, I looked for books on "information retrieval" in general. I found a history book (Bourne and Hahn 2003) on the development of these kind of search systems [4]. I read some portions of this book, and that helped confirm many of the design choices that I made. I actually was just doing what people traditionally did when they first built these systems in the 1960s and 1970s, albeit with more modern tools and much more information on hand.

The harder part of this project for me was writing the interpreter. I actually found YouTube videos on how to write recursive descent parsers to be the most helpful there, particular this one [5]. Textbooks were too theoretical and not concrete enough, though Crafting Interpreters was sometimes helpful [6].

[1] https://en.wikipedia.org/wiki/Inverted_index

[2] https://doi.org/10.1007/978-3-030-54256-6

[3] https://xlinux.nist.gov/dads/

[4] https://doi.org/10.7551/mitpress/3543.001.0001

[5] https://www.youtube.com/watch?v=SToUyjAsaFk

[6] https://craftinginterpreters.com/

link

bawis 160 days ago

Thanks for detailing, how much time you invested in it?

link

atrettel 160 days ago

I spent around 170 hours on this so far, with only 60% of that being coding. The rest was mostly research or writing.

link

an-allen 154 days ago

Similar I have a script that has the following format: “q replace all onstances of http: with https: in all txt files recurisvely”

And it goes the ChatGPT comes back with and runs the appropriate command.

link