by vigna 3346 days ago

I don't know how this project ended up here at this moment, but as one of the authors, let me answer the main questions.

1) The name is just a coincidence. I originally learned about indexing from the "Managing Gigabytes" book, and that's the reason for the name, but the book is now completely obsolete, and even at that time it contained a significant number of red herrings. There's no connection, and no sharing of code or ideas of any kind.

2) MG4J is our playground for doing research in information retrieval. This means, for example, that we designed new data structures, such as Elias-Fano indexing, which make MG4J ridiculously fast in benchmarks (see https://github.com/lintool/IR-Reproducibility). Elias-Fano is now the main Facebook indexing algorithm and it is slowly percolating into Lucene (look in the sources).
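For readers who haven't met Elias-Fano before, here is a minimal sketch of the basic encoding of a sorted list of document pointers. It is only an illustration, not MG4J's implementation: the class name `EliasFanoSketch` and its methods are made up for this example, and it decodes sequentially, whereas a real index adds skipping and constant-time selection on top. Each value is split into its lowest l = floor(log2(universe / n)) bits, stored verbatim, and a high part, stored in unary in a bit vector.

```java
import java.util.BitSet;

/** Illustrative Elias-Fano encoding of a sorted list of non-negative ints (hypothetical class, not MG4J code). */
final class EliasFanoSketch {
    private final int n;        // number of encoded elements
    private final int lowBits;  // l: number of low bits stored verbatim per element
    private final BitSet lower; // n * l bits: the low parts, concatenated
    private final BitSet upper; // high parts in unary: bit (high_i + i) is set for element i

    /** Assumes sorted is non-decreasing and every value is smaller than universe. */
    EliasFanoSketch(int[] sorted, int universe) {
        n = sorted.length;
        int ratio = n == 0 ? 0 : universe / n;
        // l = floor(log2(universe / n)), or 0 when the list is dense
        lowBits = ratio <= 1 ? 0 : 31 - Integer.numberOfLeadingZeros(ratio);
        lower = new BitSet();
        upper = new BitSet();
        for (int i = 0; i < n; i++) {
            int x = sorted[i];
            for (int b = 0; b < lowBits; b++) {
                if (((x >>> b) & 1) != 0) lower.set(i * lowBits + b);
            }
            // The high part is non-decreasing, so adding i makes positions strictly increasing.
            upper.set((x >>> lowBits) + i);
        }
    }

    /** Decodes the whole list by scanning the set bits of the upper vector in order. */
    int[] decode() {
        int[] out = new int[n];
        int pos = -1;
        for (int i = 0; i < n; i++) {
            pos = upper.nextSetBit(pos + 1);
            int high = pos - i; // undo the "+ i" added at encoding time
            int low = 0;
            for (int b = 0; b < lowBits; b++) {
                if (lower.get(i * lowBits + b)) low |= 1 << b;
            }
            out[i] = (high << lowBits) | low;
        }
        return out;
    }

    /** Bits used by the two vectors, ignoring object overhead. */
    int bitsUsed() {
        return n * lowBits + upper.length();
    }
}
```

Calling `new EliasFanoSketch(pointers, numberOfDocuments).decode()` should give back the original list; the speed in practice comes from being able to jump into the upper bit vector without decoding everything before the target.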

3) You can define your queries using a very rich interval language with a very fast implementation based on new algorithms. You can easily create parallel indices with text and tagging and ask whether a phrase falls into an area tagged as "location", for example.
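To give an idea of what the "phrase inside a tagged area" example means at the level of intervals, here is a rough sketch. It is not MG4J's query syntax or its algorithms: the class `IntervalContainmentSketch` and the method `containedIn` are invented for illustration, intervals are plain `[start, end]` pairs of inclusive token positions, and the tagged areas are assumed to be sorted and non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative interval containment over token positions (hypothetical class, not MG4J code). */
final class IntervalContainmentSketch {
    /**
     * Returns the phrase intervals that lie entirely inside some tagged interval.
     * Assumes phrases are sorted by start, and tagged intervals are sorted and disjoint.
     */
    static List<int[]> containedIn(int[][] phrases, int[][] tagged) {
        List<int[]> result = new ArrayList<>();
        int t = 0;
        for (int[] p : phrases) {
            // Skip tagged areas that end before this phrase starts; they cannot contain it
            // or any later phrase, so the pointer never moves backwards.
            while (t < tagged.length && tagged[t][1] < p[0]) t++;
            // With disjoint tagged areas, only the current one can contain the phrase.
            if (t < tagged.length && tagged[t][0] <= p[0] && p[1] <= tagged[t][1]) {
                result.add(p);
            }
        }
        return result;
    }
}
```

Here `phrases` would come from the text index and `tagged` from the parallel index of "location" annotations, both for the same document.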

4) MG4J is a project of two people and at this time I'm the only maintainer. You cannot expect it to be as refined as Lucene or Solr. But you can very easily hack into it (even without modifying the sources), which is why it has been popular with people experimenting with indexing. For example, there are many tools to manipulate indices: splitting them with a specified strategy, combining them, etc.

5) So if you want an out-of-the-box solution for indexing, forget about it. If you want a fun playground for doing research or a very efficient backbone on which to build your infrastructure, MG4J might be useful to you. We used it recently for http://wikirank.di.unimi.it/ .