by n1xis10t
30 days ago
You’re Canadian? That’s pretty hilarious, I am too. It must be something they put in the Timbits. For promotion, I’d recommend picking the most technically interesting part of your implementation, something that’s really clever, and then putting together a one-page writeup about it for Paged Out magazine (https://pagedout.institute/). They regularly have interesting stuff to read, and they have a decent number of readers. You could also write something longer and send it to 2600 magazine; they’d probably be interested even if it was an overview of the project. Maybe the engine should be bigger first, though, so people are more enthused when they try it. I think 1 billion pages is around where a search engine starts to seem more normal: that’s about how much Marginalia has. How much space on disk does your index take up right now? Would you say the bottleneck is more the hard drive space or the crawling speed?
|
My disk space situation is very bad. Like I said, I’m running on a Beelink EQR5 with only 500GB of storage. If my estimates are right, which they probably aren’t, it would take around 5TB for 50 million documents. So maybe 150TB or so should be enough for a billion.
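As a rough sanity check on those storage numbers, here is a minimal sketch of the extrapolation. It assumes index size grows linearly with document count and uses decimal terabytes; the function and variable names are made up for illustration and are not from the project.

```python
# Back-of-the-envelope storage extrapolation.
# Assumption: index size scales linearly with the number of documents.
TB = 10**12  # decimal terabyte, in bytes
GB = 10**9   # decimal gigabyte, in bytes


def bytes_per_doc(sample_bytes, sample_docs):
    """Average on-disk cost of one document, from a sample estimate."""
    return sample_bytes / sample_docs


def projected_tb(target_docs, sample_bytes, sample_docs):
    """Projected index size in TB for target_docs documents."""
    return bytes_per_doc(sample_bytes, sample_docs) * target_docs / TB


# Estimate quoted in the comment: about 5TB for 50 million documents.
per_doc = bytes_per_doc(5 * TB, 50_000_000)
billion_tb = projected_tb(1_000_000_000, 5 * TB, 50_000_000)
docs_on_500gb = (500 * GB) / per_doc

print(f"per document: {per_doc / 1000:.0f} KB")
print(f"1 billion documents: {billion_tb:.0f} TB")
print(f"documents that fit in 500GB: {docs_on_500gb:,.0f}")
```

Under that linear assumption a billion documents comes out to about 100TB, so the 150TB figure leaves roughly 50% headroom, and the current 500GB drive would hold on the order of 5 million documents.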
Crawling speed is also a limiting factor, although not because of network speed. My internet is fine, but my crawler currently only crawls at around 1 million pages a day. It should have crawled around 30 million pages by now since it’s been running for over a month, but when I check the batches it’s produced, it’s probably closer to 2 million max, so there’s clearly a major issue somewhere.
Along with that bug, I also need to make the batch processor much faster. It processes documents and also adds embeddings using BERT, which takes up a significant amount of time. So it doesn’t index 1 million a day, maybe only 30k/day, which is obviously something I really need to improve.
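To put the crawl and indexing rates above side by side, here is a small sketch of the arithmetic. It treats "over a month" as 30 days and takes the 2 million and 30k/day figures at face value; both are rough estimates, and the helper names are hypothetical.

```python
# Rough throughput arithmetic for the rates quoted above.
# Assumptions: 30 days of runtime, constant rates, 1 billion page target.
TARGET_PAGES = 1_000_000_000
DAYS_RUNNING = 30


def days_to_reach(target_pages, pages_per_day):
    """Days needed to process target_pages at a constant daily rate."""
    return target_pages / pages_per_day


nominal_crawl_rate = 1_000_000               # pages/day the crawler should manage
observed_crawl_rate = 2_000_000 / DAYS_RUNNING  # pages/day implied by the batches
index_rate = 30_000                          # pages/day through the batch processor

for label, rate in [
    ("nominal crawl", nominal_crawl_rate),
    ("observed crawl", observed_crawl_rate),
    ("indexing", index_rate),
]:
    days = days_to_reach(TARGET_PAGES, rate)
    print(f"{label}: {rate:,.0f} pages/day, {days:,.0f} days ({days / 365:.1f} years)")
```

Even at the nominal 1 million pages a day, a billion pages is about 1,000 days of crawling, and at 30k/day the indexing side is the slower stage by a wide margin.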
If the project keeps growing, I’ll probably eventually move to something much better than my current setup.