by zokier 3010 days ago
> After a month I had most newsgroups - excepting binaries - and it came to 800GB.

> Trying to index THAT lot was impossible

Stupid question, but why would indexing 800GB of newsgroup postings be impossible?

1 comment

CLucene was way too slow for body text on anything more than 1GB. I had my own header parser in C++ (though you can do that in Python easily).
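For anyone wondering what the Python version of a header parser might look like, here is a minimal sketch using the standard library's `email.parser`. It is not the C++ parser mentioned above; the function name `parse_headers`, the choice of fields, and the sample article are all made up for illustration.

```python
from email.parser import HeaderParser


def parse_headers(raw_article: str) -> dict:
    """Pull a few headers out of a raw article (headers, blank line, body).

    Hypothetical helper: the selected fields are an assumption, not the
    original poster's schema.
    """
    # HeaderParser stops at the first blank line and does not parse the body.
    msg = HeaderParser().parsestr(raw_article)
    newsgroups = msg.get("Newsgroups") or ""
    return {
        "message_id": msg.get("Message-ID"),
        "subject": msg.get("Subject"),
        "from": msg.get("From"),
        # A crossposted article lists several groups separated by commas.
        "newsgroups": [g.strip() for g in newsgroups.split(",") if g.strip()],
    }


# Invented sample article, for illustration only.
sample = (
    "From: alice@example.com\n"
    "Newsgroups: comp.lang.c, comp.lang.python\n"
    "Subject: Hello\n"
    "Message-ID: <1@example.com>\n"
    "\n"
    "Body text here.\n"
)

headers = parse_headers(sample)
print(headers["message_id"], headers["newsgroups"])
```

Real postings have messier headers (odd encodings, folded lines, missing fields), so a production parser would need more defensive handling than this.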

I'm trying again on that 800GB with KISS DB (an append-only hash table) and Elasticsearch. It doesn't matter if it's GPL, because it's a website.
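To illustrate the general idea of an append-only key/value store, here is a toy sketch in Python. It is not KISS DB's on-disk format or API: the class name `AppendOnlyStore`, the record layout, and the in-memory index are all assumptions chosen to keep the example short.

```python
import os
import struct


class AppendOnlyStore:
    """Toy append-only store (hypothetical, not KISS DB).

    Records are only ever appended to the file. An in-memory dict maps each
    key to the file offset of its most recent record.
    """

    _HEADER = struct.Struct("<II")  # key length, value length

    def __init__(self, path):
        self._index = {}
        # "a+b" creates the file if needed and forces all writes to the end.
        self._file = open(path, "a+b")
        self._rebuild_index()

    def _rebuild_index(self):
        # Scan existing records so a reopened file is usable again.
        self._file.seek(0)
        while True:
            offset = self._file.tell()
            header = self._file.read(self._HEADER.size)
            if len(header) < self._HEADER.size:
                break
            key_len, value_len = self._HEADER.unpack(header)
            key = self._file.read(key_len)
            self._file.seek(value_len, os.SEEK_CUR)  # skip the value
            self._index[key] = offset  # later records win

    def put(self, key: bytes, value: bytes) -> None:
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(self._HEADER.pack(len(key), len(value)) + key + value)
        self._file.flush()
        self._index[key] = offset

    def get(self, key: bytes):
        offset = self._index.get(key)
        if offset is None:
            return None
        self._file.seek(offset)
        key_len, value_len = self._HEADER.unpack(self._file.read(self._HEADER.size))
        self._file.seek(key_len, os.SEEK_CUR)  # skip the key
        return self._file.read(value_len)

    def close(self) -> None:
        self._file.close()
```

Keying records by Message-ID would be one way to use something like this for articles, with the full-text search left to Elasticsearch. Note that the index here lives in RAM, which would itself become a concern at the scale discussed in this thread.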

Do you mind sharing the code? I think that would be interesting to see.