Hacker News
by nathancahill 3903 days ago
Do you have any links to share? I'm working on a side project: a vertical search engine for programmers. I'm curating sites to crawl, covering source code, docs, mailing lists, Q&A, IRC and tutorials.

Trying to get away from the "W3Schools effect" [0], where outdated, terribly presented information or downright spammy pages are locked in the top results of Google by virtue of being around for so long, or by gaming search keywords [1].

[0] https://github.com/nathancahill/fuck-w3schools

[1] http://www.bigresource.com/

1 comment

I don't have anything public, but I have been exploring strategies for gluing together different tech in order to accomplish our goals. Latest stack has been:

- wget / wpull / Heritrix to produce .warc files across a single domain
- a file watcher on a folder to process each .warc into text and then push it into Elasticsearch with relevant metadata
- Flask search frontend for querying / results
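
For anyone curious what the middle step could look like, here is a rough sketch rather than my actual code. It assumes each .warc response has already been reduced to a (url, html) pair (reading the .warc itself would need a WARC library, which is not shown). It strips the HTML down to text and builds a newline-delimited JSON body in the shape Elasticsearch's bulk endpoint expects. The names `html_to_text` and `bulk_payload`, the `docs` index, and the `url` / `text` fields are all made up for illustration.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def bulk_payload(pages, index="docs"):
    """Turn (url, html) pairs into an NDJSON body for the _bulk endpoint.

    Each document becomes two lines: an action line naming the index and id,
    then the document source. The index name and fields are hypothetical.
    """
    lines = []
    for url, html in pages:
        lines.append(json.dumps({"index": {"_index": index, "_id": url}}))
        lines.append(json.dumps({"url": url, "text": html_to_text(html)}))
    # The bulk format requires every line, including the last, to end in a newline.
    return "".join(line + "\n" for line in lines)
```

The resulting string would be POSTed to the cluster's `_bulk` endpoint with an NDJSON content type. Using the URL as the document id means re-crawling a page overwrites the old entry instead of duplicating it, which may or may not be what you want.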

Happy to share my learnings elsewhere. (I pinged you on email)