by nathancahill
3903 days ago
Do you have any links to share? I'm working on a side project for vertical search for programmers. Curating sites to crawl with source code, docs, mailing lists, QA, IRC and tutorials. Trying to get away from the "W3Schools effect" [0], where outdated, terribly presented information or downright spammy pages are locked in the top results of Google by virtue of being around for so long, or by gaming search keywords [1].

[0] https://github.com/nathancahill/fuck-w3schools
[1] http://www.bigresource.com/
|
- wget / wpull / Heritrix to produce .warcs across a single domain
- have a filewatcher on a folder to process each .warc into text and then push it into elasticsearch with relevant metadata
- flask search frontend for querying / results
Happy to share my learnings elsewhere. (I pinged you on email)
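
For the middle step (turning crawled HTML into text plus metadata before indexing), here is a rough stdlib-only sketch. It is not my actual code: `TextExtractor` and `build_doc` are names made up for illustration, the document fields (`url`, `title`, `text`) are an assumed schema, and reading records out of the .warc and pushing the dict into elasticsearch are left out entirely.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and the <title>, skipping <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []        # body text fragments
        self.title_parts = []  # text found inside <title>
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style contents
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.parts.append(data)


def build_doc(url, html):
    """Return a dict ready to be indexed (hypothetical schema)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the stored text is compact.
    text = " ".join(" ".join(parser.parts).split())
    title = " ".join(" ".join(parser.title_parts).split())
    return {"url": url, "title": title, "text": text}
```

In a real pipeline the filewatcher would call something like `build_doc` once per HTML response record and hand the result to the indexer. A proper HTML library would probably cope better with broken markup than the stdlib parser does.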