by sheraz 3905 days ago
I don't have anything public, but I have been exploring strategies for gluing together different tech in order to accomplish our goals. Latest stack has been:

- wget / wpull / Heritrix to produce .warc files across a single domain
- a file watcher on a folder that processes each .warc into text and pushes it into Elasticsearch with relevant metadata
- a Flask search frontend for querying / results
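
For anyone curious what the middle step might look like, here is a rough sketch, not our actual code. It assumes the third-party `warcio` package for reading WARC records and an Elasticsearch node at `http://localhost:9200`. The index name `pages`, the field names, and all function names are made up for illustration. The watcher is plain polling rather than a real filesystem-event library.

```python
import json
import time
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Reduce an HTML page to whitespace-joined visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def warc_to_docs(path):
    """Yield one dict per HTML response record in a WARC file."""
    # Assumption: warcio is installed (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "html" not in content_type:
                continue
            body = record.content_stream().read().decode("utf-8", errors="replace")
            yield {  # hypothetical metadata fields
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "fetched": record.rec_headers.get_header("WARC-Date"),
                "source_warc": Path(path).name,
                "text": html_to_text(body),
            }


def bulk_body(docs, index="pages"):
    """Build the newline-delimited JSON payload for the _bulk endpoint."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n" if lines else ""


def push(docs, es_url="http://localhost:9200"):
    """POST documents to Elasticsearch; es_url is an assumed local node."""
    body = bulk_body(docs)
    if not body:
        return
    request = urllib.request.Request(
        es_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()


def new_warcs(folder, seen):
    """Return .warc / .warc.gz files not seen before, and remember them."""
    found = sorted(p for p in Path(folder).glob("*.warc*") if p not in seen)
    seen.update(found)
    return found


def watch(folder, interval=30):
    """Poll the folder forever, indexing each new WARC file once."""
    seen = set()
    while True:
        for path in new_warcs(folder, seen):
            push(list(warc_to_docs(path)))
        time.sleep(interval)
```

Calling `watch("warcs")` would start the loop. One caveat with polling like this: it can pick up a file the crawler is still writing, so in practice you would want the crawler to write somewhere else and move finished files into the watched folder.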

Happy to share my learnings elsewhere. (I pinged you on email)