HN Mirror


by epoch_100 2343 days ago

No Google queries, and only very limited web crawling (i.e. _only_ to fill in gaps for Reddit and HN). There's a plethora of freely available web content archives, some released almost hourly, so it's possible to scan a large slice of the web without too much resource expenditure. Feel free to message me for more technical info, happy to share whatever.

Here's the Rust library I built that powers the core of my project, if it's any help: https://github.com/milesmcc/ieql

Relatedly, because I feel like my tool relies so much on open-source software and public archives, I give 10% of my service's revenue to various open source projects. Tools like these are only possible at this price point if they're built on the shoulders of giants (i.e. tons of open source software and publicly available web archives).

1 comments

ksahin 2343 days ago

Which publicly available web archives are you referring to? I know of CommonCrawl, but it's a monthly archive.


epoch_100 2343 days ago

CommonCrawl is monthly, but they also release several news archives daily (and the news archives contain a surprising amount of content). Plus, there are lots and lots of RSS feed archives floating around.
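For anyone curious what scanning one of these archives might look like, here is a rough sketch, not the actual pipeline behind the tool above. It assumes the archive is in WARC format (which is what CommonCrawl distributes) and that a plain case-insensitive substring match is good enough. The helper names `iter_warc_records`, `find_mentions`, and `make_record` are made up for illustration.

```python
import io
from typing import BinaryIO, Dict, Iterator, List, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        # Read exactly Content-Length bytes so the body is never parsed as headers.
        yield headers, stream.read(length)


def find_mentions(stream: BinaryIO, keywords: List[str]) -> List[Tuple[str, str]]:
    """Return (url, keyword) pairs for response records containing a keyword."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for headers, body in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue  # ignore request/metadata/warcinfo records
        text = body.decode("utf-8", "replace").lower()
        for keyword in lowered:
            if keyword in text:
                hits.append((headers.get("warc-target-uri", ""), keyword))
    return hits


def make_record(uri: str, payload: bytes) -> bytes:
    """Build a tiny synthetic WARC record (hypothetical test data only)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


sample = make_record("https://example.com/a", b"Rust is great") + make_record(
    "https://example.com/b", b"nothing here"
)
hits = find_mentions(io.BytesIO(sample), ["rust"])
print(hits)
```

Real archive files are normally gzip-compressed, so in practice you would probably pass something like `gzip.open(path, "rb")` as the stream instead of the in-memory sample. A dedicated WARC parsing library would likely be more robust than this hand-rolled reader.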
