|
Truly amazing work! Not only because of the final results, but also because of the whole process it took the author to bring this to life. If I could upvote this by giving points from my own karma, I wouldn't hesitate to give a hundred. Without a doubt, I would put this on par with "40k HN comments mentioning books, extracted using deep learning" (https://news.ycombinator.com/item?id=28595967), which is the highest-voted "Show HN" project related to Hacker News so far, with 1359 points.

I'm not in the ML/AI arena yet, so I couldn't fully follow the second half of the article beyond a general idea of embeddings and their potential, but the first part is what interests me as a software engineer. Below are some of the challenges the author ran into. The author overcame each of them and published the full source code.

Downloading HN database

> There's also a maxitem.json API, which gives the largest ID. As of this writing, the max item ID is over 40 million. Even with a very nice and low 10 ms mean response time, this would take over 4 days to crawl, so we need some parallelism.

> I've exported the HN crawler [1] (in TypeScript) to its own project, if you're ever in need to fetch HN items.

Fetching and parsing linked URLs' HTML for metadata and text

> For text posts and comments, the answer is simple. However, for the vast majority of link posts, this would mean crawling those pages being linked to. So I wrote up a quick Rust service [2] to fetch the URLs linked to and parse the HTML for metadata (title, picture, author, etc.) and text. This was CPU-intensive so an initial Node.js-based version was 10x slower and a Rust rewrite was worthwhile. Fortunately, other than that, it was mostly smooth and painless, likely because HN links are pretty good (responsive servers, non-pathological HTML, etc.).
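
To make the arithmetic in the download quote concrete: 40 million items at 10 ms each is 400,000 seconds, or roughly 4.6 days, when fetched one at a time. The Python sketch below is an illustration only, not the author's TypeScript crawler [1]. The names `sequential_crawl_days`, `crawl`, and `fake_fetch`, and the default concurrency of 64, are hypothetical, and `fake_fetch` simulates a request instead of touching the network.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional

# Item endpoint of the official Hacker News API (shown for reference only;
# this sketch never calls it).
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def sequential_crawl_days(max_item_id: int, mean_latency_s: float) -> float:
    """Days needed to fetch every item one request at a time."""
    return max_item_id * mean_latency_s / 86_400


async def crawl(
    ids: Iterable[int],
    fetch_item: Callable[[int], Awaitable[Optional[dict]]],
    concurrency: int = 64,  # assumed value, not taken from the article
) -> Dict[int, Optional[dict]]:
    """Fetch all ids with at most `concurrency` requests in flight."""
    id_iter = iter(ids)  # shared by all workers; each id is taken once
    results: Dict[int, Optional[dict]] = {}

    async def worker() -> None:
        # Pulling from the shared iterator keeps memory bounded even for
        # tens of millions of ids, since no per-id task is created up front.
        for item_id in id_iter:
            results[item_id] = await fetch_item(item_id)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results


async def fake_fetch(item_id: int) -> Optional[dict]:
    """Stand-in for an HTTP GET on ITEM_URL; sleeps 10 ms instead."""
    await asyncio.sleep(0.01)
    return {"id": item_id}


if __name__ == "__main__":
    print(f"sequential: {sequential_crawl_days(40_000_000, 0.010):.2f} days")
    items = asyncio.run(crawl(range(1, 201), fake_fetch, concurrency=50))
    print(f"fetched {len(items)} simulated items")
```

In the ideal case, N concurrent workers divide the wall-clock time by N, so a few dozen workers would bring the estimate from days down to hours. A real crawl would also need retries, backoff, and persistence, which this sketch leaves out.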

Recovering missing/dead links

> A lot of content even on Hacker News suffers from the well-known link rot: around 200K resulted in a 404, DNS lookup failure, or connection timeout, which is a sizable "hole" in the dataset that would be nice to mend. Fortunately, the Internet Archive has an API that we can use to programmatically fetch archived copies of these pages. So, as a final push for a more "complete" dataset, I used the Wayback API to fetch the last few thousands of articles, some dating back years, which was very annoying because IA has very, very low rate limits (around 5 per minute).

Finding a cost-effective cloud provider for GPUs

> Fortunately, I discovered RunPod, a provider of machines with GPUs that you can deploy your containers onto, at a cost far cheaper than major cloud providers. They also have more cost-effective GPUs like RTX 4090, while still running in datacenters with fast Internet connections. This made scaling up a price-accessible option to mitigate the inference time required.

This is the type of content that makes HN stand out from the crowd.

_____________________________

1. https://github.com/wilsonzlin/crawler-toolkit-hn/

2. https://github.com/wilsonzlin/hackerverse/tree/master/crawle... |