> A reminder that BigQuery (as used in the query in this link) is the best way to play with Hacker News data; don't scrape HN data manually!
The `bigquery-public-data.hacker_news.full` table appears to be up to date with the most recent HN data as well (table last updated today).
However, I'm not 100% sure the query is correct for exhaustively getting all links, since running it on the full dataset returns the same results as running it on 2006-2015 only. And I value my sanity too much to fuss around with the regex.
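One cheap way to check whether the full table really contributes links past 2015, without touching the regex, would be a per-year count of stories that have a URL. This is only a sketch and not the query from the link: the column names (`timestamp`, `type`, `url`) are my assumption about the table's schema, and the helper names are made up for illustration.

```python
TABLE = "bigquery-public-data.hacker_news.full"


def build_yearly_link_count_query(table: str = TABLE) -> str:
    """Build SQL that counts stories with a non-empty URL, grouped by year.

    Assumes the table has `timestamp`, `type`, and `url` columns.
    """
    return f"""
SELECT
  EXTRACT(YEAR FROM timestamp) AS year,
  COUNT(*) AS stories_with_url
FROM `{table}`
WHERE type = 'story' AND url IS NOT NULL AND url != ''
GROUP BY year
ORDER BY year
""".strip()


def run_query(sql: str):
    """Run the query with the google-cloud-bigquery client, if installed."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up default credentials and project
    rows = client.query(sql).result()
    return [(row["year"], row["stories_with_url"]) for row in rows]


if __name__ == "__main__":
    # Print the SQL so it can be pasted into the BigQuery console instead.
    print(build_yearly_link_count_query())
```

If the counts stop at 2015, the problem is in the data or the date filter; if they continue past 2015, the original query's filtering is the more likely suspect.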
From there we ran them through an embedding model and indexed the embeddings in Pinecone.
The actual similarity search is done with Pinecone (https://www.pinecone.io).
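For anyone wondering what a similarity search over embeddings boils down to, here is a brute-force sketch in plain Python. To be clear, this is not Pinecone's API and not necessarily the metric used here: it assumes cosine similarity, the toy vectors and ids are invented, and a hosted index exists precisely so you don't have to scan every vector like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=3):
    """Return the k (id, score) pairs most similar to the query.

    `index` is a plain dict mapping an item id to its embedding vector.
    """
    scored = [(item_id, cosine_similarity(query, vec))
              for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical 2-dimensional "embeddings"; real ones have hundreds of dims.
    toy_index = {"story-a": [1.0, 0.0],
                 "story-b": [0.0, 1.0],
                 "story-c": [1.0, 1.0]}
    for item_id, score in top_k([1.0, 0.1], toy_index, k=2):
        print(item_id, round(score, 3))
```

The linear scan is fine for a few thousand vectors; past that, an approximate nearest-neighbor index is what keeps lookups fast.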