by sohei 1822 days ago
How did you gather the comment histories? Would you mind sharing a copy?
2 comments

See description at the bottom. We used the Hacker News API to pull data into BigQuery.

From there we ran the comments through an embedding model and indexed the embeddings in Pinecone.

The actual similarity search is done with Pinecone (https://www.pinecone.io).
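For anyone curious what the search step boils down to, here is a rough sketch in plain Python. It is not the Pinecone API and not necessarily how the authors did it: it swaps in a brute-force cosine-similarity scan over a small in-memory dict, which is the same idea a vector index performs at scale. The names `cosine_similarity`, `most_similar`, and `index`, and the toy three-dimensional vectors, are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, index, top_k=3):
    """Return the top_k (name, score) pairs from index, highest similarity first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy "embeddings", one per user. Real embeddings would come
# from a model and have hundreds of dimensions.
index = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.1, 0.9, 0.0],
    "carol": [0.8, 0.2, 0.1],
}

if __name__ == "__main__":
    print(most_similar([1.0, 0.0, 0.0], index, top_k=2))
```

A linear scan like this is fine for a few thousand vectors; a hosted index mostly buys you approximate search that stays fast as the collection grows.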

Using Google BigQuery is one way. This comment might be of use:

https://news.ycombinator.com/item?id=25075318

> A reminder that BigQuery (as used in the query in this link) is the best way to play with Hacker News data; don't scrape HN data manually! The `bigquery-public-data.hacker_news.full` table appears to be up to date with the most recent HN data as well (table last updated today). However, I'm not 100% sure the query is correct for unilaterally getting all links, as running the query on the full dataset returns the same results as running it from 2006-2015. And I value my sanity enough to not fuss around with the regex.
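In case it helps, a minimal sketch of pulling one user's comment history from that public table with the `google-cloud-bigquery` Python client. The column names (`id`, `by`, `text`, `timestamp`, `type`) are my assumption about the table's schema, so check them in the BigQuery console first. The helper names `build_comment_query` and `fetch_comments` are made up for this example, and running the query needs a Google Cloud project with credentials configured.

```python
def build_comment_query(limit=1000):
    """Build SQL selecting one author's comments, newest first.

    The author is passed as the @author query parameter rather than
    interpolated into the string. Column names are assumed, not verified.
    """
    return (
        "SELECT id, `by`, text, timestamp\n"
        "FROM `bigquery-public-data.hacker_news.full`\n"
        "WHERE type = 'comment' AND `by` = @author\n"
        "ORDER BY timestamp DESC\n"
        f"LIMIT {int(limit)}"
    )


def fetch_comments(author, limit=1000):
    """Run the query and return the rows as a list of plain dicts."""
    # Imported here so the query builder can be used without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default project and credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("author", "STRING", author),
        ]
    )
    rows = client.query(build_comment_query(limit), job_config=job_config).result()
    return [dict(row.items()) for row in rows]


if __name__ == "__main__":
    print(build_comment_query(limit=5))
```

Calling `fetch_comments("some_username", limit=100)` would then give you rows you can feed into whatever comes next. Selecting only the columns you need also keeps the bytes scanned (and billed) down.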