
You bet!

There are three major phases to the document matching: Corpus Management, Frequency Modeling, and Matching & Clustering.

1. Corpus Management

- First we pull the raw articles.

- Then we build a bag-of-words representation of each article, based on our overall dictionary, so we end up with a bunch of numbers to represent all the words and their frequencies for each document.

- We store all the stories for that day as bag-of-words vectors, in Matrix Market format.

- The day's .mm file is re-uploaded to an S3 bucket every few minutes, along with the dictionary. This way I can easily compose a corpus of documents for a single day, multiple days, or all time.

- I use the Gensim library for Python to do most of the above.

2. Frequency Modeling

- As stories come in throughout the day, I periodically refresh a TF-IDF model for the entire corpus of stories.

- TF-IDF just allows us to see which terms occur frequently in a story, but relatively infrequently across all stories. So a word like "Hacker" would be relevant if it appears frequently, but "the" would not since it occurs across all documents.

- TF-IDF modeling also uses the Gensim library.

- The TF-IDF model is stored in another S3 bucket, to be pulled on update by the Matching & Clustering job.
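
To make the weighting idea concrete, here is a hand-rolled version in plain Python. This is not Gensim's code: it assumes the simplest variant (raw count times log2 of N over document frequency), and Gensim's TfidfModel has its own defaults, so exact numbers will differ. The toy documents are invented.

```python
import math
from collections import Counter

# Toy tokenized stories (hypothetical data).
docs = [
    ["the", "hacker", "news", "hacker", "story"],
    ["the", "sports", "story"],
    ["the", "election", "news"],
]


def tfidf_weights(docs):
    """Return one {term: weight} dict per document: count * log2(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append(
            {term: count * math.log2(n_docs / df[term]) for term, count in counts.items()}
        )
    return weighted


weights = tfidf_weights(docs)
```

Here "the" appears in every document, so its weight is zero everywhere, while "hacker" appears twice in the first story and nowhere else, so it gets the largest weight in that story.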

3. Matching & Clustering

- First we run a limited corpus (usually the last 10k or so stories) through the TF-IDF model, and keep the result as a sparse matrix.

- This gives us the ability to quickly determine the per-word importance of a document, and to represent that as a vector.

- Next we do a simple cosine similarity of the 10k documents against themselves. This measures the angle between each document vector and every other document vector. In other words, it's a measure of similarity: the smaller the angle, the more alike two stories are.

- We limit all of this to 10k documents, because it would be computationally prohibitive to compare ALL stories to ALL OTHER stories. Since most stories about the same event are published relatively close together in time, we only have to compare recent stories. 10k seems to produce good results. We can further shard this data set by news category if we need to (e.g., compare only sports stories or political stories).

- Next we use SciPy to create a Ward linkage matrix. This is the agglomerative clustering step, and it produces a cluster tree that puts similar stories together.

- Next, also using SciPy, we use fcluster to turn that tree into flat clusters.

- That is where we slice apart the clustered data set at a certain height in the tree. Kind of like cutting through a head of broccoli above the stalk. The clusters we're left with are the most similar articles of that batch of 10k.
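
A rough sketch of this phase on toy data, using NumPy and SciPy. Several things here are assumptions rather than the actual setup: the small dense `tfidf` array stands in for the sparse 10k-story matrix, the Ward linkage is fed L2-normalized vectors (for unit vectors, Euclidean distance and cosine similarity rank pairs the same way), and the cut height of 0.5 is an arbitrary value picked for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Stand-in for TF-IDF vectors: 5 stories over a 4-term vocabulary (made up).
tfidf = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.1, 0.6, 0.4],
    [0.1, 0.9, 0.0, 0.1],
])

# L2-normalize each row so that dot products are cosine similarities.
unit = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Every story against every other story; entry [i, j] is cos(angle).
cosine_sim = unit @ unit.T

# Agglomerative clustering: Ward linkage builds the cluster tree.
# Assumption: Ward runs on the normalized vectors (Euclidean distance).
tree = linkage(unit, method="ward")

# Cut the tree at a fixed height to get flat cluster labels.
# The threshold is arbitrary here; a real one would need tuning.
labels = fcluster(tree, t=0.5, criterion="distance")
```

With this data, stories 0 and 1 land in one cluster, stories 2 and 3 in another, and story 4 is left on its own. Raising the cut height merges more stories together; lowering it splits them apart.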

After all that is done, we just store associations between articles that were determined to be similar to one another. Since we're constantly running this process, the 10k story window keeps sliding forward, so a story can end up linked to a far older story that was never in the same window, provided there is another story in between that both are similar to. For example, if story 12,000 is similar to story 9,000, and story 9,000 is similar to story 10, you'll end up storing that story 10 is similar to story 12,000, even though you didn't directly compare them.
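
One way to picture that chaining is a union-find over the stored pairs. This is only an illustration of the transitive effect, not a claim about how the associations are actually stored; the function name and the sample pairs are hypothetical.

```python
def group_similar(pairs):
    """Merge pairwise 'similar' links into groups using union-find."""
    parent = {}

    def find(story):
        # Register unseen stories as their own root, then walk up to the root.
        parent.setdefault(story, story)
        while parent[story] != story:
            parent[story] = parent[parent[story]]  # path halving
            story = parent[story]
        return story

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect every story under its root.
    groups = {}
    for story in list(parent):
        groups.setdefault(find(story), set()).add(story)
    return list(groups.values())


# Pairs found in different runs of the sliding window (made-up story ids).
stored_pairs = [(12000, 9000), (9000, 10), (500, 501)]
groups = group_similar(stored_pairs)
```

Stories 10 and 12,000 end up in the same group through story 9,000, even though that pair never appears in the input.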