Hacker News
by emrgx 3429 days ago
Alchemy is doing all the NLP. Each article is processed to extract concepts and entities (as defined by Alchemy in their documentation). I normalize each extracted term in order to prevent duplicates (some duplicates still sneak through, so it still requires a little bit of data maintenance). So the way this looks is that there is one node per term, say "Machine Learning." In one article "Machine Learning" is a concept with a negative sentiment and high relevance, and in another article it is an entity with low relevance but positive sentiment. The relationships house the sentiment and relevance properties: (machine_learning)-[relevance,sentiment]-(article).
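
To make the shape of that model concrete, here is a minimal sketch in plain Python rather than Neo4j. Everything in it is an assumption for illustration: the normalize_term rule, the add_mention helper, the article ids and the scores are hypothetical, and the real normalization may well differ.

```python
import re
from collections import defaultdict


def normalize_term(term: str) -> str:
    """Map spelling variants of a term to one node key.

    Assumed rule (not the author's): lowercase, then collapse every run
    of non-alphanumeric characters into a single underscore.
    """
    cleaned = re.sub(r"[^a-z0-9]+", "_", term.lower())
    return cleaned.strip("_")


# One entry per term node; each value is the list of relationships
# from that term to the articles that mention it.
graph = defaultdict(list)


def add_mention(term, article_id, kind, relevance, sentiment):
    """Attach an article to a term node; the properties live on the edge."""
    graph[normalize_term(term)].append(
        {
            "article": article_id,
            "kind": kind,  # "concept" or "entity"
            "relevance": relevance,
            "sentiment": sentiment,
        }
    )


# Hypothetical values mirroring the example in the text.
add_mention("Machine Learning", "article_1", "concept", 0.92, -0.40)
add_mention("machine-learning ", "article_2", "entity", 0.15, 0.55)

for edge in graph["machine_learning"]:
    print(edge)
```

Both mentions land on the same machine_learning key, while the concept/entity distinction, relevance and sentiment stay on the individual relationships.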

The suggested readings section pulls the most relevant concept of that article and finds connected articles that share the same concept at a high relevance. This way suggested articles are more than just keyword hits. It's all about relevance. I'm still tweaking this query and there's a lot more that can be done with it, such as matching sentiment and emotion. As the dataset grows I'll look to add a feature that pulls a list of articles based on a cluster of highly associated entities.
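
The suggested-readings lookup described above could be sketched roughly like this in plain Python. It is not the actual query: the EDGES data, the suggest function, the 0.7 relevance cutoff and the result limit are all assumptions made up for the example.

```python
# Hypothetical (term, article_id, relevance) relationships.
EDGES = [
    ("machine_learning", "a1", 0.95),
    ("neo4j", "a1", 0.40),
    ("machine_learning", "a2", 0.90),
    ("machine_learning", "a3", 0.20),
    ("neo4j", "a3", 0.85),
    ("machine_learning", "a4", 0.75),
]


def suggest(article_id, edges, min_relevance=0.7, limit=5):
    """Suggest articles sharing this article's most relevant term.

    min_relevance and limit are assumed tuning knobs, not values
    taken from the original post.
    """
    # Step 1: find the term with the highest relevance for this article.
    own = [(rel, term) for term, art, rel in edges if art == article_id]
    if not own:
        return []
    _, top_term = max(own)

    # Step 2: other articles linked to that term at high relevance.
    candidates = [
        (rel, art)
        for term, art, rel in edges
        if term == top_term and art != article_id and rel >= min_relevance
    ]

    # Step 3: most relevant first.
    candidates.sort(reverse=True)
    return [art for _, art in candidates[:limit]]


print(suggest("a1", EDGES))
```

Article a3 also mentions machine_learning, but only at low relevance, so it is filtered out for a1. That is the difference from a plain keyword match.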

As for Alchemy, I've tried a number of different NLP APIs and, in my opinion, none of them have come close to matching Alchemy's accuracy. It does make mistakes but at a low enough level that it's easy to manually correct.

1 comments

Thanks for the background. I'm working on a similar project, but I'm currently parsing news articles from a collection of specific RSS feeds and calling Google's NLP API with the text. It sounds like AlchemyAPI may be a better fit in this case.

How are you finding Neo4j handles the scale of reading and writing all these stories? I've had a positive experience so far, but I'm only in the range of a few thousand.

I have found that Neo4j handles reads and writes seamlessly, but I'm only at around 10,000 nodes and 20,000+ edges. I've heard of use cases for Neo4j in the range of 50M+ nodes. My position is that the question is not whether Neo4j can handle it but whether your code and infrastructure can.