Hacker News
Building a Real-Time Recommendation Engine (neo4j.com)
108 points by levbrie 3585 days ago
3 comments

I tried Neo4j a while back for recommendations and calculating similarities between users but when running against our full dataset got too many OutOfMemory exceptions. Ended up with a Mahout / Spark solution. It's an awesome graph db though - can find many other uses for it.
Yeah, I'm surprised the Neo4j team hasn't made more of an effort on this. I've run into lots of memory issues with it as well, and although there are reliable, fairly straightforward solutions to most of these problems, the team doesn't seem to be particularly interested in making sure that the defaults are robust enough to handle a reasonable workload. When your database fails on you for making a reasonable query request on a light workload, you can't help but feel troubled. There's a lot to love about Neo4j, but they've got a lot of work to do if they want to win over the developer community as a whole. There may be enterprises that get reassured by a huge price tag and a whole bunch of salespeople at their beck and call, but I don't know any of them. Every engineer I know who is willing to pay for software is either expecting a completely new kind of product or expecting to have an awesome experience with a free version of the tool before being willing to commit even a few bucks a month.
Yeah, I've tried a couple of times to get Neo4j into stacks, but the outcome has always been that it's pretty much limited to baking relationship data, ahead of time or on demand, from data that is saved elsewhere and cleared out afterwards. Otherwise you get into prohibitively expensive licensing / infrastructure territory very quickly.

At that point a more pragmatic solution has always won.

Exactly the same as you: I was just trying out Neo4j today with a small dataset (30 MB) and was getting memory exceptions trying to add a relationship.
Would you mind sharing the query? If you're hitting OOM exceptions with a dataset of that size there may be a typo in the query that's doing some sort of traveling salesman operation.

e.g.,

//grabs literally EVERY node in your database

MATCH (Person)-[KNOWS]-(Friend)

//only the people who have a KNOWS relationship between them

MATCH (person:Person)-[:KNOWS]-(Friend:Person)
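
To make the difference concrete, here is a rough Python analogy (not how Neo4j's matcher actually works). Without the colon, `Person`, `KNOWS`, and `Friend` are just variable names, so the first pattern puts no constraint on labels or relationship types. The toy graph and the `match` helper are made up for illustration.

```python
# Toy graph: node -> label, plus typed relationships.
labels = {"alice": "Person", "bob": "Person", "acme": "Company", "nyc": "City"}
rels = [
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("acme", "LOCATED_IN", "nyc"),
]

def match(node_label=None, rel_type=None):
    """Return (a, b) pairs joined by a relationship, in either direction.

    A filter of None means "unconstrained", which is what a bare
    variable name such as (Person) or [KNOWS] amounts to in a pattern.
    """
    pairs = []
    for a, t, b in rels:
        if rel_type is not None and t != rel_type:
            continue
        for x, y in ((a, b), (b, a)):  # undirected: try both orientations
            if node_label is None or (
                labels[x] == node_label and labels[y] == node_label
            ):
                pairs.append((x, y))
    return pairs

everything = match()                   # like MATCH (Person)-[KNOWS]-(Friend)
only_knows = match("Person", "KNOWS")  # like MATCH (person:Person)-[:KNOWS]-(Friend:Person)
```

On this toy graph the unconstrained call returns every connected pair in both directions, while the constrained one returns only alice and bob.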

The solution we are moving to is to use Spark to compute similarities, etc., and load the results into a Neo4j graph.

So we use Neo4j for OLTP and Spark for the OLAP part.
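
A minimal sketch of that split, in plain Python standing in for the Spark job: a batch step scores user pairs, and the resulting rows are what would get loaded into the graph. The choice of Jaccard similarity, the sample data, and the `User` / `SIMILAR` names in the Cypher string are all assumptions for illustration, not what the commenter actually runs.

```python
from itertools import combinations

# Hypothetical input: which items each user has interacted with.
purchases = {
    "u1": {"a", "b", "c"},
    "u2": {"b", "c", "d"},
    "u3": {"x"},
}

def jaccard(s, t):
    """Jaccard similarity: |s & t| / |s | t|, defined as 0.0 for two empty sets."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def similarity_rows(user_items, min_score=0.1):
    """Batch (OLAP) step: score every user pair, keep those above a threshold."""
    rows = []
    for a, b in combinations(sorted(user_items), 2):
        score = jaccard(user_items[a], user_items[b])
        if score >= min_score:
            rows.append({"a": a, "b": b, "score": score})
    return rows

rows = similarity_rows(purchases)

# Load step (OLTP side): a parameterized Cypher statement that would take
# `rows` as its $rows parameter. The User label, id property, and SIMILAR
# relationship type are illustrative assumptions.
LOAD_SIMILARITIES = """
UNWIND $rows AS row
MATCH (a:User {id: row.a}), (b:User {id: row.b})
MERGE (a)-[s:SIMILAR]-(b)
SET s.score = row.score
"""
```

The all-pairs loop is quadratic in the number of users, which is the part that benefits from a cluster; the graph then only has to serve the precomputed `SIMILAR` edges.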

Can you though? My impression is that it doesn't scale to large data sets. The use cases for true graph databases (over shaky implementations on HBase/Cassandra) are sparse, in my opinion.
Six of one, half a dozen of the other.

It's a single-image database (no partitioning except in memory), so every server in the cluster holds the complete dataset (and thus each one must be large enough to store it). However, because Neo4j doesn't rely on joins / table scans to operate, traversals are O(1) per hop, not O(n). So there's an advantage to doing OLTP work on really, really large datasets that have a specific starting point. Neo4j will do pointer arithmetic instead of scans / joins, so that regardless of dataset size a query only touches a fixed amount of data. The reason for this strategy is that scale-up hardware pricing has come down incredibly quickly in the last decade, and having a trio of 64+++ GB memory boxes isn't out of the question for most mid-size and enterprise companies. Secondly, distributed systems are non-trivial to manage, from both a development and a devops perspective.
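
A toy Python illustration of the scan-versus-pointer point: it is only an analogy (a dict lookup stands in for following a stored pointer, and nothing here reflects Neo4j's actual storage format). Both functions find the same neighbours, but one examines every edge in the dataset and the other only the edges attached to the starting node.

```python
# Edge table, as a flat list of (source, target) rows: finding a node's
# neighbours this way means scanning the whole table.
edges = [(i, i + 1) for i in range(1000)]  # a simple chain 0 -> 1 -> ... -> 1000

def neighbours_by_scan(node):
    """Returns (neighbours, rows_examined); cost grows with total edge count."""
    examined = 0
    found = []
    for src, dst in edges:
        examined += 1
        if src == node:
            found.append(dst)
    return found, examined

# Adjacency stored with each node: following a relationship is a direct lookup.
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)

def neighbours_by_pointer(node):
    """Returns (neighbours, entries_examined); cost grows only with the node's degree."""
    found = adjacency.get(node, [])
    return found, len(found)

scan_result = neighbours_by_scan(500)
pointer_result = neighbours_by_pointer(500)
```

Growing the chain to a million edges would make the scan examine a million rows for the same question, while the adjacency lookup would still examine one.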

The philosophy of the Neo4j team is to conquer the world slowly. In order of priority Neo4j is designed around:

1.) data integrity and availability (ACID transactions, master-slave replication)

2.) rapid reads for graph traversals

3.) ability to store web-scale datasets (trillions++ of nodes)

4.) parallel operations (multi-master, map-reduce, global analytics, etc.)

The product has firmly completed 1 and 2, and is starting to work on 3 and 4 (4 mostly with a Databricks / Spark partnership).

It fights the same CAP problem that all databases do. We've chosen Consistency and Availability. Partition tolerance just isn't something inherent to graph databases. We can do some really smart math and duplicate nodes with high betweenness centrality (data nodes, not servers) or shuffle data based on access patterns to prevent introducing network latency into query plans that access nodes on multiple partitions. But doing that while maintaining 1 and 2 of the above is very not easy.

Disclaimer:

MATCH (rhino)-[:WORKS_AT]->(neo4j)

WHERE NOT rhino.opinions = neo4j.opinions

Has anybody tried OrientDB for a recommendation engine?
What a ridiculous title. Compare: "building a hardwood coffee table with woodworking"

Gotta get those keywords in for clicks

I agree completely, but I guess I also don't mind it. I also wouldn't mind "The Art of Woodworking: Tables You Can Build Yourself". I suppose it depends on your tolerance for buzzwords. I'm bombarded with them all day long, so perhaps my tolerance is growing.
To be frank, the way I submit conference talks is normally:

Understanding {buzzword pop culture / news topic} using {conference language} + {pick 3+ of [real-time, cloud, at scale, data science, docker, sentiment analysis, polyglot persistence]}

It works shamefully well.

Ok, we took "with data science" out of the title above.
It's not your fault. It's the blog's
We must cultivate our garden.