In general it scaled pretty well if you avoided loading tens of thousands of edges in a single call. A similar system was used on an app that tried to estimate connection strength between people, using sent emails as the signal. At its peak the node table had tens of millions of rows, with some of the nodes (users) having thousands of edges each. The main pitfalls are:

* Loading too many edges (10K+) and their associated nodes will be slow.
* Traversing nodes in a meaningful way can be difficult.

To solve these, the schema has the following indices:

* On the edge table: (`from_node_id`, `type`, `updated`)
* On the node_data table: (`type`, `data`(128))

Since edges rarely change, the first index lets you paginate over a node's edges using `updated` as the sort order. As long as you request a reasonable number of edges per page, things should work OK. The second index is needed to look up a node given some data, but a secondary use is sorting: by precomputing some score and saving it in node_data, you can traverse nodes in that order (this is not currently built but is simple to do in SQL).
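For illustration, here is a rough sketch of what paginating over one node's edges could look like. It is not the actual implementation: it uses Python's `sqlite3` in place of MySQL so it runs anywhere, and the table layout, the `id` and `to_node_id` columns, the index name, and the helper names `make_demo_db` and `fetch_edge_page` are all assumptions made up for the example. Only the (`from_node_id`, `type`, `updated`) index comes from the schema described above.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory edge table (hypothetical layout, not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE edge (
            id INTEGER PRIMARY KEY,
            from_node_id INTEGER NOT NULL,
            to_node_id INTEGER NOT NULL,
            type TEXT NOT NULL,
            updated INTEGER NOT NULL
        )
        """
    )
    # The index described in the comment: (from_node_id, type, updated).
    conn.execute(
        "CREATE INDEX edge_from_type_updated "
        "ON edge (from_node_id, type, updated)"
    )
    return conn


def fetch_edge_page(conn, from_node_id, edge_type, page_size=100, cursor=None):
    """Return (rows, next_cursor) for one page of edges, newest first.

    `cursor` is the (updated, id) pair of the last row of the previous page,
    or None for the first page. `id` is only a tiebreaker for edges that
    share the same `updated` value.
    """
    where = "from_node_id = ? AND type = ?"
    params = [from_node_id, edge_type]
    if cursor is not None:
        last_updated, last_id = cursor
        # Seek past the last row seen instead of using OFFSET.
        where += " AND (updated < ? OR (updated = ? AND id < ?))"
        params += [last_updated, last_updated, last_id]
    params.append(page_size)
    rows = conn.execute(
        "SELECT id, to_node_id, updated FROM edge "
        f"WHERE {where} "
        "ORDER BY updated DESC, id DESC LIMIT ?",
        params,
    ).fetchall()
    # A short page means there is nothing left to fetch.
    if len(rows) == page_size:
        next_cursor = (rows[-1][2], rows[-1][0])
    else:
        next_cursor = None
    return rows, next_cursor
```

A caller would keep passing the returned cursor back in until it comes back as `None`. Seeking on (`updated`, `id`) rather than using `OFFSET` keeps each page's cost roughly proportional to the page size, which is the point of keeping requests to a reasonable number of edges.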

All this being said, the schema is pretty index-heavy, so if MySQL is forced to kick some of those indices out of memory it may lead to a bad time.

Thanks for the kind words on the model. I never thought about it being separate, but it makes 100% sense.