Hacker News new | ask | show | jobs
by Webster 2216 days ago
Substructure and exact match molecule searches are example of graph matching algorithms
1 comments

While true (I make a living working with them), I don't think the OP is that interested in graphs which are that small. What makes molecular graphs a bit more challenging is that there are 1 billion or so "small" graphs to search in a very large database. There are also important issues in how to represent a molecule as a molecular graph. Concepts like stereochemistry are tricky to handle.

"Search" here means subgraph isomorphism search.

No one that I know of uses a graph isomorphism test to compare two structures. Instead, they convert the molecular graph into a string, where the atoms and bonds are in a canonical order - perhaps a SMILES string or InChI string - and use a string search. Graph canonicalization is therefore important for molecular search, but not for the "complicated graphs, say social networks" the OP mentioned.

One specialized topic I worked on was the maximum common substructure problem. Given two molecules, what is the largest subgraph. Now extend that to N molecules. Now, find the largest subgraph which is in at least P% of those molecules.

That sounds interesting. Have you published anything?
On 2D molecular similarity search, yes. https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0... . The general approach reduces a 2D molecular graph into a 1D bitstring "fingerprint" which is representative of the graph, and such that bit-wise similarity (Jaccard-Tanimoto) of the fingerprint gives a useful proxy estimation of molecular similarity.

For the rest? No. I'm primarily a software developer, not paper author.

There's my co-authored paper at https://pubs.acs.org/doi/10.1021/acs.jcim.8b00173 , which depends a lot on (sub)graph canonicalization.

The maximum common substructure algorithm I developed, for a set of compounds, is part of the RDKit, "fmcs". The original pure-Python implementation is at https://bitbucket.org/dalke/fmcs/src . It's been rewritten since then in C++ for the RDKit.

The Python code is easier to understand.