by dekhn
720 days ago
Interesting! That's analogous to some recent work in chemistry. There has long been a string representation for molecules called SMILES, which is fairly terse and, when canonicalized, gives each molecule (2D topology) a unique string. However, not all strings are valid SMILES. Recent work using autoencoders to embed SMILES strings as vectors produces models that often generate invalid SMILES strings (the popular paper about this glosses over this fact). For example, if your training set includes both bromine (represented as Br) and chlorine (represented as Cl) and you decode random vectors, the output might contain Bl, which is an as-yet unmade element. This is not desirable (although opinions vary). As a result, the group that published the earlier work developed a new compact representation known as SELFIES (https://github.com/aspuru-guzik-group/selfies), where it's effectively impossible to generate invalid decodings (every SELFIES string decodes to a valid molecule). I'm not sure what the terminology is for these sorts of features of encodings.
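To make the Br/Cl point concrete, here is a toy sketch in Python. It is not real SMILES or SELFIES: the five-atom alphabet and the helper names (`is_valid`, `sample_chars`, `sample_tokens`) are assumptions made up for illustration. It only shows why generating character by character can yield "Bl", while generating from whole-atom tokens cannot.

```python
import random
import re

# Toy alphabet: an assumption for illustration, not the real SMILES grammar.
VALID_ATOMS = {"C", "N", "O", "Br", "Cl"}

def atom_symbols(s):
    """Split into element-like symbols: an uppercase letter plus an optional lowercase one."""
    return re.findall(r"[A-Z][a-z]?", s)

def is_valid(s):
    """True if s is non-empty, made only of symbols, and every symbol is a known atom."""
    symbols = atom_symbols(s)
    return bool(symbols) and "".join(symbols) == s and all(a in VALID_ATOMS for a in symbols)

def sample_chars(rng, length):
    """Character-level generation: picks single characters seen in the training atoms."""
    chars = sorted(set("".join(VALID_ATOMS)))  # B, C, N, O, l, r
    return "".join(rng.choice(chars) for _ in range(length))

def sample_tokens(rng, length):
    """Token-level generation: every token is a whole atom, so any sequence is valid."""
    tokens = sorted(VALID_ATOMS)
    return "".join(rng.choice(tokens) for _ in range(length))

if __name__ == "__main__":
    rng = random.Random(0)
    print(is_valid("CBrCl"))  # both halogens spelled correctly
    print(is_valid("CBl"))    # "Bl" mixes the B of Br with the l of Cl
    bad = sum(not is_valid(sample_chars(rng, 5)) for _ in range(1000))
    print("invalid character-level samples out of 1000:", bad)
    bad = sum(not is_valid(sample_tokens(rng, 5)) for _ in range(1000))
    print("invalid token-level samples out of 1000:", bad)
```

The token-level sampler is only loosely in the spirit of SELFIES: validity comes from the alphabet itself rather than from filtering the output afterwards. The real representation also has to handle bonds, branches and rings, which this sketch ignores.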
|
This could be similar to how mutations allow one to explore a wider range of options, although sometimes they go too far and produce a non-functional individual.
[1]: https://www.nature.com/articles/s42256-024-00821-x