Generative Models for Chemical Structures

David White & Richard Wilson

Chemoinformatics is a relatively new discipline that studies the application of computational methods to chemistry. Its scope is wide, ranging from databases of molecules that can be searched by molecular structure to simulations of molecular interactions that help predict how two molecules will react. From the beginning, medicinal chemists have used tools from this domain to design more effective drugs and to better understand their properties. Together, a better understanding of biochemistry and advances in chemoinformatics have moved drug discovery away from refinements of natural products and serendipitous discoveries and towards rational drug design.

In this work we apply recently developed pattern recognition techniques to construct a generative model for chemical structure. This approach can be viewed as ligand-based de novo design. We construct a statistical model describing the structural variations present in a set of molecules; sampling from this model generates new, structurally similar examples. To avoid generating chemically invalid molecules, according to our implicit hydrogen model, we project each sample onto the nearest chemically valid molecule. Furthermore, by populating the input set with molecules that are active against a target, we show how new molecules may be generated that are likely to be active against the same target.
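
The fit, sample, and project steps described above can be illustrated with a toy sketch. It is not the method of the paper: it assumes every molecule has the same atoms in the same order, represents each molecule as a symmetric bond-order matrix, models variation with a simple Gaussian over principal modes, and replaces the nearest-valid-molecule projection with a greedy valence repair. The function names (fit_model, sample, project_to_valid), the MAX_VALENCE table, and the training molecules are all hypothetical.

```python
import numpy as np

# Assumed maximum valences for heavy atoms under an implicit-hydrogen model.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}

def bond_matrix(n_atoms, bonds):
    """Build a symmetric bond-order matrix from (i, j, order) triples."""
    B = np.zeros((n_atoms, n_atoms), dtype=int)
    for i, j, order in bonds:
        B[i, j] = B[j, i] = order
    return B

def fit_model(bond_matrices):
    """Fit a Gaussian over the vectorised upper triangles of the matrices."""
    n = bond_matrices[0].shape[0]
    iu = np.triu_indices(n, k=1)
    X = np.array([m[iu] for m in bond_matrices], dtype=float)
    mean = X.mean(axis=0)
    # Principal modes of structural variation from the centred data.
    _, s, modes = np.linalg.svd(X - mean, full_matrices=False)
    std = s / np.sqrt(max(len(X) - 1, 1))
    return mean, modes, std

def sample(model, n_atoms, rng):
    """Draw one real-valued, symmetric bond-order matrix from the model."""
    mean, modes, std = model
    coeffs = rng.normal(size=std.shape) * std
    x = mean + coeffs @ modes
    B = np.zeros((n_atoms, n_atoms))
    B[np.triu_indices(n_atoms, k=1)] = x
    return B + B.T

def project_to_valid(B, elements):
    """Greedy repair (an assumption, not a true nearest-molecule projection):
    round bond orders, then lower bonds until no atom exceeds its valence."""
    B = np.clip(np.rint(B), 0, 3).astype(int)
    limits = np.array([MAX_VALENCE[e] for e in elements])
    while True:
        excess = B.sum(axis=1) - limits
        i = int(np.argmax(excess))
        if excess[i] <= 0:
            break
        j = int(np.argmax(B[i]))  # highest-order bond at the over-valent atom
        B[i, j] -= 1
        B[j, i] -= 1
    return B

# Hypothetical training set: three four-atom skeletons with shared atom order.
elements = ["C", "C", "O", "N"]
training = [
    bond_matrix(4, [(0, 1, 1), (1, 2, 1), (1, 3, 1)]),
    bond_matrix(4, [(0, 1, 2), (1, 2, 1), (0, 3, 1)]),
    bond_matrix(4, [(0, 1, 1), (1, 2, 2), (0, 3, 1)]),
]
rng = np.random.default_rng(0)
model = fit_model(training)
molecule = project_to_valid(sample(model, 4, rng), elements)
```

The repair step only enforces valence limits; it does not check connectivity or any other chemical constraint, so the result is valid only in the narrow sense assumed here.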

This work was featured on the cover of the July 2010 edition of the Journal of Chemical Information and Modeling.