Preclinical drug discovery and development generate far more data than can be evaluated manually. Chemical structure and biological activity data from high-throughput screening in drug discovery, the multitude of biological and toxicological assay data produced during drug development, and analytical data from quality control and in vivo testing all require computational techniques to turn the data into information. For instance, when evaluating a series of activity data, it is valuable to relate the activities to structural changes in the respective molecules and thereby derive so-called structure-activity relationships. Machine learning provides the tools needed to extract this information from the data. The prerequisite is that the chemical structures can be organized and processed efficiently. If this is the case, patterns in how structural changes affect bioactivity can be detected, and this information can then be used to design novel molecules with improved properties. Chemoinformatics is the scientific discipline that deals with the efficient handling and evaluation of data in the chemical data space. It focuses on preclinical drug discovery and drug development and is our main research area.
Encoding chemical structures. Molecules cannot be fed into machine learning tools without first being encoded. To develop mathematical models that relate chemical structures to biological activities, the chemical structure must be transformed into a numerical description of the molecule. Graph theory and geometry, among other mathematical disciplines, provide techniques for encoding molecules. The resulting numerical representation is called a molecular descriptor. Molecular descriptors can be used for virtual high-throughput screening, for visualizing chemical libraries, for analyzing quantitative structure-activity relationships, and for predicting a molecule’s target structure.
Figure: Ligand-based and structure-based pharmacophore models of the kinase CDK2 (Source: Dissertation F. Kölling)
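The encoding step described in this section can be illustrated with a small graph-theoretical example. The sketch below is not one of the descriptors used in our work; it assumes a hydrogen-suppressed molecular graph given as a list of element symbols plus a list of bonds, and it computes a short descriptor vector that includes the Wiener index (the sum of shortest-path distances between all atom pairs), a classical topological descriptor. The function names `build_adjacency`, `path_lengths_from`, `wiener_index`, and `describe` are hypothetical and chosen for illustration only.

```python
from collections import deque


def build_adjacency(n_atoms, bonds):
    """Adjacency list of the hydrogen-suppressed molecular graph."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def path_lengths_from(adjacency, start):
    """Breadth-first search: number of bonds from `start` to each reachable atom."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return distance


def wiener_index(adjacency):
    """Sum of shortest-path distances over all atom pairs (assumes a connected graph)."""
    total = sum(sum(path_lengths_from(adjacency, a).values()) for a in adjacency)
    return total // 2  # every pair was counted twice


def describe(elements, bonds):
    """Illustrative descriptor vector:
    [heavy atoms, heteroatoms, bonds, maximum atom degree, Wiener index]."""
    adjacency = build_adjacency(len(elements), bonds)
    return [
        len(elements),
        sum(1 for element in elements if element != "C"),
        len(bonds),
        max(len(neighbors) for neighbors in adjacency.values()),
        wiener_index(adjacency),
    ]


if __name__ == "__main__":
    # Two isomers with identical atom counts but different topology.
    print(describe(["C"] * 4, [(0, 1), (1, 2), (2, 3)]))  # n-butane:  [4, 0, 3, 2, 10]
    print(describe(["C"] * 4, [(0, 1), (0, 2), (0, 3)]))  # isobutane: [4, 0, 3, 3, 9]
```

The two isomers share the same composition, yet their descriptor vectors differ in the topology-dependent entries. This is the property that lets a model relate structural changes to changes in activity.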
Development and validation of chemoinformatic models. Most machine learning techniques can be used to develop chemoinformatic models. The choice of molecular descriptor is of utmost importance for the success of model building: if the numerical description of the molecule is unsuitable for the purpose, good results are unlikely. Since molecular descriptors are mostly complex, high-dimensional descriptions of molecules, data analysis is prone to chance correlation and overfitting. Rigorous validation and assessment of the resulting models are therefore essential to exclude seemingly good models that would perform badly on future molecules.
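The risk of chance correlation can be made concrete with a toy experiment. The following sketch is an assumption-laden illustration, not our validation protocol. It selects the single best descriptor out of many and fits a straight line to it. It then compares the apparent fit on the training data with a leave-one-out estimate in which the descriptor selection is repeated inside every fold. All function names are hypothetical, and the demo data are pure random noise.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def select_and_fit(X, y, rows):
    """Pick the descriptor column with the smallest squared error on `rows`."""
    best = None
    y_train = [y[i] for i in rows]
    for j in range(len(X[0])):
        x_train = [X[i][j] for i in rows]
        slope, intercept = fit_line(x_train, y_train)
        sse = sum((slope * xi + intercept - yi) ** 2
                  for xi, yi in zip(x_train, y_train))
        if best is None or sse < best[0]:
            best = (sse, j, slope, intercept)
    return best[1:]


def explained_variance(y, predictions):
    """1 - SS_res / SS_tot, relative to the mean of all activities."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def apparent_r2(X, y):
    """Fit quality on the same molecules that were used for selection and fitting."""
    rows = range(len(y))
    j, slope, intercept = select_and_fit(X, y, rows)
    return explained_variance(y, [slope * X[i][j] + intercept for i in rows])


def loo_q2(X, y):
    """Leave-one-out estimate; the descriptor is re-selected in every fold."""
    predictions = []
    for held_out in range(len(y)):
        rows = [i for i in range(len(y)) if i != held_out]
        j, slope, intercept = select_and_fit(X, y, rows)
        predictions.append(slope * X[held_out][j] + intercept)
    return explained_variance(y, predictions)


if __name__ == "__main__":
    random.seed(0)
    n_molecules, n_descriptors = 15, 200  # few molecules, many descriptors
    X = [[random.gauss(0, 1) for _ in range(n_descriptors)]
         for _ in range(n_molecules)]
    y = [random.gauss(0, 1) for _ in range(n_molecules)]  # activity is pure noise
    print("apparent r2:", round(apparent_r2(X, y), 2))
    print("leave-one-out q2:", round(loo_q2(X, y), 2))
```

On noise data like this, the apparent fit is typically much better than the leave-one-out estimate, because with many descriptors and few molecules some column will correlate with the activity by chance. Keeping the selection step inside the validation loop is the design choice that exposes this.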
Evaluation of analytical and bioanalytical data. (Bio)analytical data often need tailored preprocessing and modelling techniques. Vibrational spectroscopy data (IR, NIR, Raman spectroscopy) often contain spikes, scattered light, or baseline drift. These interfering signals must be removed before meaningful information can be extracted from the spectra. Moreover, dimension reduction and the selection of informative wavelength regions are important tools for improving data evaluation. The reproducible and objective evaluation of data from modern microscopy techniques also challenges the data analyst owing to the sheer amount of data. For the detection of exocytosis, for instance, many different data processing steps are necessary to solve this seemingly simple pattern recognition task.
Figure: Intensity profiles of fluorescently labeled insulin granules. The simultaneous tracking of all intensity profiles is needed for the detection of insulin granule exocytosis. (Source: Dissertation M. Matz)
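The removal of spikes and baseline drift from spectra, mentioned in this section, can be sketched with two deliberately simple steps. These are stand-ins chosen for brevity, not the preprocessing methods used in our work. The first is a median-based spike filter. The second is a linear baseline drawn through the two ends of the spectrum, which assumes that the first and last few points contain no peaks. The function names and the parameters `half_window`, `threshold`, and `edge_points` are hypothetical.

```python
from statistics import mean, median


def despike(intensities, half_window=2, threshold=1.0):
    """Replace points that deviate from the local median by more than `threshold`."""
    cleaned = list(intensities)
    for i, value in enumerate(intensities):
        window = intensities[max(0, i - half_window): i + half_window + 1]
        local_median = median(window)
        if abs(value - local_median) > threshold:
            cleaned[i] = local_median
    return cleaned


def subtract_linear_baseline(intensities, edge_points=5):
    """Subtract the straight line through the mean of the first and last
    `edge_points` values (assumes these edge regions are peak-free)."""
    n = len(intensities)
    left_x = (edge_points - 1) / 2
    right_x = (n - 1) - (edge_points - 1) / 2
    left_y = mean(intensities[:edge_points])
    right_y = mean(intensities[-edge_points:])
    slope = (right_y - left_y) / (right_x - left_x)
    intercept = left_y - slope * left_x
    return [y - (intercept + slope * i) for i, y in enumerate(intensities)]


if __name__ == "__main__":
    # Synthetic spectrum: drifting baseline + one triangular band + one spike.
    spectrum = [0.5 + 0.01 * i + max(0.0, 1.0 - abs(i - 25) / 5)
                for i in range(50)]
    spectrum[10] += 5.0  # single-point spike
    corrected = subtract_linear_baseline(despike(spectrum))
    print(round(corrected[10], 2), round(corrected[25], 2))
```

The order matters in this sketch: the spike is removed first so that it cannot distort the baseline estimate. The threshold has to be larger than the deviation of a genuine band from its local median, otherwise real signal would be clipped.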