Interpretable and Scalable Similarity Metrics for DNA‐Encoded Library Design Using Generative Topographic Mapping

Louis Plyer, Alexey A. Orlov, Tagir N. Akhmetshin, Erik Yeghyan, Fanny Bonachera, Dragos Horvath, Alexandre Varnek

Molecular Informatics

DOI: 10.1002/minf.70026

Abstract

The growing number and size of DNA‐encoded libraries (DELs), together with the vast space of possible DEL designs, demand interpretable and scalable criteria for selecting which libraries to construct and screen against a given target. An ideal target‐focused DEL shows both strong similarity to a reference collection of active compounds and high intra‐DEL diversity. Chemography with Generative Topographic Mapping (GTM) has been shown to be a promising approach for selecting DELs, offering both intuitive visualization and fast quantitative analysis scalable to thousands of DEL designs. This is achieved by representing each library as a “stand‐alone” vector; comparing such vectors avoids costly pairwise inter‐molecular similarity calculations. However, the extent to which such “stand‐alone” (SA) approaches in general, and GTM‐derived SA metrics in particular, recover DELs that are reference‐proximal and chemically diverse as evaluated by conventional compound pair‐matching (CP) metrics in the initial descriptor space remains insufficiently characterized. In this article, we compare Morgan count fingerprint‐based chemical‐library similarity with GTM‐derived metrics, using 100 diverse DEL subsets and a reference set of compounds tested against cyclin‐dependent kinase 2 (CDK2) from ChEMBL. GTM‐based SA metrics provide robust approximations of the “gold standard” CP metrics computed in molecular descriptor space for DEL selection: Spearman rank correlations fall in the 0.6–0.7 range. Our results demonstrate that GTM helps to identify the DELs that best span the reference space according to the same “gold standard” molecular descriptor space metrics: SA GTM‐driven rankings of libraries achieve enrichment factors at 5% (EF5%) of 4–12 (in terms of finding “gold standard” top libraries within the 5% best ranked by GTM), always picking 2 out of the top 3 libraries.
The accompanying two‐dimensional landscapes make intra‐ and interlibrary diversity visually accessible, supporting rapid, interpretable screening of alternative DEL designs. Collectively, these results position GTM as an efficient tool for chemical‐library similarity assessment and target‐focused DEL selection.
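
The two agreement measures quoted above, the Spearman rank correlation and the enrichment factor at 5%, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the convention that a higher score means a better library, and the default choice of taking the top 5% by the CP metric as the “gold standard” set are all assumptions made for illustration.

```python
import math


def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def enrichment_factor(sa_scores, cp_scores, fraction=0.05, n_reference_top=None):
    """EF of the SA ranking with respect to the CP ("gold standard") ranking.

    Assumptions (hypothetical, not taken from the paper): higher score means
    a better library, and the gold-standard set defaults to the same size as
    the selected top fraction.
    """
    n = len(sa_scores)
    n_selected = max(1, round(fraction * n))
    if n_reference_top is None:
        n_reference_top = n_selected
    by_sa = sorted(range(n), key=lambda i: sa_scores[i], reverse=True)
    by_cp = sorted(range(n), key=lambda i: cp_scores[i], reverse=True)
    hits = len(set(by_sa[:n_selected]) & set(by_cp[:n_reference_top]))
    # (hits / n_selected) / (n_reference_top / n), written as a single ratio.
    return hits * n / (n_selected * n_reference_top)


# Toy data: 100 libraries whose SA scores mostly follow the CP scores.
cp = [float(i) for i in range(100)]
sa = list(cp)
for low, high in ((0, 95), (1, 96), (2, 97)):
    sa[low], sa[high] = sa[high], sa[low]  # misplace three top libraries

print("Spearman:", spearman(sa, cp))
print("EF5%:", enrichment_factor(sa, cp))
```

Under these assumptions, with 100 libraries the top 5% contains five libraries, so each gold-standard library recovered by the SA ranking contributes 4 to EF5%, and the maximum attainable value is 20.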
