Vasanthanathan Poongavanam, S. Pauliina Turunen, Kristian Sandberg, Ulrika Yngve, Johan Wannberg
Drug Discovery Today
DOI: 10.1016/j.drudis.2026.104629
Abstract
DNA-encoded libraries (DELs) combined with machine learning (ML) offer a powerful paradigm for hit identification. However, sequencing-derived enrichment data are inherently noisy and biased, often resulting in models that overfit to specific chemical libraries. In this review, we critically evaluate the capabilities and limitations of DEL-ML, illustrating key challenges using Aurora Kinase A (AURKA) DEL affinity selection data. We demonstrate that standard ML models often struggle to generalize to unseen chemical space because of the specific structural constraints of combinatorial libraries. Furthermore, we discuss the necessity of rigorous denoising strategies and evaluate approaches, such as domain adaptation, to mitigate these limitations, offering a roadmap for building robust models capable of exploring diverse chemical space.
Summary
This review critically examines the integration of machine learning (ML) with DNA-encoded library (DEL) technology for drug discovery. While DEL-ML offers a powerful paradigm for hit identification by generating massive binding datasets (10⁶–10¹² data points), the authors identify a critical "generalizability gap" that limits the practical utility of current models. Using Aurora Kinase A (AURKA) as a case study with OpenDEL 4.0 screening data (~1.5 million data points), the authors demonstrate that standard ML models achieve high accuracy on internal validation but frequently fail to generalize to structurally novel scaffolds. They attribute this failure to domain shift: the chemical space covered by a DEL differs substantially from that of known pharmacologically active compounds. The review provides methodological best practices for data preprocessing, denoising, and validation, and evaluates advanced strategies such as domain adaptation to improve model robustness. The authors argue that future DEL-ML development must move beyond simple accuracy maximization toward explicit handling of distribution shifts, so that DEL-ML can grow from a retrospective analysis tool into a reliable engine for novel chemical discovery.
Highlights
1. The Generalizability Challenge in DEL-ML
2. Data Quality and Preprocessing Considerations
3. Class Imbalance and Data Splitting Strategies
4. Molecular Representation and Model Architectures
5. Domain Adaptation as a Solution Strategy
6. AURKA Case Study Findings
Conclusion
The integration of DELs with ML presents transformative opportunities for early drug discovery, but realizing this potential requires overcoming the critical generalizability gap. The primary challenge is not data volume but data nature: intrinsic structural biases and systematic false negatives (often linker-induced) cause models to memorize library-specific artifacts rather than learn transferable pharmacophore principles. High internal validation metrics frequently mask failures to extrapolate to novel, pharmacologically relevant scaffolds.
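One way to expose this masking effect is to hold out entire scaffolds rather than random compounds, so that no core structure seen in training reappears in the test set. The sketch below is a minimal illustration and not the authors' protocol. It assumes each compound already carries a scaffold identifier (for example, a Bemis-Murcko scaffold SMILES computed beforehand with a cheminformatics toolkit); the function name `scaffold_split` and the `test_fraction` parameter are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split compound indices so that no scaffold is shared by train and test.

    `scaffolds` holds one scaffold identifier per compound (assumed to be
    precomputed; any hashable key works for this sketch).
    """
    # Group compound indices by their scaffold identifier
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Visit the largest groups first, so rare scaffolds tend to land in test
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_total = len(scaffolds)
    n_train_target = n_total - int(round(n_total * test_fraction))

    train, test = [], []
    for group in ordered:
        # A group is never divided: it goes to train only if it fits entirely
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Toy scaffold labels standing in for real scaffold SMILES
    example = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
    train_idx, test_idx = scaffold_split(example, test_fraction=0.2)
    print("train:", train_idx)
    print("test:", test_idx)
```

Scoring the same model on a random split and on a split like this one gives a rough indication of how much of its internal performance depends on scaffolds it has already seen.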
The authors advocate for a paradigm shift in DEL-ML development emphasizing:
Rigorous validation standards: Moving beyond random splits to scaffold-based and out-of-distribution evaluation
Domain alignment strategies: Explicit handling of distribution shifts through domain adaptation and transfer learning
Data diversity expansion: Open-source DEL datasets spanning broader drug-like chemical space to reduce single-library bias
Integration of physics-based priors: Incorporating docking constraints to reduce overfitting to synthetic artifacts
Uncertainty quantification: Systematic use of conformal prediction and applicability domain assessment
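Of the points listed above, uncertainty quantification lends itself to a compact illustration. The following is a minimal sketch of split (inductive) conformal prediction for a binary binder/non-binder classifier, offered as an example of the general technique rather than the procedure used in the review. The function names, the toy calibration values, and the choice of `1 - p(true class)` as the nonconformity score are assumptions made here. The coverage guarantee also assumes that calibration and test compounds are exchangeable, which is exactly the assumption that domain shift can violate.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the nonconformity threshold from a held-out calibration set.

    cal_probs: predicted probability of class 1 (binder) per compound.
    cal_labels: true labels (0 or 1) for the same compounds.
    alpha: tolerated error rate (target coverage is 1 - alpha).
    """
    scores = []
    for p, y in zip(cal_probs, cal_labels):
        p_true = p if y == 1 else 1.0 - p
        scores.append(1.0 - p_true)  # high score = prediction disagrees with truth
    scores.sort()

    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: every label is kept
    return scores[k - 1]


def prediction_set(prob, threshold):
    """Return every label whose nonconformity score is within the threshold."""
    labels = set()
    if 1.0 - prob <= threshold:  # score if the compound were a binder
        labels.add(1)
    if prob <= threshold:  # score if the compound were a non-binder
        labels.add(0)
    return labels


if __name__ == "__main__":
    # Toy calibration data (illustrative values, not from the AURKA study)
    cal_probs = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.95, 0.05, 0.6]
    cal_labels = [1, 1, 0, 0, 1, 0, 1, 0, 1]
    t = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
    for prob in (0.95, 0.5, 0.1):
        print(prob, prediction_set(prob, t))
```

A set with a single label is a confident call. An empty set or a set containing both labels flags a compound for which the model's output should not be trusted without further evidence, which makes such sets a simple applicability-domain signal.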
By pivoting from simple accuracy maximization to robust domain alignment, DEL-ML can evolve from a retrospective analysis tool into a reliable engine for identifying novel chemical starting points. The establishment of standardized benchmarks and community resources will be essential to accelerate the development of generalizable predictive models capable of exploring the vast chemical space beyond individual DEL compositions.