Toward generalizable predictive models for DNA-encoded libraries

Vasanthanathan Poongavanam, S. Pauliina Turunen, Kristian Sandberg, Ulrika Yngve, Johan Wannberg

Drug Discovery Today 

DOI: 10.1016/j.drudis.2026.104629

Abstract

DNA-encoded libraries (DELs) combined with machine learning (ML) offer a powerful paradigm for hit identification. However, sequencing-derived enrichment data are inherently noisy and biased, often resulting in models that overfit to specific chemical libraries. In this review, we critically evaluate the capabilities and limitations of DEL-ML, illustrating key challenges using Aurora Kinase A (AURKA) DEL affinity selection data. We demonstrate that standard ML models often struggle to generalize to unseen chemical space because of the specific structural constraints of combinatorial libraries. Furthermore, we discuss the necessity of rigorous denoising strategies and evaluate approaches, such as domain adaptation, to mitigate these limitations, offering a roadmap for building robust models capable of exploring diverse chemical space.

Summary

This review critically examines the integration of machine learning (ML) with DNA-encoded library (DEL) technology for drug discovery. While DEL-ML offers a powerful paradigm for hit identification by generating massive binding datasets (10⁶–10¹² data points), the authors identify a critical "generalizability gap" that limits the practical utility of current models. Using Aurora Kinase A (AURKA) as a case study with OpenDEL 4.0 screening data (~1.5 million data points), the authors demonstrate that standard ML models achieve high accuracy on internal validation but frequently fail to generalize to structurally novel scaffolds due to domain shift—the substantial difference between DEL chemical space and known pharmacological compounds. The review provides methodological best practices for data preprocessing, denoising, and validation, while evaluating advanced strategies such as domain adaptation to improve model robustness. The authors argue that future DEL-ML development must move beyond simple accuracy maximization toward explicit handling of distribution shifts to transform DEL-ML from a retrospective analysis tool into a reliable engine for novel chemical discovery.

Highlights

1. The Generalizability Challenge in DEL-ML

  • Models trained on DEL data often memorize library-specific building blocks rather than learning transferable structure-activity relationships
  • The BELKA competition revealed that models perform well on test sets within the same chemical space but fail on structurally novel scaffolds
  • Domain shift between DEL training data and external compound collections represents a fundamental barrier to practical application

2. Data Quality and Preprocessing Considerations

  • DEL sequencing data carry distinctive noise sources, including matrix binding, DNA-tag interference, unequal synthesis yields, and "jackpot" effects
  • Multiple denoising strategies are evaluated: fold-enrichment, Z-scores for ultra-large libraries, disynthon aggregation, and uncertainty-aware probabilistic loss functions
  • Subtracting background signal measured in control experiments (matrix- or bead-only) is critical for preventing false positives
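
The fold-enrichment and Z-score calculations named above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the pseudocount of 1, and the binomial null model (every library member equally likely to be sequenced) are assumptions introduced here.

```python
import math


def fold_enrichment(target_count, target_total, control_count, control_total,
                    pseudocount=1.0):
    """Ratio of a compound's read frequency in the target selection to its
    frequency in the matrix/bead-only control.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a compound is absent from the control.
    """
    target_freq = (target_count + pseudocount) / target_total
    control_freq = (control_count + pseudocount) / control_total
    return target_freq / control_freq


def binomial_z_score(count, total_reads, library_size):
    """Z-score of an observed count against a uniform-sampling null model.

    Under the null, each of `library_size` compounds is drawn with
    probability 1 / library_size on every read.
    """
    p = 1.0 / library_size
    expected = total_reads * p
    std = math.sqrt(total_reads * p * (1.0 - p))
    return (count - expected) / std


# Hypothetical counts for one compound.
fe = fold_enrichment(target_count=9, target_total=100,
                     control_count=0, control_total=100)
z = binomial_z_score(count=30, total_reads=1_000_000, library_size=100_000)
```

Fold-enrichment needs a matched control, whereas the Z-score only needs the sequencing depth and library size, which is one reason it is attractive for ultra-large libraries where per-compound control counts are mostly zero.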

3. Class Imbalance and Data Splitting Strategies

  • DEL selections produce highly imbalanced datasets (10¹–10⁴ binders vs. up to ~10¹² nonbinders)
  • Random splitting leads to overoptimistic metrics due to high structural similarity within DEL congeneric series
  • Scaffold-based or library-based splitting provides more rigorous assessment of generalizability to novel chemotypes
  • Undersampling nonbinders (e.g., 1:1 ratio) can boost external sensitivity from ~1% to 20–30%, though this may reflect bias exploitation rather than true generalization
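
A group-aware split and nonbinder undersampling might look like the sketch below. It assumes each compound already carries a grouping key (for example a Bemis-Murcko scaffold string or a sublibrary ID computed elsewhere); the function names and the greedy fill strategy are illustrative choices, not taken from the review.

```python
import random
from collections import defaultdict


def group_split(group_keys, test_fraction=0.2, seed=0):
    """Split indices so that no group (scaffold or library) spans both sets.

    Groups are shuffled, then assigned whole to the test set until it holds
    roughly `test_fraction` of all compounds; the rest go to training.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(group_keys):
        groups[key].append(idx)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test_target = test_fraction * len(group_keys)
    train, test = [], []
    for key in keys:
        if len(test) < n_test_target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return train, test


def undersample(labels, ratio=1.0, seed=0):
    """Keep all binders (label 1) and a random subset of nonbinders (label 0)
    so that nonbinders number about `ratio` times the binders."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = random.Random(seed).sample(neg, n_keep)
    return sorted(pos + kept)


# Hypothetical scaffold keys and labels.
scaffolds = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
train_idx, test_idx = group_split(scaffolds, test_fraction=0.2)
balanced_idx = undersample([1, 0, 0, 0, 0, 1, 0, 0], ratio=1.0)
```

Undersampling should be applied to the training indices only; the held-out set keeps its natural imbalance so that reported metrics reflect realistic screening conditions.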

4. Molecular Representation and Model Architectures

  • Traditional fingerprints and physicochemical descriptors often fail to capture subtle variations in DEL compounds
  • Graph neural networks (GNNs) and variational autoencoders (VAEs) show promise but require careful handling of linker/DNA-tag artifacts
  • Compositional (disynthon) approaches reduce sparsity but risk losing "whole-molecule" structural fidelity
  • Conformal prediction frameworks provide calibrated confidence intervals essential for prioritizing predictions in noisy DEL environments
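
To make the conformal prediction idea concrete, here is a minimal split-conformal sketch for a binder/nonbinder classifier. It assumes a held-out calibration set with predicted binder probabilities from any model; the function names and the nonconformity score (one minus the probability assigned to the true class) are choices made for this illustration. For classification the output is a prediction set rather than an interval.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Nonconformity threshold giving about (1 - alpha) coverage.

    cal_probs  : predicted probability of class 1 (binder) per calibration item
    cal_labels : true labels (0 or 1)
    """
    # Score = 1 - probability assigned to the true class.
    scores = sorted((1.0 - p) if y == 1 else p
                    for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every label is retained
    return scores[rank - 1]


def prediction_set(prob_binder, threshold):
    """Labels whose nonconformity score does not exceed the threshold."""
    labels = set()
    if 1.0 - prob_binder <= threshold:
        labels.add(1)
    if prob_binder <= threshold:
        labels.add(0)
    return labels


# Hypothetical calibration data.
cal_probs = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1, 0.3, 0.4, 0.95]
cal_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1]
thr = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

confident = prediction_set(0.9, thr)   # single-label set
ambiguous = prediction_set(0.5, thr)   # neither label is conforming enough
```

Single-label sets can be prioritized for follow-up, while empty or two-label sets flag compounds on which the model is unreliable. The coverage guarantee relies on calibration and test compounds being exchangeable, which is exactly the assumption that domain shift violates.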

5. Domain Adaptation as a Solution Strategy

  • Covariate shift correction reduces divergence between source (DEL) and target (known binder) domains
  • Using high-confidence predictions from diverse compound collections (e.g., Enamine REAL Diversity Set) as an intermediate domain improves generalization
  • Domain adaptation reduced PCA centroid distance from 0.77 to 0.32 between DEL training data and known AURKA space
  • Retraining with both predicted binders and nonbinders improved Matthews Correlation Coefficient (MCC) from 0.2 to 0.4 on external datasets while maintaining 20–39% sensitivity
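
The two quantities reported above can be computed as in the sketch below. It assumes the compounds have already been projected into a shared PCA space (the projection itself is not shown) and that confusion-matrix counts are available; `centroid_distance` and `mcc` are names chosen for this illustration.

```python
import math


def centroid_distance(points_a, points_b):
    """Euclidean distance between the centroids of two point clouds,
    e.g. DEL training compounds vs. known binders in PCA coordinates."""
    def centroid(points):
        n = len(points)
        return [sum(column) / n for column in zip(*points)]

    return math.dist(centroid(points_a), centroid(points_b))


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal is empty (undefined denominator),
    a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical 2D PCA coordinates and confusion counts.
del_space = [(0.0, 0.0), (2.0, 0.0)]
known_space = [(1.0, 3.0), (1.0, 5.0)]
shift = centroid_distance(del_space, known_space)
score = mcc(tp=3, fp=1, tn=3, fn=1)
```

A shrinking centroid distance indicates that the two domains overlap more on average, but it says nothing about the spread or shape of each distribution, so it is best read alongside an external-set metric such as MCC.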

6. AURKA Case Study Findings

  • OpenDEL 4.0-derived binders tended to be larger, more lipophilic, and less polar compared to known AURKA inhibitors
  • Despite overall domain shift, highly enriched DEL hits from sublibrary 27 shared conserved hinge-binding motifs with established inhibitors (e.g., VX-680)
  • Mechanistic alignment between DEL hits and known binders confirms that domain shift, rather than fundamental binding mode differences, drives prediction failures

Conclusion

The integration of DELs with ML presents transformative opportunities for early drug discovery, but realizing this potential requires overcoming the critical generalizability gap. The primary challenge is not data volume but data nature: intrinsic structural biases and systematic false negatives (often linker-induced) cause models to memorize library-specific artifacts rather than learn transferable pharmacophore principles. High internal validation metrics frequently mask failures to extrapolate to novel, pharmacologically relevant scaffolds.

The authors advocate for a paradigm shift in DEL-ML development emphasizing:

  • Rigorous validation standards: moving beyond random splits to scaffold-based and out-of-distribution evaluation
  • Domain alignment strategies: explicit handling of distribution shifts through domain adaptation and transfer learning
  • Data diversity expansion: open-source DEL datasets spanning broader drug-like chemical space to reduce single-library bias
  • Integration of physics-based priors: incorporating docking constraints to reduce overfitting to synthetic artifacts
  • Uncertainty quantification: systematic use of conformal prediction and applicability domain assessment

By pivoting from simple accuracy maximization to robust domain alignment, DEL-ML can evolve from a retrospective analysis tool into a reliable engine for identifying novel chemical starting points. The establishment of standardized benchmarks and community resources will be essential to accelerate the development of generalizable predictive models capable of exploring the vast chemical space beyond individual DEL compositions.
