Toward generalizable predictive models for DNA-encoded libraries

19 February 2026

Vasanthanathan Poongavanam, S. Pauliina Turunen, Kristian Sandberg, Ulrika Yngve, Johan Wannberg

Drug Discovery Today

DOI: 10.1016/j.drudis.2026.104629

Abstract

DNA-encoded libraries (DELs) combined with machine learning (ML) offer a powerful paradigm for hit identification. However, sequencing-derived enrichment data are inherently noisy and biased, often resulting in models that overfit to specific chemical libraries. In this review, we critically evaluate the capabilities and limitations of DEL-ML, illustrating key challenges using Aurora Kinase A (AURKA) DEL affinity selection data. We demonstrate that standard ML models often struggle to generalize to unseen chemical space because of the specific structural constraints of combinatorial libraries. Furthermore, we discuss the necessity of rigorous denoising strategies and evaluate approaches, such as domain adaptation, to mitigate these limitations, offering a roadmap for building robust models capable of exploring diverse chemical space.

Summary

This review critically examines the integration of machine learning (ML) with DNA-encoded library (DEL) technology for drug discovery. While DEL-ML offers a powerful paradigm for hit identification by generating massive binding datasets (10⁶–10¹² data points), the authors identify a critical "generalizability gap" that limits the practical utility of current models. Using Aurora Kinase A (AURKA) as a case study with OpenDEL 4.0 screening data (~1.5 million data points), the authors demonstrate that standard ML models achieve high accuracy on internal validation but frequently fail to generalize to structurally novel scaffolds. They attribute this failure to domain shift, the substantial difference between DEL chemical space and known pharmacological compounds. The review provides methodological best practices for data preprocessing, denoising, and validation, and evaluates advanced strategies such as domain adaptation to improve model robustness. The authors argue that future DEL-ML development must move beyond simple accuracy maximization toward explicit handling of distribution shifts, so that DEL-ML can change from a retrospective analysis tool into a reliable engine for novel chemical discovery.

Highlights

1. The Generalizability Challenge in DEL-ML

Models trained on DEL data often memorize library-specific building blocks rather than learning transferable structure-activity relationships
The BELKA competition revealed that models perform well on test sets within the same chemical space but fail on structurally novel scaffolds
Domain shift between DEL training data and external compound collections represents a fundamental barrier to practical application

2. Data Quality and Preprocessing Considerations

DEL sequencing data contains unique noise profiles including matrix binding, DNA-tag interference, unequal synthesis yields, and "jackpot" effects
Multiple denoising strategies are evaluated: fold-enrichment, Z-scores for ultra-large libraries, disynthon aggregation, and uncertainty-aware probabilistic loss functions
Subtracting background noise measured in control experiments (matrix/bead-only) is critical to prevent false positives
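
As a rough illustration of the count-based denoising ideas listed above, the following minimal sketch computes a fold-enrichment against a control selection and a simple binomial Z-score. The function names, the pseudocount, and the uniform-expectation null model are assumptions made for illustration; they are not taken from the review, which may use different normalizations.

```python
import math

def fold_enrichment(target_counts, control_counts, pseudocount=1.0):
    """Per-compound ratio of read frequencies (target vs. control selection).

    Both arguments map compound id -> raw read count. The pseudocount is an
    assumed smoothing choice that avoids division by zero for unseen compounds.
    """
    ids = set(target_counts) | set(control_counts)
    n_target = sum(target_counts.values()) + pseudocount * len(ids)
    n_control = sum(control_counts.values()) + pseudocount * len(ids)
    result = {}
    for cid in ids:
        f_target = (target_counts.get(cid, 0) + pseudocount) / n_target
        f_control = (control_counts.get(cid, 0) + pseudocount) / n_control
        result[cid] = f_target / f_control
    return result

def binomial_z_scores(counts, library_size):
    """Z-score of each observed count against a uniform-sampling null model.

    Assumes every library member is equally likely to be sequenced, which
    ignores unequal synthesis yields (a simplification).
    """
    n_reads = sum(counts.values())
    p0 = 1.0 / library_size
    sd = math.sqrt(n_reads * p0 * (1.0 - p0))
    return {cid: (k - n_reads * p0) / sd for cid, k in counts.items()}

# Toy counts: compound "A" is enriched on target, "C" binds the matrix.
target = {"A": 50, "B": 5, "C": 5}
control = {"A": 5, "B": 5, "C": 50}
enrichment = fold_enrichment(target, control)
z_scores = binomial_z_scores(target, library_size=3)
```

In this toy case the matrix binder "C" ends up with an enrichment below 1 even though its raw target count equals that of "B", which is the effect background subtraction is meant to achieve.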

3. Class Imbalance and Data Splitting Strategies

DEL selections produce highly imbalanced datasets (10¹–10⁴ binders vs. up to ~10¹² nonbinders)
Random splitting leads to overoptimistic metrics due to high structural similarity within DEL congeneric series
Scaffold-based or library-based splitting provides more rigorous assessment of generalizability to novel chemotypes
Undersampling nonbinders (e.g., to a 1:1 ratio) can boost external sensitivity from ~1% to 20–30%, though this gain may reflect bias exploitation rather than true generalization
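
The splitting and undersampling ideas above can be sketched in a few lines. This is a minimal illustration, assuming each record is a dictionary that already carries a precomputed group key (for example a scaffold identifier or sublibrary number) and a boolean label; the field names `scaffold` and `binder` and both helper functions are hypothetical.

```python
import random
from collections import defaultdict

def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Hold out whole groups so that no scaffold/library spans train and test."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    keys = sorted(groups)                 # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    target_size = test_fraction * len(records)
    train, test = [], []
    for key in keys:
        # Fill the test set group by group until it reaches the target size.
        bucket = test if len(test) < target_size else train
        bucket.extend(groups[key])
    return train, test

def undersample_nonbinders(records, label_key="binder", ratio=1.0, seed=0):
    """Keep all binders and at most `ratio` nonbinders per binder."""
    binders = [r for r in records if r[label_key]]
    nonbinders = [r for r in records if not r[label_key]]
    k = min(len(nonbinders), int(round(ratio * len(binders))))
    return binders + random.Random(seed).sample(nonbinders, k)

# Toy data: 10 scaffolds, 10 compounds each, one binder per scaffold.
records = [
    {"scaffold": f"s{i}", "binder": j == 0}
    for i in range(10) for j in range(10)
]
train, test = group_split(records, "scaffold", test_fraction=0.2)
balanced_train = undersample_nonbinders(train, ratio=1.0)
```

Undersampling is applied to the training portion only, so the held-out groups keep their original class imbalance and the evaluation stays realistic.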

4. Molecular Representation and Model Architectures

Traditional fingerprints and physicochemical descriptors often fail to capture subtle variations in DEL compounds
Graph neural networks (GNNs) and variational autoencoders (VAEs) show promise but require careful handling of linker/DNA-tag artifacts
Compositional (disynthon) approaches reduce sparsity but risk losing "whole-molecule" structural fidelity
Conformal prediction frameworks provide calibrated confidence intervals essential for prioritizing predictions in noisy DEL environments
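
To make the conformal prediction point concrete, here is a minimal sketch of split (inductive) conformal classification for a binder/nonbinder model. It assumes a held-out calibration set with predicted binder probabilities and true labels; the function names and the nonconformity score (one minus the probability assigned to the true class) are illustrative choices, not the procedure used in the review. For classification the output is a set of plausible labels rather than an interval.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Nonconformity threshold from a calibration set.

    cal_probs: predicted P(binder) per calibration compound.
    cal_labels: true labels (1 = binder, 0 = nonbinder).
    alpha: tolerated error rate (1 - alpha is the target coverage).
    """
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p)
        for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every label is kept
    return scores[rank - 1]

def prediction_set(p_binder, threshold):
    """Labels whose nonconformity score does not exceed the threshold."""
    labels = set()
    if 1.0 - p_binder <= threshold:   # score for label 1
        labels.add(1)
    if p_binder <= threshold:         # score for label 0 is 1 - (1 - p)
        labels.add(0)
    return labels

cal_probs = [0.9, 0.8, 0.7, 0.6, 0.95, 0.1, 0.2, 0.3, 0.4, 0.05]
cal_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
confident_binder = prediction_set(0.95, threshold)
confident_nonbinder = prediction_set(0.02, threshold)
```

Compounds whose prediction set contains exactly one label are natural candidates for prioritization; sets with both labels or none flag predictions the model cannot support at the chosen error rate.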

5. Domain Adaptation as a Solution Strategy

Covariate shift correction reduces divergence between source (DEL) and target (known binder) domains
Using high-confidence predictions from diverse compound collections (e.g., Enamine REAL Diversity Set) as an intermediate domain improves generalization
Domain adaptation reduced the PCA centroid distance between DEL training data and known AURKA space from 0.77 to 0.32
Retraining with both predicted binders and nonbinders improved the Matthews correlation coefficient (MCC) from 0.2 to 0.4 on external datasets while maintaining 20–39% sensitivity
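
The two metrics quoted above can be computed as in the following minimal sketch. It assumes binary labels coded as 0/1 and compound coordinates that have already been projected into a shared low-dimensional space (for example by PCA); the function names are illustrative, and the toy numbers are not data from the review.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """MCC from binary labels; returns 0.0 when the denominator vanishes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def centroid_distance(points_a, points_b):
    """Euclidean distance between the mean points of two coordinate sets."""
    def centroid(points):
        n = len(points)
        return [sum(column) / n for column in zip(*points)]
    return math.dist(centroid(points_a), centroid(points_b))

# Toy example: one of two true binders is recovered, no false positives.
mcc = matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0])

# Toy 2D projections of a "source" and a "target" compound set.
distance = centroid_distance([[0, 0], [2, 0]], [[1, 3], [1, 5]])
```

MCC is a sensible headline metric here because, unlike accuracy, it stays near zero for a model that labels almost everything as a nonbinder on a heavily imbalanced external set.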

6. AURKA Case Study Findings

OpenDEL 4.0-derived binders tended to be larger, more lipophilic, and less polar compared to known AURKA inhibitors
Despite overall domain shift, highly enriched DEL hits from sublibrary 27 shared conserved hinge-binding motifs with established inhibitors (e.g., VX-680)
Mechanistic alignment between DEL hits and known binders confirms that domain shift, rather than fundamental binding mode differences, drives prediction failures

Conclusion

The integration of DELs with ML presents transformative opportunities for early drug discovery, but realizing this potential requires overcoming the critical generalizability gap. The primary challenge is not the volume of data but its nature: intrinsic structural biases and systematic false negatives (often linker-induced) cause models to memorize library-specific artifacts rather than learn transferable pharmacophore principles. High internal validation metrics frequently mask failures to extrapolate to novel, pharmacologically relevant scaffolds.

The authors advocate for a paradigm shift in DEL-ML development emphasizing:

Rigorous validation standards:
Moving beyond random splits to scaffold-based and out-of-distribution evaluation

Domain alignment strategies:
Explicit handling of distribution shifts through domain adaptation and transfer learning

Data diversity expansion:
Open-source DEL datasets spanning broader drug-like chemical space to reduce single-library bias

Integration of physics-based priors:
Incorporating docking constraints to reduce overfitting to synthetic artifacts

Uncertainty quantification:
Systematic use of conformal prediction and applicability domain assessment

By pivoting from simple accuracy maximization to robust domain alignment, DEL-ML can evolve from a retrospective analysis tool into a reliable engine for identifying novel chemical starting points. The establishment of standardized benchmarks and community resources will be essential to accelerate the development of generalizable predictive models capable of exploring the vast chemical space beyond individual DEL compositions.
