DEL Simulator: A Digital Twin for Understanding Machine Learning on DNA-Encoded Libraries

Artur Menzeleev, Sathya Chitturi, Geraint Davies, Tony Schroeder, Alpha Lee

ChemRxiv

DOI: 10.26434/chemrxiv-2025-8rw8j

Abstract

DNA-encoded libraries (DELs) are a powerful way to find chemical starting points against challenging biological targets, by rapidly generating billion-scale structure-activity datasets. However, DEL experiment design and interpretation, especially the optimal use of machine learning (ML) to analyse the vast amount of generated data or to screen large external purchasable datasets, remain poorly understood. To address these challenges, we report the development of a digital twin – an in-silico DEL simulator – that models the underlying chemistry and selection processes of typical experiments as a function of key design parameters, including read count, cycles of selection, one-step reaction yield, and library size. We systematically investigate how these design parameters influence downstream ML virtual screening and identify specific regimes where the choice to apply preprocessing steps such as disynthon aggregation can significantly enhance screening performance. In addition, we show that increasing library size can degrade ML-based screening performance. Our simulator provides a statistically principled way to understand and analyse DEL experiments via an interpretable model for DEL data generation.
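To make the role of the design parameters concrete, the following is a minimal toy sketch of a DEL data-generating process. It is not the authors' simulator. The function `simulate_del`, the retention probabilities, the number of synthesis steps, and the assumption that incompletely synthesised material behaves like background are all hypothetical choices made for illustration. Only the parameter names read count, rounds (cycles) of selection, one-step reaction yield, and library size come from the abstract.

```python
import random


def simulate_del(n_compounds=1000, n_binders=10, one_step_yield=0.7,
                 n_synthesis_steps=3, selection_rounds=2, read_count=50_000,
                 p_retain_binder=0.5, p_retain_background=0.01, seed=0):
    """Toy DEL experiment: synthesis -> selection -> sequencing.

    All modelling choices here are illustrative assumptions, not the
    model described in the paper.
    """
    rng = random.Random(seed)

    # Assumption: a small random subset of the library are true binders.
    binders = set(rng.sample(range(n_compounds), n_binders))

    # Assumption: the fraction of each DNA tag carrying the intended
    # full-length product is the one-step yield raised to the step count.
    full_product_fraction = one_step_yield ** n_synthesis_steps

    # Relative abundance of each tag after repeated rounds of selection.
    background = p_retain_background ** selection_rounds
    abundance = []
    for i in range(n_compounds):
        if i in binders:
            bound = full_product_fraction * p_retain_binder ** selection_rounds
            truncated = (1.0 - full_product_fraction) * background
            abundance.append(bound + truncated)
        else:
            abundance.append(background)

    # Sequencing: draw a fixed number of reads in proportion to abundance.
    reads = rng.choices(range(n_compounds), weights=abundance, k=read_count)
    counts = [0] * n_compounds
    for i in reads:
        counts[i] += 1
    return binders, counts


if __name__ == "__main__":
    binders, counts = simulate_del()
    mean_binder = sum(counts[i] for i in binders) / len(binders)
    print("mean reads per binder:", mean_binder)
```

In this sketch, holding `read_count` fixed while increasing `n_compounds` spreads the same sequencing budget over more tags, so the counts per compound become sparser and noisier. That is one simple way in which a larger library could make the resulting labels harder to learn from, although the mechanism studied in the paper may differ.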
