Artur Menzeleev, Sathya Chitturi, Geraint Davies, Tony Schroeder, Alpha Lee
ChemRxiv
DOI: 10.26434/chemrxiv-2025-8rw8j
Abstract
DNA-encoded libraries (DELs) are a powerful way to find chemical starting points against challenging biological targets, by rapidly generating billion-scale structure-activity datasets. However, DEL experiment design and interpretation, especially the optimal use of machine learning (ML) to analyse the vast amount of generated data or to screen large external purchasable datasets, remain poorly understood. To address these challenges, we report the development of a digital twin – an in-silico DEL simulator – that models the underlying chemistry and selection processes of typical experiments as a function of key design parameters, including read count, cycles of selection, one-step reaction yield, and library size. We systematically investigate how these design parameters influence downstream ML virtual screening and identify specific regimes where the choice to apply preprocessing steps such as disynthon aggregation can significantly enhance screening performance. In addition, we show that increasing library size can degrade ML-based screening performance. Our simulator provides a statistically principled way to understand and analyse DEL experiments via an interpretable model for DEL data generation.
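
To make the design parameters named in the abstract concrete, the following is a minimal illustrative sketch of how a toy DEL read-count simulator could be parameterised. It is not the authors' simulator. The function name `simulate_del_counts`, the hit fraction, the per-round retention probabilities, the number of synthesis steps, and the assumption that incompletely synthesised material behaves like a non-binder are all hypothetical choices made for illustration only.

```python
import numpy as np


def simulate_del_counts(
    n_compounds=10_000,    # library size
    read_count=100_000,    # total sequencing reads
    selection_rounds=2,    # cycles of selection
    step_yield=0.8,        # one-step reaction yield
    n_steps=3,             # synthesis steps per compound (assumed)
    hit_fraction=0.01,     # fraction of true binders (assumed)
    hit_retention=0.5,     # per-round retention of a binder (assumed)
    background=0.01,       # per-round retention of a non-binder (assumed)
    seed=0,
):
    """Toy model: synthesis yield -> selection enrichment -> sequencing."""
    rng = np.random.default_rng(seed)

    # Randomly label a small fraction of the library as true binders.
    is_hit = rng.random(n_compounds) < hit_fraction
    retention = np.where(is_hit, hit_retention, background)

    # Fraction of each DNA tag attached to the fully synthesised product.
    product_fraction = step_yield ** n_steps

    # Assumption: truncated by-products are retained at the background rate.
    abundance = (
        product_fraction * retention ** selection_rounds
        + (1.0 - product_fraction) * background ** selection_rounds
    )

    # Sequencing is modelled as multinomial sampling of the surviving pool.
    probs = abundance / abundance.sum()
    counts = rng.multinomial(read_count, probs)
    return is_hit, counts


is_hit, counts = simulate_del_counts()
print("mean reads, binders:    ", counts[is_hit].mean())
print("mean reads, non-binders:", counts[~is_hit].mean())
```

In this toy setting, raising `n_compounds` at a fixed `read_count` spreads the same number of reads over more library members, which is one simple way to see why per-compound counts become noisier as library size grows.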