Wenyi Zhang, Yuxing Wang, Rui Zhan, Runtong Qian, Qi Hu, Jing Huang
bioRxiv - Biophysics
DOI: 10.1101/2025.06.12.659183
Abstract
DNA-encoded libraries (DELs) facilitate high-throughput screening of trillions of molecules against protein targets through split-pool synthesis and DNA tagging. Despite their potential, only a few DEL-derived compounds have advanced to clinical trials or reached the market. A better understanding of the defining characteristics of target proteins, particularly those with binding pockets suitable for DEL screening, is critical to improving success rates. However, existing approaches remain limited in assessing pocket flexibility and functional similarity. Here, we present ErePOC, a pocket representation model based on contrastive learning with ESM-2 embeddings to address these challenges. ErePOC captures both structural and functional features of binding pockets, enabling identification of shared characteristics among DEL targets. By integrating analyses of low-dimensional physicochemical properties and high-dimensional ErePOC embeddings, we provide a comprehensive view of DEL target space. With 98% precision in downstream classification tasks, ErePOC demonstrates high performance in pocket representation, which is then applied to predict human proteins suitable for DEL screening, with enrichment uncovered across 18 functional categories. This work establishes a new framework for enhancing DEL-based drug discovery through more effective target selection and pocket similarity analysis.
Summary
This study introduces ErePOC, a pocket representation model that applies contrastive learning to ESM-2 embeddings to characterize protein binding pockets amenable to DNA-encoded library (DEL) screening. Although DEL technology can screen trillions of compounds, few DEL-derived compounds have reached clinical trials or the market, which the authors link to a limited understanding of which target pockets suit DEL screening. The researchers analyzed 128 successful DEL targets and compared them with 326,416 general ligand pockets (BioLiP2) and 340 FDA-approved drug pockets, finding that DEL pockets are larger (28.1 vs 16.1 residues), more hydrophobic, and enriched in specific amino acids (Met, Tyr, Trp, Phe, Leu). ErePOC was trained to map pockets into a 256-dimensional latent space aligned with ligand chemical similarity, and it achieved 98% precision in functional classification. Applied to 23,391 AlphaFold2-predicted human proteins, the model identified 2,739 candidate DEL-compatible targets whose pockets show a cosine similarity above 0.8 to known DEL pockets. Enrichment analysis revealed 18 functional categories, notably oxidoreductases, transferases, and multifunctional enzymes. In silico docking of 2.8 million virtual DEL compounds against 14 selected targets indicated that ErePOC-enriched proteins have significantly better predicted binding affinities than neutral controls. This work establishes a computational framework for rational DEL target selection that goes beyond traditional structural similarity metrics.
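The similarity filter described in the summary (pockets with cosine similarity above 0.8 to known DEL pockets) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy 3-dimensional vectors (the real embeddings are 256-dimensional), and the choice to score each candidate by its best match among the reference pockets are all assumptions made for illustration. The source does not state how similarity to multiple reference pockets is aggregated.

```python
import math

# Cutoff quoted in the summary; everything else below is hypothetical.
SIMILARITY_THRESHOLD = 0.8


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def best_del_match(query, del_refs):
    """Return (reference name, similarity) for the closest reference pocket."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in del_refs.items())
    return max(scored, key=lambda pair: pair[1])


def select_del_compatible(candidates, del_refs, threshold=SIMILARITY_THRESHOLD):
    """Keep candidates whose best match exceeds the threshold (assumed rule)."""
    selected = {}
    for name, vec in candidates.items():
        ref_name, sim = best_del_match(vec, del_refs)
        if sim > threshold:
            selected[name] = (ref_name, sim)
    return selected


# Toy embeddings standing in for pocket vectors.
del_refs = {
    "ref_pocket_A": [1.0, 0.0, 0.0],
    "ref_pocket_B": [0.0, 1.0, 0.0],
}
candidates = {
    "cand_1": [0.9, 0.1, 0.0],  # close to ref_pocket_A
    "cand_2": [0.0, 0.0, 1.0],  # orthogonal to both references
    "cand_3": [0.5, 0.5, 0.0],  # halfway between the two references
}

hits = select_del_compatible(candidates, del_refs)
for name, (ref_name, sim) in hits.items():
    print(f"{name}: closest to {ref_name}, cosine similarity {sim:.3f}")
```

In this toy example only `cand_1` passes the cutoff. `cand_3` sits between the two references with a similarity of about 0.71 to each, so it is rejected under the best-match rule. A different aggregation, such as the mean similarity over all reference pockets, would give a different selection.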
Conclusion
ErePOC offers an approach to DEL target selection that learns high-dimensional, function-aware representations of binding pockets, avoiding some limitations of traditional structural alignment. The model identifies a distinctive DEL pocket pattern (larger size, greater hydrophobicity, and specific amino acid biases) and uses it to predict over 2,700 human proteins likely to be amenable to DEL screening, with enrichment across 18 functional categories. By capturing physicochemical relationships rather than relying solely on geometric similarity, ErePOC addresses the challenges of pocket flexibility and low structural overlap among functionally related sites. The significant enrichment of oxidoreductases, transferases, and multifunctional enzymes is consistent with known DEL successes, and the predictions extend the targetable space to chromatin regulators and RNA-binding proteins. In silico docking indicates that ErePOC-selected targets are predicted to bind DEL-like molecules more favorably, supporting the model's practical utility. Beyond improving DEL efficiency, the framework may also apply to virtual screening, molecule generation, and protein design, particularly when combined with structure prediction tools such as AlphaFold3.