Protein–ligand interaction (PLI) prediction is a central task in computational drug discovery. Existing public affinity datasets such as BindingDB and ChEMBL are highly heterogeneous in origin, having been aggregated from thousands of laboratories using many different experimental protocols. As a result, they suffer from systematic bias and substantial standardization challenges, which in turn limit model generalization. In contrast, DNA-encoded library (DEL) technology enables ultrahigh-throughput screening of billions of compounds under unified experimental protocols, providing a new source of large-scale, high-quality training data for PLI modeling.
Hermes adopts a lightweight Transformer-based architecture; its main components are summarized in Figure 1.
For final inference, the authors used an ensemble of nine checkpoints trained with different hyperparameter settings and training-sampling strategies, and averaged their predictions.
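The ensembling step is conceptually simple: run all nine checkpoints and average their scores. A minimal sketch (the `predict`-style callables and the toy scores below are illustrative, not from the paper):

```python
import numpy as np

def ensemble_predict(models, protein, ligand):
    """Average binding scores over an ensemble of model checkpoints.

    `models` is a list of callables mapping (protein, ligand) -> float;
    the real Hermes checkpoints differ in hyperparameters and
    training-sampling strategy, but inference is a plain mean.
    """
    scores = np.array([m(protein, ligand) for m in models])
    return scores.mean()

# Toy usage: nine stand-in "checkpoints" that each return a fixed score.
models = [lambda p, l, s=s: s for s in np.linspace(0.1, 0.9, 9)]
print(ensemble_predict(models, "MKT...", "CCO"))  # mean of the nine scores
```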
Figure 1. Hermes architecture diagram.
Hermes was trained on DEL screening data generated from the Kin0 chemical library, a three-cycle library of roughly 6.5 million members built by combining 38 cores with 384 and 446 building blocks at the remaining two cycles (38 × 384 × 446 ≈ 6.5 million). The dataset covers 239 unique protein targets, approximately two-thirds of which are kinases.
Labels were generated through a binarized hit-calling procedure based on enrichment relative to control screens, including DEL-only, bead-only/no-target, and proprietary controls. To manage class imbalance, the training procedure capped the number of positive samples per protein target, retained the highest-enrichment hits, and paired each positive example with a fixed number of negatives, drawn from both random negatives and hard negatives.
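The per-target sampling described above (cap positives by enrichment, then pair each positive with a fixed mix of random and hard negatives) can be sketched as follows. All names, field layouts, and parameter values here are illustrative assumptions, not the paper's actual implementation:

```python
import random

def build_training_pairs(hits, max_pos_per_target=1000, negs_per_pos=4,
                         hard_frac=0.5, seed=0):
    """Sketch of the per-target sampling scheme described in the text.

    `hits` maps target -> {"pos": [(compound, enrichment), ...],
                           "rand_neg": [...], "hard_neg": [...]}.
    Positives are capped per target, keeping the highest-enrichment
    hits; each positive is paired with a fixed number of negatives
    drawn from both random and hard negative pools.
    """
    rng = random.Random(seed)
    pairs = []
    for target, pools in hits.items():
        # Cap positives, retaining the highest-enrichment hits.
        pos = sorted(pools["pos"], key=lambda x: x[1], reverse=True)
        pos = pos[:max_pos_per_target]
        for compound, _ in pos:
            pairs.append((target, compound, 1))
            # Mix hard and random negatives for each positive.
            n_hard = int(negs_per_pos * hard_frac)
            negs = (rng.sample(pools["hard_neg"], n_hard) +
                    rng.sample(pools["rand_neg"], negs_per_pos - n_hard))
            pairs.extend((target, n, 0) for n in negs)
    return pairs
```

With `negs_per_pos=4` each retained positive yields a fixed 1:4 positive-to-negative ratio, which keeps the per-target class balance constant regardless of how many raw hits a screen produced.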
Table 1. Training and evaluation dataset statistics.
Hermes was evaluated on four benchmark datasets designed to test different forms of generalization:
1. DEL Protein Split: 164 protein targets not seen during training, screened against the same Kin0 library. This benchmark tests cold-target generalization.
2. DEL Chemical Library Split (STRELKA): 59 protein targets seen in training, but screened against a different 1-million-member benzimidazole library (AMA020). This benchmark tests cold-ligand generalization.
3. Public Binders/Decoys: 403 protein targets, with positives from Papyrus++ and negatives from GuacaMol property-matched synthetic decoys. This benchmark evaluates generalization to external public binding data.
4. MF-PCBA: 26 protein targets from PubChem BioAssay high-throughput screening data, where confirmed dose-response actives are positives and primary-screen inactives are negatives. This benchmark tests performance on heterogeneous public screening data.
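The per-protein AUROC used in these comparisons (Table 2) scores each target independently and then looks at the distribution across targets. A minimal self-contained sketch, assuming evaluation records arrive as flat `(target, score, label)` tuples (a format chosen here for illustration):

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) formulation."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # Average ranks over tied scores.
    for s in np.unique(scores):
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def per_protein_auroc(records):
    """records: iterable of (target, score, label) -> {target: AUROC}."""
    by_target = {}
    for t, s, y in records:
        by_target.setdefault(t, ([], []))
        by_target[t][0].append(s)
        by_target[t][1].append(y)
    return {t: auroc(s, y) for t, (s, y) in by_target.items()}
```

Grouping by target before computing AUROC matters: pooling scores across proteins would let between-target score offsets dominate the metric.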
The authors compared Hermes against two baselines (see Table 2).
Table 2. Hermes vs benchmarks per-protein AUROC comparison.
Key Findings
The results show clear variation across benchmarks, but several important patterns emerge.
The paper also reports that Hermes performs better on kinase targets than on non-kinases in most benchmarks, which is consistent with the kinase-enriched composition of the training set. This suggests that targeted DEL data generation can improve generalization within a protein family.
Table 3. Inference speed comparison.
Hermes is designed for efficient inference. After correcting for hardware differences, the authors estimate that Hermes is approximately 500–700 times faster than Boltz-2. Its sequence-only design allows protein embeddings to be cached, making it well suited for billion-scale virtual screening campaigns.
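The caching point can be made concrete: because the protein branch depends only on the sequence, its (expensive) embedding can be computed once per target and reused across billions of ligands, so only the cheap ligand branch runs per compound. A sketch under stated assumptions (`embed_protein` is a stand-in for the model's protein encoder; the placeholder body is not the real model):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def embed_protein(sequence: str) -> tuple:
    # Placeholder: in a real pipeline this would be a Transformer
    # forward pass over the sequence, the dominant per-target cost.
    return tuple(float(ord(c)) for c in sequence)

def score_ligands(sequence, ligands, score_fn):
    """Score many ligands against one target, reusing the cached
    protein embedding; only the ligand branch runs per compound."""
    emb = embed_protein(sequence)  # computed once per unique sequence
    return [score_fn(emb, lig) for lig in ligands]
```

In a billion-scale campaign the embedding cache turns the protein encoder into a one-time setup cost per target rather than a per-pair cost.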
This study shows that a model trained exclusively on DEL screening data can learn transferable representations of protein–ligand interactions. Without ever being trained on traditional affinity measurements, Hermes generalizes to unseen protein targets, unseen chemical scaffolds, and external datasets derived from different experimental systems. This provides a strong proof of concept for the use of DEL data in PLI modeling.
At the same time, the study identifies several limitations. First, the kinase-heavy training distribution leads to weaker performance on non-kinase proteins. Second, binarizing the labels likely restricts model expressiveness. Third, performance on data similar to the training set still appears to be influenced by memorization effects. Future work may benefit from incorporating structure-prediction outputs, modeling continuous enrichment scores instead of binary labels, and improving sampling strategies to better support generalization to unseen protein families.
Overall, as DEL technology continues to mature and generate data at a faster pace than public affinity databases, DEL-trained models such as Hermes are likely to become an important methodology for the next generation of PLI prediction.
Kleinsasser M, Halverson BJ, Kraft E, et al. Hermes: Large DEL Datasets Train Generalizable Protein-Ligand Binding Prediction Models. 2026.