Kejia Yan, Guilherme M. Lima, Tara Bahadur, Vincent Albert, Zoe O’Gara, Gary Bao, Christin Kossmann, William Kirby, Fernando B. Mejia, Matthew L. Michnik, Kristen Maiorana, Ratmir Derda
bioRxiv - Biochemistry
DOI: 10.64898/2026.02.14.705946
Abstract
Genetically encoded (GE) libraries enable identification of high-affinity ligands for diverse molecular targets through iterative in vitro selection and DNA sequencing or next-generation sequencing (NGS). Despite their impact in therapeutic development, a systematic framework for evaluating reproducibility in GE-molecular discoveries is still lacking. To aid such analysis, we introduce the concept of baseline response, which reproducibly partitions active and inactive members of an in vitro selection. The baseline response is provided by spiking a random DNA-barcoded population. We calibrated the baseline concept using Bioconductor EdgeR differential enrichment (DE) analysis of NGS data from phage-display selections on the oligosaccharide chitin and hepatitis virus NS3a* protease as model targets. We further show that mixing discovery campaigns also offers an effective baseline: chitin-enriched peptides serve as a baseline for DE-analysis of NS3a* selection and NS3a*-enriched peptides serve as a baseline for chitin binders. We applied baseline-stratified DE-analysis to 66 parallel selections performed in 3–5 replicates across 22 extracellular targets, including HER1-3, EpCAM, CAIX, PD-L1, and eight integrin receptors. Automated DE-analysis across hundreds of NGS files produced hits validated in a secondary screen and yielded synthetic macrocyclic ligands with mid-nanomolar affinity confirmed in 2–3 biophysical assays. For PD-L1, we further demonstrated how baseline-calibrated NGS data provide decision-enabling information for optimization of peptide macrocycles to yield potent single-digit nanomolar ligands for the cell-surface receptor. We anticipate that baseline-based analyses of NGS data from in vitro selection procedures will offer a scalable framework for reproducible hit discovery and standardized analysis across diverse in vitro selection campaigns.
Summary
This work introduces a universal baseline framework for in vitro selection of genetically encoded (GE) libraries, such as phage-displayed peptide libraries, to improve reproducibility, statistical rigor, and cross-target comparability. The core innovation is spiking a DNA-barcoded random peptide library into every selection round, where it serves as an in situ empirical baseline; peptides enriched in an unrelated selection campaign can serve as a “cross-target” baseline in the same way. This baseline mimics naïve library binding behavior and enables robust normalization and differential enrichment (DE) analysis of NGS data using Bioconductor EdgeR. The baseline was calibrated on two model targets, chitin and NS3a* protease, and then applied to 66 parallel selections across 22 extracellular targets (including HER1–3, EpCAM, CAIX, PD-L1, and eight integrin receptors). Baseline-stratified DE identified high-confidence hits, including synthetic macrocyclic ligands with mid-nanomolar affinity confirmed by biophysical assays and, after optimization, single-digit nanomolar ligands for PD-L1. The method also supports functional benchmarking (for example, revealing reduced infectivity in MBX-modified phage libraries) and replaces synthetic or computational baselines with empirically derived, target-agnostic mixtures.
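The idea of a spiked-in baseline can be illustrated with a minimal Python sketch. It is not the authors' pipeline and it does not reproduce EdgeR's statistical model: as a simplified stand-in, it scales read counts by the total baseline reads in each sample, computes a log2 fold change per sequence, and flags library members whose enrichment exceeds the spread of the baseline population. The function names, the sequence labels (`B1`, `P1`, and so on), the pseudocount, the z-score cutoff, and the toy counts are all hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, stdev


def baseline_normalized_lfc(input_counts, selected_counts, baseline_ids, pseudocount=0.5):
    """Log2 fold change (selected vs. input) per sequence.

    Counts are scaled by the total reads of the spiked-in baseline in each
    sample, so baseline members centre near zero by construction.
    """
    base_in = sum(input_counts[s] for s in baseline_ids)
    base_sel = sum(selected_counts[s] for s in baseline_ids)
    lfc = {}
    for seq, n_in in input_counts.items():
        sel = (selected_counts.get(seq, 0) + pseudocount) / base_sel
        inp = (n_in + pseudocount) / base_in
        lfc[seq] = math.log2(sel / inp)
    return lfc


def call_hits(lfc, baseline_ids, z_cutoff=3.0):
    """Flag non-baseline sequences enriched beyond the baseline spread.

    Assumption: a plain mean + z * SD threshold stands in for a proper
    differential enrichment test.
    """
    base = [lfc[s] for s in baseline_ids]
    threshold = mean(base) + z_cutoff * stdev(base)
    return {s for s, v in lfc.items() if s not in baseline_ids and v > threshold}


# Toy read counts (invented for illustration, not data from the study).
# B* = barcoded baseline members, P* = library peptides under selection.
baseline_ids = {"B1", "B2", "B3", "B4"}
input_counts = {"B1": 100, "B2": 120, "B3": 90, "B4": 110,
                "P1": 100, "P2": 100, "P3": 100}
selected_counts = {"B1": 95, "B2": 130, "B3": 85, "B4": 105,
                   "P1": 1600, "P2": 110, "P3": 90}

lfc = baseline_normalized_lfc(input_counts, selected_counts, baseline_ids)
hits = call_hits(lfc, baseline_ids)
print(sorted(hits))
```

In this toy example only `P1` clears the baseline-derived threshold; `P2` and `P3` vary no more than the baseline members do, so they are treated as inactive. A real analysis would use replicate samples and a count-based model such as the one EdgeR provides.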
Conclusion
The universal baseline standardizes hit discovery, improves enrichment fidelity assessment, and enables ML-ready, statistically benchmarked data generation without structural priors.