ProteinMPNN for De Novo Enzyme Design: A Complete Guide for Researchers and Drug Developers

Jaxon Cox, Jan 12, 2026


Abstract

This comprehensive article explores ProteinMPNN, a revolutionary protein sequence design tool based on deep neural networks. We begin by establishing the foundational principles of de novo enzyme design and the limitations of prior computational methods. The guide then details the practical methodology for implementing ProteinMPNN, including input preparation, sequence generation, and application-specific workflows for designing enzymes with novel functions. We address common challenges and optimization strategies for improving success rates and computational efficiency. Finally, we compare ProteinMPNN's performance against other leading models (RFdiffusion, AlphaFold, Rosetta) and discuss rigorous experimental validation frameworks. This resource is tailored for researchers, scientists, and drug development professionals seeking to harness AI for creating functional enzymes.

What is ProteinMPNN? Unpacking the AI Revolution in De Novo Enzyme Design

The design of functional enzymes de novo, without reliance on natural evolutionary templates, represents a grand challenge in biochemistry and synthetic biology. The core difficulty lies in navigating an astronomically large sequence space to identify sequences that will fold into stable structures and catalyze specific reactions with high efficiency and selectivity. Computational methods have become indispensable for this task, transforming it from blind screening to a principled engineering discipline.

This Application Note frames the challenge within the context of a broader thesis on ProteinMPNN, a state-of-the-art protein sequence design neural network. While traditional structure-based design (e.g., using Rosetta) is powerful, it can be computationally expensive for de novo backbone scaffolding. ProteinMPNN offers a fast, robust, and high-performing solution for generating sequences compatible with a given protein backbone, making it a critical tool for the iterative design-test-learn cycles required for successful de novo enzyme creation. The integration of ProteinMPNN with reaction coordinate placement (e.g., using Rosetta or molecular dynamics) and functional site design tools forms the modern computational pipeline for enzyme design.

Quantitative Landscape: Success Rates and Key Metrics

The table below summarizes key quantitative data from recent literature on de novo enzyme design projects, highlighting the scale of the challenge and the role of computational filtering.

Table 1: Performance Metrics in Recent De Novo Enzyme Design Studies

Design Target / Study | Initial Sequence Pool (Computational) | Experimentally Tested | Active Variants Found | Success Rate | Catalytic Efficiency (kcat/KM) | Key Computational Tool
Kemp Eliminase (Huang et al., Nature, 2023 - follow-up) | ~100,000 designs | 128 | 19 | ~14.8% | Up to 1.7 × 10⁵ M⁻¹s⁻¹ | Rosetta, ProteinMPNN, MD
De Novo TIM Barrel for Retro-Aldolase (Polizzi & DeGrado, Science, 2022) | 2,500 backbone architectures | 12 scaffolds | 4 | ~33% (scaffolds) | ~10² M⁻¹s⁻¹ (above background) | RFdiffusion, ProteinMPNN
De Novo Phosphotriesterase-like Lactonase (Rocklin et al., Science, 2017) | 2,903 designs | 44 | 3 | ~6.8% | 1.5 × 10⁴ M⁻¹s⁻¹ | Rosetta
Generalist De Novo Enzyme for Morita-Baylis-Hillman Reaction (Wu et al., Nature, 2024) | >500,000 designs | 279 | 12 | ~4.3% | kcat up to 370 h⁻¹ | Family-wide ProteinMPNN, MD
Average/Representative for earlier (pre-2020) designs (Multiple Sources) | 10⁴ - 10⁶ | 10¹ - 10² | 1-10 | 0.1% - 5% | Often 10² - 10⁴ M⁻¹s⁻¹ | Rosetta (pre-ProteinMPNN)

Key Insight: The data shows that while computational pre-screening improves odds from astronomically low to tractable (~0.1-30% success), experimental validation of dozens to hundreds of designs is still necessary. Success rates are improving with tools like ProteinMPNN, which generate more stable, foldable sequences, thereby increasing the likelihood of functional active site formation.
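The success rates in Table 1 are simply the number of active variants divided by the number of designs tested. A minimal sketch of that arithmetic, using the counts from the table (the short study labels are shorthand introduced here):

```python
# (experimentally tested, active variants) copied from Table 1
studies = {
    "Kemp eliminase": (128, 19),
    "TIM-barrel retro-aldolase (scaffolds)": (12, 4),
    "Phosphotriesterase-like lactonase": (44, 3),
    "Morita-Baylis-Hillman enzyme": (279, 12),
}


def success_rate(tested: int, active: int) -> float:
    """Percentage of experimentally tested designs that showed activity."""
    return 100.0 * active / tested


rates = {name: round(success_rate(t, a), 1) for name, (t, a) in studies.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Note that the denominator is the experimentally tested set, not the much larger computational pool, which is why the rates look far higher than the raw design counts would suggest.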

Core Experimental Protocols

Protocol 1: Integrated Computational Pipeline for De Novo Enzyme Design Using ProteinMPNN

Objective: To generate, rank, and select de novo enzyme sequences for a target reaction.

Materials:

  • Hardware: High-performance computing cluster (CPU/GPU).
  • Software: PyRosetta or Rosetta3, ProteinMPNN (local or API), molecular dynamics suite (e.g., GROMACS, OpenMM), Python/R for analysis.
  • Input: Target reaction mechanism, transition state model (or set of key catalytic residues/orientations - "theozyme").

Procedure:

Step 1: Active Site & Theozyme Definition.

  • Define the reaction's mechanistic steps using quantum mechanics (QM) software (e.g., Gaussian, ORCA).
  • Extract the ideal geometries (bond lengths, angles) of the transition state and key catalytic residues (e.g., a triad, metal coordination sphere). This set of constraints is the "theozyme."
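A theozyme is, in practice, a list of target distances and angles with tolerances. The sketch below shows one way such a constraint could be checked against atom coordinates; the coordinates, target values, and tolerances are invented for illustration and do not come from any specific reaction.

```python
import math


def distance(a, b):
    """Euclidean distance between two 3D points (in Å)."""
    return math.dist(a, b)


def angle_deg(a, b, c):
    """Angle at point b formed by a-b-c, in degrees."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against rounding slightly outside [-1, 1]
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def satisfies(value, target, tolerance):
    """True if a measured value lies within tolerance of the theozyme target."""
    return abs(value - target) <= tolerance


# Hypothetical coordinates: catalytic base atom, substrate atom, second residue atom
base_atom = (0.0, 0.0, 0.0)
substrate_atom = (3.0, 0.0, 0.0)
partner_atom = (0.0, 2.8, 0.0)

d = distance(base_atom, substrate_atom)
theta = angle_deg(substrate_atom, base_atom, partner_atom)
print(satisfies(d, target=3.0, tolerance=0.3), satisfies(theta, target=90.0, tolerance=15.0))
```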

Step 2: De Novo Backbone Scaffold Generation.

  • Use a de novo backbone generator like RFdiffusion or RosettaRemodel to create protein backbones that can spatially accommodate the theozyme geometry.
  • Input: Theozyme residue coordinates as constraints.
  • Output: A library of 1,000-10,000 unique backbone structures (PDB format).

Step 3: Sequence Design with ProteinMPNN.

  • Prepare each generated backbone (scaffold) PDB file. Ensure correct chain IDs and remove any non-scaffold residues.
  • Run ProteinMPNN in "fixed backbone" mode.
    • Specify positions to be designed (typically all except catalytic theozyme residues, which are fixed).
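    • If specific amino acids are known to be preferred at non-catalytic but structurally important positions, bias the design at those positions with a per-residue bias file (--bias_by_res_jsonl). Note that --conditional_probs_only does not bias designs; it reports conditional amino acid probabilities for an existing sequence instead of sampling new ones.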
    • Generate 8-64 sequences per backbone.
  • Output: A fasta file containing thousands of designed protein sequences.
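Biasing particular positions in Step 3 requires a per-residue bias file. The sketch below builds one entry in the layout that, to the best of my knowledge, --bias_by_res_jsonl expects: a dictionary keyed by structure name and chain, holding one row of 21 values per residue. The alphabet order, the scaffold name, the chain length, and the bias values are assumptions to verify against the helper scripts in the ProteinMPNN repository.

```python
import json

# Assumed residue order used by ProteinMPNN (20 amino acids plus X)
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def bias_by_res(name, chain, length, position_bias):
    """Build a per-residue bias entry.

    position_bias maps a 1-indexed position to {amino acid: bias value};
    positive values favor that amino acid, all other entries stay at 0.0.
    """
    matrix = [[0.0] * len(ALPHABET) for _ in range(length)]
    for pos, aa_bias in position_bias.items():
        for aa, value in aa_bias.items():
            matrix[pos - 1][ALPHABET.index(aa)] = value
    return {name: {chain: matrix}}


# Hypothetical example: favor aliphatic residues at core position 45 of chain A
entry = bias_by_res("my_scaffold", "A", 120, {45: {"L": 1.5, "I": 1.5, "V": 1.0}})
jsonl_line = json.dumps(entry)  # write this line to a .jsonl file for the run
```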

Step 4: Energetic & Functional Filtering with Rosetta.

  • For each designed sequence, perform Rosetta Relax and Rosetta ddG (∆∆G) calculations to assess folding energy and stability.
  • Use Rosetta Enzyme Design (RosettaED) protocols to introduce and minimize the substrate in the designed active site. Calculate binding energy and theozyme constraint satisfaction metrics.
  • Filter designs based on: ∆∆G < 0 (stable), favorable binding energy, and high constraint satisfaction score.
  • Rank the top 100-500 designs.
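The Step 4 filter can be expressed as a few lines of code once the Rosetta metrics have been collected per design. This is a sketch only: the DesignScore fields and the numeric thresholds for binding energy and constraint satisfaction are illustrative assumptions, since the protocol states the criteria qualitatively.

```python
from dataclasses import dataclass


@dataclass
class DesignScore:
    name: str
    ddg: float                      # stability estimate; negative = stabilizing
    binding_energy: float           # more negative = more favorable
    constraint_satisfaction: float  # 0..1, fraction of theozyme constraints met


def filter_and_rank(designs, max_binding=-5.0, min_constraint=0.9, top_n=100):
    """Keep designs passing all three criteria, best binding energy first."""
    passed = [
        d for d in designs
        if d.ddg < 0
        and d.binding_energy <= max_binding
        and d.constraint_satisfaction >= min_constraint
    ]
    return sorted(passed, key=lambda d: (d.binding_energy, d.ddg))[:top_n]


designs = [
    DesignScore("a", -2.0, -8.0, 0.95),
    DesignScore("b", 1.0, -9.0, 0.99),   # fails: ddG is positive
    DesignScore("c", -1.0, -6.5, 0.92),
    DesignScore("d", -3.0, -7.0, 0.50),  # fails: constraints not satisfied
]
ranked = [d.name for d in filter_and_rank(designs)]
print(ranked)
```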

Step 5: Molecular Dynamics (MD) Validation.

  • Solvate and equilibrate the top 20-50 ranked designs using a molecular dynamics package.
  • Run 50-100 ns simulations to assess:
    • Structural stability (backbone RMSD).
    • Integrity of the active site geometry (distance/angle constraints of theozyme).
    • Dynamics of substrate access tunnels.
  • Select final candidates (10-100) that remain stable and maintain catalytic geometry.
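For the structural-stability criterion in Step 5, a simple plateau test on the backbone RMSD time series is often enough for triage. The sketch below compares the mean RMSD of the last window of frames with the window before it; the window size and both cutoffs are arbitrary illustrative choices, not values from the protocol.

```python
from statistics import mean


def rmsd_is_stable(rmsd_series, window=5, max_mean=3.0, max_drift=0.5):
    """True if the final window has a low mean RMSD (Å) and has stopped drifting."""
    if len(rmsd_series) < 2 * window:
        raise ValueError("need at least two windows of frames")
    last = mean(rmsd_series[-window:])
    previous = mean(rmsd_series[-2 * window:-window])
    return last <= max_mean and abs(last - previous) <= max_drift


# Hypothetical trajectories: one that plateaus, one that keeps unfolding
plateaued = [0.5, 1.0, 1.4, 1.6, 1.7, 1.8, 1.8, 1.9, 1.8, 1.9, 1.8, 1.9]
unfolding = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(rmsd_is_stable(plateaued), rmsd_is_stable(unfolding))
```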

Step 6: Experimental Expression & Testing. (See Protocol 2)

Protocol 2: High-Throughput Experimental Validation of Designed Enzymes

Objective: To express, purify, and assay computationally designed enzyme variants.

Materials:

  • Cloning: Synthetic genes (codon-optimized), expression vector (e.g., pET series), Gibson Assembly or Golden Gate cloning reagents.
  • Expression: E. coli BL21(DE3) or similar competent cells, LB broth, antibiotics, IPTG.
  • Lysis: BugBuster or sonication, lysozyme, benzonase, protease inhibitor cocktail.
  • Purification: HisTrap FF crude or Ni-NTA agarose, ÄKTA pure or FPLC system, size-exclusion chromatography (SEC) column (e.g., Superdex 75 Increase).
  • Assay: Microplate reader (UV-Vis, fluorescence), substrate, reaction buffer.

Procedure:

Step 1: Gene Synthesis & Cloning.

  • Order designed sequences as synthetic gene fragments in a cloning-compatible vector.
  • Subclone into an expression vector (e.g., pET-28a(+) for N- or C-terminal His-tag) using restriction-free or Golden Gate methods.
  • Transform into cloning strain (e.g., DH5α), sequence-verify plasmids.

Step 2: Small-Scale Expression Screening.

  • Transform verified plasmids into expression host (BL21(DE3)).
  • Inoculate 2 mL deep-well plates with cultures. Grow at 37°C to OD600 ~0.6-0.8.
  • Induce with 0.1-1.0 mM IPTG. Express at 16-20°C for 16-20 hours.
  • Pellet cells. Lyse via chemical (BugBuster) or freeze-thaw. Clarify lysates by centrifugation.
  • Perform SDS-PAGE on lysates to identify constructs expressing soluble protein.

Step 3: Purification (96-well plate or medium-scale).

  • For soluble constructs, perform immobilized metal affinity chromatography (IMAC) in a 96-well filter plate format or using 5 mL culture mini-preps.
  • Bind His-tagged protein to Ni-NTA resin in batch. Wash with 20 mM imidazole. Elute with 250 mM imidazole.
  • Desalt into assay buffer using Zeba spin plates or dialysis.

Step 4: High-Throughput Activity Assay.

  • In a 96- or 384-well plate, mix purified enzyme (10-100 µL, ~1-10 µM final) with substrate in reaction buffer.
  • Monitor reaction progress in real-time using plate reader (e.g., absorbance change, fluorescence increase).
  • Include positive (known enzyme) and negative (no enzyme, scrambled design) controls.
  • Calculate initial velocities. Identify "hits" with activity significantly above background.

Step 5: Hit Characterization.

  • Scale up expression and purification of hits (1L culture) using FPLC (IMAC followed by SEC).
  • Determine exact protein concentration (A280).
  • Perform Michaelis-Menten kinetics: vary substrate concentration, measure initial velocity. Fit data to obtain kcat and KM.
  • Validate folds using circular dichroism (CD) spectroscopy.
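The Michaelis-Menten step can be illustrated with a dependency-free fit. The sketch below uses the Hanes-Woolf linearization ([S]/v plotted against [S]) with ordinary least squares, which is a simpler stand-in for the nonlinear regression normally preferred for real, noisy data. The substrate concentrations, Vmax, KM, and enzyme concentration are synthetic.

```python
def hanes_woolf_fit(substrate, velocity):
    """Return (vmax, km) from a least-squares line through ([S], [S]/v).

    Hanes-Woolf form: [S]/v = [S]/Vmax + Km/Vmax.
    """
    x = list(substrate)
    y = [s / v for s, v in zip(substrate, velocity)]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                    # equals 1 / Vmax
    intercept = mean_y - slope * mean_x  # equals Km / Vmax
    vmax = 1.0 / slope
    return vmax, intercept * vmax


def kcat(vmax, enzyme_conc):
    """Turnover number; vmax and enzyme_conc must share concentration units."""
    return vmax / enzyme_conc


# Synthetic, noise-free data generated with Vmax = 2.0 µM/s and Km = 50 µM
substrate = [5.0, 10.0, 25.0, 50.0, 100.0, 250.0]
velocity = [2.0 * s / (50.0 + s) for s in substrate]

vmax, km = hanes_woolf_fit(substrate, velocity)
turnover = kcat(vmax, enzyme_conc=1.0)  # 1 µM enzyme, so kcat is in s⁻¹
```

With real plate-reader data, weighting and nonlinear fitting handle measurement noise better than any linearization; the version above is mainly useful as a quick sanity check.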

Visualizations

[Diagram: Theozyme Definition (QM of Reaction) → (geometric constraints) → Backbone Scaffold Generation (RFdiffusion/Rosetta) → (backbone PDBs) → Sequence Design (ProteinMPNN) → (sequence library, FASTA) → Energetic Filtering & Ranking (Rosetta ddG/Relax/ED) → (top 100 designs) → MD Validation (Stability & Dynamics) → (final 10-100 candidates) → Experimental Expression & Assay → (learning loop: data for model refinement) → back to Theozyme Definition]

De Novo Enzyme Design Computational Pipeline

[Diagram: Experimental Validation Workflow: Gene Synthesis & Cloning → Small-Scale Expression Screening → High-Throughput Purification (IMAC) → HTS Activity Assay (96/384-well) → Hit Characterization (Kinetics, CD, SEC)]

HTS Workflow for Designed Enzyme Validation

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials for De Novo Enzyme Design & Validation

Item | Supplier Examples | Function in Protocol
Rosetta Software Suite | University of Washington (academic license) | Core software for protein energy calculations, backbone generation (RosettaRemodel), and enzyme design (RosettaED). Used for filtering and ranking designs.
ProteinMPNN | GitHub Repository (Baker Lab) | Fast, robust neural network for protein sequence design. Generates foldable sequences for given backbones. Integrated into the design pipeline after scaffolding.
RFdiffusion | GitHub Repository (Baker Lab) | Diffusion model for generating de novo protein backbones conditioned on functional site (theozyme) placement. Creates the initial scaffolds.
pET Expression Vectors | Novagen (MilliporeSigma), Addgene | Standard plasmids for high-level, inducible protein expression in E. coli. Often used with His-tag for purification.
BL21(DE3) Competent Cells | New England Biolabs (NEB), Thermo Fisher | Standard E. coli strain for T7 promoter-driven protein expression. Optimized for low protease activity.
HisTrap FF crude | Cytiva | Pre-packed nickel affinity chromatography columns for fast purification of His-tagged proteins using FPLC systems (e.g., ÄKTA pure).
BugBuster Protein Extraction Reagent | MilliporeSigma | Gentle, ready-to-use detergent for lysing E. coli cells without sonication. Ideal for high-throughput, small-scale expression screening.
Zeba Spin Desalting Plates | Thermo Fisher | 96-well plates packed with size-exclusion resin for rapid buffer exchange and desalting of purified proteins prior to assay.
SpectraMax Microplate Reader | Molecular Devices | Versatile plate reader capable of absorbance, fluorescence, and luminescence detection. Essential for high-throughput enzyme kinetic assays.

The field of protein sequence design has undergone a revolutionary transformation, moving from physics-based energy minimization to data-driven generative modeling. This evolution is central to current research on ProteinMPNN for de novo enzyme sequence design. While early tools like Rosetta provided a foundational understanding of sequence-structure relationships, the advent of deep learning has dramatically increased the speed, scale, and success rate of generating functional protein sequences. This Application Note contextualizes these tools within a practical research workflow aimed at designing novel enzymatic activities.

Tool Comparison: Quantitative Performance Metrics

The following table summarizes the key characteristics and performance metrics of major protein sequence design tools, illustrating the trajectory of the field.

Table 1: Comparative Analysis of Protein Sequence Design Tools

Tool (Release Year) | Core Methodology | Key Input(s) | Key Output | Typical Design Speed | Success Rate (Native-like sequences) | Key Limitation
Rosetta de novo design (2000s) | Monte Carlo + Physics-based Force Field | Backbone Scaffold, Target Fold | Amino Acid Sequence | Minutes to hours per design | ~1-10% (highly dependent on fold complexity) | Computationally expensive; sensitive to force field inaccuracies
RFdiffusion (2022) | Diffusion Generative Model | Partial Structure, Motif Constraints | Protein Backbone Coordinates | Seconds to minutes per design | N/A (Structure generation tool) | Requires subsequent sequence design step
ProteinMPNN (2022) | Message Passing Neural Network | Protein Backbone + Optional Constraints | Amino Acid Sequence | < 1 second per design | ~50-70% (folds as designed) | Trained on native structures; limited extrapolation far from natural space
AlphaFold2 (2020) | Evoformer + Structure Module | Amino Acid Sequence | Predicted 3D Structure | Minutes per structure | High accuracy for natural sequences | Not a design tool; used for in silico validation

Application Notes: Integrating ProteinMPNN into an Enzyme Design Pipeline

Conceptual Workflow for De Novo Enzyme Design

The following diagram outlines a standard integrated pipeline for designing novel enzyme sequences, positioning ProteinMPNN as the core sequence design engine.

[Diagram: Define Catalytic Motif & Active Site Geometry → (structural constraints) → Backbone Generation (RFdiffusion) → (backbone PDB) → Sequence Design (ProteinMPNN) → (designed sequence) → Folding Validation (AlphaFold2) → (predicted structure) → Stability & Dynamics (Molecular Dynamics) → (pLDDT, RMSD, ΔG) → Experimental Filter & Cloning → (high-scoring variants) → Experimental Characterization]

Diagram Title: Integrated Computational Pipeline for De Novo Enzyme Design

ProteinMPNN-Specific Protocol: Designing Sequences for a Fixed Backbone

Protocol 1: Fixed-Backbone Sequence Design with Optional Symmetry and Residue Constraints

Objective: To generate diverse, low-energy amino acid sequences for a given protein backbone structure, incorporating research constraints such as fixed catalytic residues.

Research Reagent Solutions & Essential Materials:

Item | Function in Protocol
Input Backbone PDB File | The atomic coordinates of the target scaffold, lacking side chains beyond Cβ.
ProteinMPNN Software (v1.0) | The neural network model for calculating sequence probabilities. Available via GitHub.
Python Environment (3.8+) with PyTorch | Required runtime for executing ProteinMPNN.
Constraint Specification File (JSON/TXT) | Defines fixed positions, residue identities, or biased amino acids for design.
High-Performance Computing (HPC) Cluster or GPU | Accelerates sampling for large proteins or large numbers (e.g., 1000s) of designs.

Step-by-Step Methodology:

  • Prepare the Input Structure:

    • Obtain or generate a backbone structure (e.g., from RFdiffusion, a natural fold, or an idealized scaffold).
    • Clean the structure so that it has standard chain IDs and consecutive residue numbering (a script such as clean_pdb.py, as distributed with the Rosetta tools, can do this). ProteinMPNN uses only the backbone atoms (N, Cα, C, O) and derives a virtual Cβ from them, so any remaining side-chain atoms are ignored.
  • Define Design Constraints (Optional but Critical for Enzymes):

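    • Create a fixed-positions file (JSON Lines, passed with --fixed_positions_jsonl) listing the residues that must not be redesigned, for example positions 22 and 23 of chain A, which carry a histidine and an aspartic acid (common catalytic residues). Fixed positions keep the amino acid identities present in the input PDB, so the scaffold must already contain the intended catalytic residues at those positions.

    • For partial specification (e.g., bias towards hydrophobic residues at a core position), supply a per-residue bias file with --bias_by_res_jsonl; a global amino acid bias can be set with --bias_AA_jsonl.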
  • Execute ProteinMPNN for Sequence Sampling:

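    • Run the core design script (protein_mpnn_run.py) from the command line, passing the backbone PDB, the fixed-positions file, an output folder, the number of sequences to generate (e.g., 100), and the sampling parameters described below.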

    • Key Parameter Explanation:

      • sampling_temp: Controls diversity. Lower (0.01-0.1) for conservative, low-energy designs; higher (0.1-0.3) for more exploration.
      • batch_size: Tunes for GPU memory.
  • Output Analysis:

    • The main output is a seqs directory containing one FASTA file per input structure (e.g., my_scaffold.fa) with the designed sequences.
    • Each sequence header reports a score, the average negative log probability over the designed positions; per-residue probabilities are written only when requested (--save_probs). Lower scores correspond to higher model confidence.
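The fixed-position file and design command from the steps above can be sketched as follows. The flag names are those I recall from the ProteinMPNN repository's protein_mpnn_run.py and should be confirmed with its --help output; the file names, chain ID, and output folder are placeholders. Positions in the fixed-positions file are, as far as I know, 1-indexed along each chain rather than taken from PDB residue numbers.

```python
import json
import shlex


def fixed_positions_entry(pdb_name, fixed):
    """fixed maps chain ID -> 1-indexed positions whose identity is kept."""
    return {pdb_name: {chain: sorted(pos) for chain, pos in fixed.items()}}


def mpnn_command(pdb_path, chains, fixed_jsonl, out_folder,
                 num_seqs=100, temp=0.1, batch_size=1):
    """Assemble the argument list for a single-structure design run."""
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", pdb_path,
        "--pdb_path_chains", chains,
        "--fixed_positions_jsonl", fixed_jsonl,
        "--out_folder", out_folder,
        "--num_seq_per_target", str(num_seqs),
        "--sampling_temp", str(temp),
        "--batch_size", str(batch_size),
    ]


# One JSON object per line; the key must match the PDB file name without ".pdb"
entry_line = json.dumps(fixed_positions_entry("my_scaffold", {"A": [22, 23]}))
command = mpnn_command("my_scaffold.pdb", "A", "fixed_positions.jsonl", "mpnn_out")

print(entry_line)
print(shlex.join(command))  # pass `command` to subprocess.run to execute it
```

Building the command as a list keeps quoting explicit and makes it easy to launch many backbones from a loop or a job scheduler.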

Advanced Protocol: Iterative Design-Validate-Refine Cycle

Protocol 2: In Silico Validation and Selection Pipeline

Objective: To filter thousands of ProteinMPNN-generated sequences via computational checks before experimental testing, maximizing the probability of functional enzymes.

Workflow Diagram:

[Diagram: ProteinMPNN Sequence Pool (N=1000) → Structure Prediction (AlphaFold2/ColabFold) → Compute Metrics → Filter 1: pLDDT > 80 & RMSD < 2.0 Å → (top 200 sequences) → Short MD Simulation (Stability Check) → Filter 2: ΔG Stability & Active Site Integrity → Final Candidate Set (N=20-50)]

Diagram Title: Computational Filtration Workflow for Designed Sequences

Methodology Steps:

  • High-Throughput Folding with AlphaFold2/ColabFold:

    • Input the FASTA file from ProteinMPNN into ColabFold (local or cloud version) for batch processing. Use the --amber and --templates flags for higher quality.
    • Extract the predicted Local Distance Difference Test (pLDDT) score (per-residue and global average) and the predicted Aligned Error (PAE).
  • Primary Filtering Based on Folding Metrics:

    • Criterion 1: Global average pLDDT > 80. This indicates high per-residue confidence.
    • Criterion 2: Backbone Root-Mean-Square Deviation (RMSD) < 2.0 Å between the designed target backbone and the AF2-predicted structure. Ensures the design folds as intended.
    • Retain the top 10-20% of sequences passing these filters.
  • Secondary Filtering via Molecular Dynamics (MD):

    • Solvate and minimize the top-scoring predicted structures in explicit solvent (e.g., TIP3P water).
    • Run a short (10-50 ns) equilibrium simulation in a common MD package (e.g., GROMACS, OpenMM).
    • Analyze trajectories for:
      • Overall stability (Cα RMSD plateau).
      • Preservation of active site geometry (distance/orientation of fixed catalytic residues).
      • Approximate stability estimates from end-point energy calculations (e.g., MM-PBSA, which is more commonly applied to binding than to folding free energies).
  • Final Selection:

    • Select 20-50 sequences that pass all computational filters for gene synthesis and experimental expression. Prioritize sequence diversity to sample different regions of sequence space.
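The primary filter above needs a global pLDDT per design. AlphaFold2 and ColabFold write per-residue pLDDT into the B-factor column of their PDB output, so the average can be read without extra tooling. A minimal sketch, assuming standard fixed-column ATOM records; the atom_line helper exists only to build demo input, and the RMSD value is assumed to come from a separate superposition tool.

```python
def mean_plddt(pdb_text):
    """Average the B-factor field (columns 61-66) over Cα atoms."""
    values = [
        float(line[60:66])
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not values:
        raise ValueError("no CA atoms found")
    return sum(values) / len(values)


def passes_primary_filter(plddt, rmsd, min_plddt=80.0, max_rmsd=2.0):
    """Criteria 1 and 2 from the protocol: pLDDT > 80 and backbone RMSD < 2.0 Å."""
    return plddt > min_plddt and rmsd < max_rmsd


def atom_line(serial, name, resname, chain, resseq, bfactor):
    """Demo helper: a fixed-column ATOM record with coordinates set to zero."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{bfactor:6.2f}")


demo = "\n".join([
    atom_line(1, " N  ", "MET", "A", 1, 50.0),   # ignored: not a Cα atom
    atom_line(2, " CA ", "MET", "A", 1, 90.0),
    atom_line(6, " CA ", "LYS", "A", 2, 80.0),
])
plddt = mean_plddt(demo)
print(plddt, passes_primary_filter(plddt, rmsd=1.2))
```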

The evolution from Rosetta to neural networks like ProteinMPNN represents a shift from precise, laborious calculation to rapid, intelligent sampling. For de novo enzyme design, ProteinMPNN is not used in isolation but as a powerful component within a larger pipeline that includes structural generation (RFdiffusion) and rigorous in silico validation (AlphaFold2, MD). This integrated approach, leveraging the strengths of each tool, significantly accelerates the design-test cycle, bringing the goal of rationally engineered enzymes closer to reality.

Application Notes

ProteinMPNN is a robust, message-passing neural network for protein sequence design. Developed as a successor to sequence-design tools like Rosetta and ProteinGAN, it addresses the inverse folding problem: given a protein backbone structure, predict an amino acid sequence that will fold into that structure. Its primary application within de novo enzyme design is to generate highly stable, diverse, and functional sequences that adopt a specified catalytic scaffold, thereby accelerating the creation of novel biocatalysts.

The network's performance is benchmarked on protein structure recovery tasks, demonstrating state-of-the-art performance across diverse protein folds.

Table 1: ProteinMPNN Performance Metrics on CATH 4.2 Test Set

Metric | ProteinMPNN (Reported) | Baseline (e.g., Rosetta) | Notes
Sequence Recovery (%) | 52.4% | 32.9% | Percentage of native amino acids recovered on native backbones.
Perplexity | 6.5 | >15 | Lower perplexity indicates higher confidence and accuracy.
Design Speed | ~200 seqs/second | ~1 seq/hour | Enables high-throughput in silico sequence generation.
Native Sequence Rank | Top-10 for >80% of proteins | Lower | Native sequence is often among the top-scoring predictions.
Diversity (pLDDT > 70) | High | Moderate | Generates many high-confidence, stable sequences.
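Perplexity in the table is the exponential of the average negative log-probability the model assigns to the true residues. A short sketch of that definition, using natural logarithms and made-up probabilities:

```python
import math


def perplexity(log_probs):
    """exp(mean negative log-probability); log_probs are natural logs per residue."""
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that spreads probability uniformly over 20 amino acids
uniform = [math.log(1 / 20)] * 50
# A model that always assigns probability 0.5 to the correct residue
confident = [math.log(0.5)] * 10

print(perplexity(uniform), perplexity(confident))
```

A perplexity of 20 therefore corresponds to guessing uniformly among the 20 amino acids, which gives a useful scale for reading the reported values.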

Table 2: Key Architectural Hyperparameters

Component | Setting / Value | Function
Encoder Layers | 3 | Encodes geometric and chemical features of the backbone.
Decoder Layers | 3 | Autoregressively decodes (predicts) the amino acid sequence.
Hidden Dimension | 128 | Size of the latent node and edge representations.
Attention Heads | 16 | Number of heads in the message-passing attention mechanism.
Training Epochs | ~100 | Trained on ~18,000 high-resolution PDB structures.

Experimental Protocols

Protocol: Generating Sequences for a Target Enzyme Scaffold Using ProteinMPNN

Objective: To use ProteinMPNN to design novel amino acid sequences that are predicted to fold into a given enzyme backbone structure (e.g., a TIM-barrel for a novel hydrolase).

Materials & Software:

  • Target protein backbone file (.pdb format).
  • ProteinMPNN installation (via GitHub: https://github.com/dauparas/ProteinMPNN).
  • Python environment (>=3.8, with PyTorch).
  • AlphaFold2 or RoseTTAFold installation for in silico validation.

Procedure:

  • Input Preparation:
    • Clean your target .pdb file. Remove heteroatoms, water molecules, and alternative conformations. Keep only the backbone atoms (N, CA, C, O) and CB for each residue if available.
    • Define fixed and mutable positions. For enzyme design, catalytic residues and key structural motifs (e.g., disulfide bonds) are often fixed. Record the fixed positions in a JSON Lines file (passed to ProteinMPNN with --fixed_positions_jsonl); all other positions on the designed chains are redesigned.
  • Run ProteinMPNN:

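    • Execute the main design script (protein_mpnn_run.py) from the command line, passing the cleaned PDB, the fixed-positions file, an output folder, and the sampling parameters below.

    • Key Parameters: --num_seq_per_target sets the number of sequences generated per backbone (500 in this protocol); --sampling_temp (typically 0.1-0.15) trades diversity against confidence, with lower temperatures yielding more conservative designs.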

  • Post-Processing and Filtering:

    • The output is a FASTA file with 500 designed sequences.
    • Filter sequences based on ProteinMPNN's per-residue confidence scores (log probabilities). Discard sequences with many low-probability residues.
    • Cluster sequences (e.g., using MMseqs2) at ~60-70% identity to select a diverse subset (e.g., 50-100 sequences).
  • In Silico Validation (Essential for Thesis Research):

    • Folding Prediction: Use AlphaFold2 or ESMFold to predict the 3D structure of each filtered designed sequence.
    • Structural Alignment: Superimpose the predicted structure (model_predicted.pdb) onto the original target scaffold (scaffold_target.pdb) using TM-align or PyMOL. Calculate the Root-Mean-Square Deviation (RMSD) of the backbone atoms.
    • Stability Assessment: Use predictors like pLDDT (from AlphaFold2) or Rosetta ddG to estimate folding stability.
    • Function Prediction: For enzymes, use tools like DeepFRI or CLEAN to predict Enzyme Commission (EC) numbers from the designed sequence.
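The post-processing step (score filtering followed by diversity selection) can be prototyped before bringing in MMseqs2. The sketch below assumes FASTA headers made of comma-separated key=value fields such as sample= and score=, which matches my understanding of ProteinMPNN output; the greedy identity filter is a simplified stand-in for real clustering and assumes equal-length designs. The example records and thresholds are invented.

```python
def parse_fasta(text):
    """Return a list of (header fields, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    parsed = []
    for head, seq in records:
        pairs = [p.strip().split("=", 1) for p in head.split(",") if "=" in p]
        parsed.append((dict(pairs), seq))
    return parsed


def identity(a, b):
    """Fraction of identical positions between two ungapped sequences."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def select_diverse(parsed, max_score=1.0, max_identity=0.7):
    """Greedy pick: best score first, skipping near-duplicates of chosen designs."""
    # Records without a sample= field (the native/input entry) are ignored
    scored = [(float(f["score"]), s) for f, s in parsed if "sample" in f and "score" in f]
    chosen = []
    for score, seq in sorted(scored):
        if score <= max_score and all(identity(seq, c) <= max_identity for c in chosen):
            chosen.append(seq)
    return chosen


example = """>my_scaffold, score=1.9000, global_score=1.9000, designed_chains=['A'], seed=37
GGGGGGGGGG
>T=0.1, sample=1, score=0.8000, global_score=0.9000, seq_recovery=0.3000
MKTAYIAKQR
>T=0.1, sample=2, score=0.7000, global_score=0.8500, seq_recovery=0.3000
MKTAYIAKQK
>T=0.1, sample=3, score=1.5000, global_score=1.6000, seq_recovery=0.2000
AAAAAAAAAA
>T=0.1, sample=4, score=0.9000, global_score=0.9500, seq_recovery=0.2000
LLEELLKKAG
"""
chosen = select_diverse(parse_fasta(example))
print(chosen)
```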

Protocol: Fine-Tuning ProteinMPNN on Enzyme Families

Objective: To specialize the general ProteinMPNN model for a specific enzyme fold (e.g., flavin-dependent monooxygenases) to improve design quality for that class.

Procedure:

  • Curate a Custom Dataset: From the PDB, collect all high-resolution (<2.5 Å) structures belonging to your target enzyme fold. Split into training (80%), validation (10%), and test (10%) sets.
  • Prepare Data in ProteinMPNN Format: Convert each .pdb to the required feature format (backbone coordinates, edges, etc.) using the provided preprocessing scripts.
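  • Transfer Learning: Load the pre-trained ProteinMPNN weights. The output layer can usually be kept as is, because the task (per-residue amino acid prediction over the same alphabet) does not change; replace it only if the output vocabulary differs.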
  • Training Loop: Train the model on your custom dataset, monitoring validation loss to avoid overfitting. Use a low learning rate (e.g., 1e-5).
  • Evaluation: Benchmark the fine-tuned model on the held-out test set and compare sequence recovery and perplexity against the base model.
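The 80/10/10 split in the first step is easy to get subtly wrong if close homologs end up on both sides of the split. The sketch below splits at the level of cluster identifiers rather than individual structures; clustering by sequence identity is an added assumption here, not something the protocol specifies, and the cluster names are placeholders.

```python
import random


def split_dataset(cluster_ids, train=0.8, val=0.1, seed=0):
    """Shuffle unique cluster IDs reproducibly and cut into train/val/test."""
    ids = sorted(set(cluster_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Hypothetical cluster labels, e.g., from clustering the curated structures
clusters = [f"cluster_{i:03d}" for i in range(100)]
train_ids, val_ids, test_ids = split_dataset(clusters)
print(len(train_ids), len(val_ids), len(test_ids))
```

Every structure then inherits the partition of its cluster, so the test-set recovery and perplexity are measured on folds the fine-tuned model has not effectively seen.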

Design Workflow Visualization

ProteinMPNN Design Workflow Overview

[Diagram: Target Enzyme Scaffold (Backbone PDB) → ProteinMPNN Sequence Design → Pool of Designed Sequences (FASTA) → Filter by MPNN Confidence → Cluster for Diversity → In Silico Folding (AlphaFold2). Low RMSD and high pLDDT → Stable, Scaffold-Matching Sequence. High RMSD or low pLDDT → Re-design mutable positions → back to ProteinMPNN Sequence Design]

Enzyme Design and Validation Workflow

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Resources for ProteinMPNN-Driven Enzyme Design

Item / Resource | Category | Function & Relevance
ProteinMPNN Software | Computational Tool | Core sequence design engine. Provides command-line interface for design and fine-tuning.
AlphaFold2 / ColabFold | Validation Tool | Critical for in silico validation. Predicts the 3D structure of designed sequences to verify fold fidelity.
PyRosetta | Modeling Suite | Used for advanced structural analysis, energy scoring (ddG), and complementary design approaches.
Custom Enzyme PDB Dataset | Training Data | For fine-tuning ProteinMPNN. Requires carefully curated, non-redundant structures of target enzyme fold.
MMseqs2 / CD-HIT | Bioinformatics Tool | Clusters designed sequences to ensure diversity before costly experimental validation.
TM-align / PyMOL | Structural Analysis | Calculates RMSD between designed and target scaffolds to quantify design success.
NVIDIA GPU (A100/V100) | Hardware | Accelerates both ProteinMPNN design and subsequent AlphaFold2 validation steps.
Gene Synthesis Service | Wet-Lab Reagent | Converts top-ranking, in silico validated designs into synthetic genes and plasmids for expression.
HEK293 or E. coli Expression System | Wet-Lab Reagent | Standard protein expression systems to produce and purify designed enzyme variants.
Activity Assay Kits (e.g., Fluorogenic Substrates) | Wet-Lab Reagent | Validates the catalytic function of the expressed, designed enzymes.

1.0 Application Notes: Core Functional Distinctions

ProteinMPNN and AlphaFold represent two distinct, non-competing paradigms in computational protein science. AlphaFold is a structure prediction tool that infers a protein's 3D conformation from its amino acid sequence. In contrast, ProteinMPNN is an inverse folding or sequence design tool that predicts amino acid sequences likely to fold into a given 3D protein backbone structure. Within a thesis on de novo enzyme design, AlphaFold is used to validate proposed structures, while ProteinMPNN is used to generate viable sequences for a target functional scaffold.

Table 1: Quantitative Comparison of Core Functions

Feature | AlphaFold2 | ProteinMPNN
Primary Task | Sequence → Structure (Prediction) | Structure → Sequence (Design)
Typical Input | Amino acid sequence (string) | Protein backbone coordinates (PDB)
Typical Output | Predicted 3D coordinates, per-residue confidence (pLDDT) | One or multiple plausible amino acid sequences
Key Model Architecture | Evoformer & Structure Module (Transformer-based) | Message-Passing Neural Network (MPNN)
Inference Speed | Minutes to hours per target | ~200 sequences/second (for ~100 aa)
Training Data | PDB & UniProt (sequences & MSA) | Native protein structures from PDB
Role in Enzyme Design | Validation & Analysis: Assess folding of designed sequences. | Generation: Create sequences for a target active site geometry.

2.0 Protocols for Integrated Use in De Novo Enzyme Design

Protocol 2.1: Iterative Sequence Design & Validation Cycle

This protocol outlines the core experimental-computational pipeline for de novo enzyme design.

Materials & Reagent Solutions (The Scientist's Toolkit):

  • Target Scaffold (PDB File): A backbone structure, often an idealized fold or a redesigned natural scaffold, lacking side-chain identities.
  • ProteinMPNN (v1.1 or later): Locally installed or accessed via web server for sequence generation.
  • AlphaFold2 or AlphaFold3: For structure prediction, accessible via local ColabFold implementation or public server.
  • ROSETTA or FoldX: For side-chain packing, energy scoring, and structural refinement.
  • Cloning & Expression Kit (e.g., NEB Gibson Assembly, T7 Expression System): For synthesizing and expressing designed gene sequences.
  • Analytical Size-Exclusion Chromatography (SEC): To assess solution-state oligomerization and aggregation.
  • Circular Dichroism (CD) Spectrometer: For rapid assessment of secondary structure content and thermal stability.
  • Fluorometric Activity Assay: Custom assay using a fluorogenic substrate analog to probe designed enzyme function.

Procedure:

  • Input Preparation: Prepare a clean backbone PDB file. Define fixed and designed positions (e.g., fix catalytic triad residues, design surrounding pocket).
  • Sequence Generation with ProteinMPNN:
    • Run ProteinMPNN with the backbone, specifying designable positions.
    • Generate 100-500 sequences. Use temperature parameter (e.g., 0.1 for conservative, 0.3 for diverse sampling).
    • Output: A FASTA file of candidate sequences.
  • Folding Validation with AlphaFold:
    • Input candidate sequences into AlphaFold/ColabFold (e.g., colabfold_batch with --num-recycle 3 --num-models 5).
    • Analyze the predicted structures. Filter sequences where the top-ranked model (highest pLDDT) recapitulates the target backbone (RMSD < 2.0 Å).
  • Energy Scoring & Filtering:
    • Use ROSETTA's ddg_monomer or FoldX to calculate stability energy (ΔΔG) for designed sequences threaded onto the scaffold.
    • Filter for sequences with favorable folding energy (ΔΔG < 0).
  • Experimental Characterization:
    • Synthesize genes for top 5-10 designs, express in E. coli, and purify via affinity chromatography.
    • Perform SEC and CD to confirm monodispersity and proper folding.
    • Test activity using the fluorometric assay.
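The effect of the temperature parameter in the sequence-generation step can be seen on a toy example. Sampling temperature divides the model's logits before the softmax, so low values concentrate probability on the top-scoring amino acid and higher values flatten the distribution. The three logits below are arbitrary and stand in for the per-position amino acid scores.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate amino acids
cold = softmax_with_temperature(logits, 0.1)  # conservative sampling
warm = softmax_with_temperature(logits, 0.3)  # more diverse sampling

print([round(p, 4) for p in cold])
print([round(p, 4) for p in warm])
```

At 0.1 nearly all probability mass sits on the best-scoring residue, whereas at 0.3 the alternatives are sampled noticeably more often, which is the diversity-versus-confidence trade-off described in the protocol.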

Diagram: Enzyme Design Workflow

[Diagram: Target Backbone (PDB) → ProteinMPNN (Sequence Design) → Pool of Candidate Sequences (FASTA) → AlphaFold (Folding Validation) → Filter: pLDDT > 80 & RMSD < 2.0 Å → ROSETTA/FoldX (Energy Scoring) → Filter: ΔΔG < 0 → Top Designed Sequences → Experimental Characterization → (if failed) → Iterative Redesign → (refine backbone or positions) → back to Target Backbone (PDB)]

Title: Computational-Experimental Design Pipeline

Protocol 2.2: Assessing Sequence-Structure Compatibility

This protocol quantitatively compares ProteinMPNN's recovery of native-like sequences versus AlphaFold's recovery of native-like structures.

Procedure:

  • Dataset Curation: Select a non-redundant set of 100 high-resolution (<2.0 Å) enzyme structures from the PDB.
  • Native Sequence Recovery (ProteinMPNN):
    • For each native structure, strip side-chain identities (keep Cα, C, N, O).
    • Input backbone into ProteinMPNN to predict the optimal sequence.
    • Calculate % recovery of the true native amino acids at each position.
    • Expected Result: ProteinMPNN typically achieves ~40-60% native sequence recovery on native backbones.
  • Native Structure Recovery (AlphaFold):
    • For each corresponding native amino acid sequence, run AlphaFold2.
    • Compare the top-ranked predicted structure to the experimental (native) structure using TM-score and Cα-RMSD.
    • Expected Result: AlphaFold2 typically achieves TM-score >0.9 (near-perfect) for most single-domain proteins.
  • Analysis: Tabulate results. This experiment highlights the asymmetry: a given sequence strongly dictates structure (AlphaFold's high accuracy), but a single structure can be encoded by many sequences (ProteinMPNN's diverse output).
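
The recovery metric in this protocol is the percentage of positions where the designed residue matches the native one. A minimal sketch, assuming the two sequences are already aligned position by position (the function name and example sequences are illustrative):

```python
def sequence_recovery(native, designed):
    """Percent of positions where the designed residue equals the native residue.

    Positions where the native sequence has a gap ('-') or unknown residue ('X')
    are skipped, since there is no native identity to recover there.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    pairs = [(n, d) for n, d in zip(native, designed) if n not in "-X"]
    if not pairs:
        raise ValueError("no comparable positions")
    matches = sum(n == d for n, d in pairs)
    return 100.0 * matches / len(pairs)


# Illustrative 10-residue example with two substitutions.
native = "MQIFVKTLTG"
designed = "MKIFVKTLDG"
print(sequence_recovery(native, designed))
```

Averaging this value over all structures in the curated set gives the dataset-level recovery that the protocol tabulates.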

Diagram: Logic of the Inverse Folding Problem

Title: Sequence-Structure Relationship Mapping

Table 2: Typical Protocol Output Metrics

Protocol | Primary Metric (ProteinMPNN) | Primary Metric (AlphaFold) | Success Threshold (Typical)
2.1: Design Cycle | Sequence Diversity & Energy Score | pLDDT & RMSD to Target | pLDDT > 80, RMSD < 2.0 Å
2.2: Compatibility | Native Sequence Recovery (%) | TM-score vs. Native Structure | Recovery ~52%, TM-score >0.9

This integrated framework positions ProteinMPNN as the generative engine for sequence space exploration, with AlphaFold serving as a critical in silico validator, forming a closed-loop pipeline for actionable de novo enzyme design.

This application note details the essential prerequisites for de novo enzyme design using ProteinMPNN, framed within a broader thesis on advancing machine-learning-driven protein engineering. The successful application of ProteinMPNN for generating functional enzyme sequences is contingent upon the careful preparation of input scaffolds and the rigorous evaluation of output sequence proposals. This document provides current protocols and specifications to guide researchers in structuring their design campaigns.

Required Inputs: Backbone Scaffolds

The primary input for ProteinMPNN is a fixed protein backbone scaffold. The quality and appropriateness of this scaffold directly determine the feasibility and quality of the proposed sequences.

Table 1: Essential Characteristics of Input Backbone Scaffolds

Parameter | Specification | Rationale & Impact on Output
Source | Solved crystal/NMR structures, high-quality AlphaFold2 or RoseTTAFold predictions, or designed de novo folds. | Defines the target topology. Experimental structures are preferred; predicted structures require high pLDDT confidence (>85) in core regions.
Format | PDB file format (standard). | The standard input format for ProteinMPNN and related structure analysis tools.
Chain Handling | Single chain or multi-chain complexes, with chains explicitly defined. | ProteinMPNN can design for specific chains, enabling interface design.
Completeness | No missing backbone heavy atoms (N, Cα, C, O). Missing side chains are acceptable. | The neural network operates on defined backbone coordinates. Gaps will cause errors.
Fixed Positions | A user-defined list of residue indices that will remain unchanged (e.g., catalytic triads, binding site anchors, capping residues). | Critical for preserving functional motifs or structural integrity. Defined via a list or a mask string.
Designed Positions | A user-defined list of residue indices to be redesigned. | Enables global or local sequence design. Typically, all non-fixed positions are designated for design.
Secondary Structure | Should match the intended design (e.g., catalytic pockets often reside in loops between defined secondary elements). | Scaffold must spatially position functional elements correctly.

Protocol 2.1: Preparing a Backbone Scaffold for ProteinMPNN Input

  • Obtain Structure: Source a PDB file (e.g., 7BEN.pdb) from the RCSB PDB or generate one from a prediction server.
  • Clean the File: Remove water molecules, heteroatoms (unless critical metal ions), and alternative conformations using molecular visualization software (PyMOL, ChimeraX).
  • Define Chains: Ensure chain identifiers (A, B, etc.) are correct for multi-chain designs.
  • Identify Fixed Residues:
    • Analyze the scaffold to identify residues critical for function (e.g., catalytic residues, cofactor binders) or structure (e.g., disulfide-bonded cysteines, prolines in turns).
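    • Create a list of these residue numbers (e.g., [55, 87, 142]) or, for bookkeeping, a mask string where 'F' denotes fixed and 'T' denotes designed (e.g., 'FFTTTTTTFF'). ProteinMPNN itself takes fixed positions as a per-chain dictionary passed with --fixed_positions_jsonl, so the list or mask must be converted to that form before the run.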
  • Validate Backbone Geometry: Use MolProbity or PHENIX to check for Ramachandran outliers and severe clashes. Repair drastic outliers, as they represent unrealistic geometries.
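
The fixed-residue step can be made concrete with a short conversion script. The sketch below turns the 'F'/'T' mask string from this protocol into a list of positions and then into a one-line JSON record. It assumes that ProteinMPNN's `--fixed_positions_jsonl` input maps a PDB name to a per-chain list of 1-based positions counted along the chain; that layout should be checked against the helper scripts in the repository. The function names and the name `my_scaffold` are hypothetical.

```python
import json


def fixed_positions_from_mask(mask):
    """Return the 1-based positions marked 'F' (fixed) in an 'F'/'T' mask string."""
    unexpected = set(mask) - {"F", "T"}
    if unexpected:
        raise ValueError(f"unexpected mask characters: {sorted(unexpected)}")
    return [i for i, flag in enumerate(mask, start=1) if flag == "F"]


def fixed_positions_record(pdb_name, per_chain):
    """Serialize {chain: [positions]} as one JSON line keyed by the PDB name.

    Assumed layout: {"<pdb name>": {"<chain>": [1-based positions, ...]}}.
    """
    return json.dumps({pdb_name: {c: sorted(p) for c, p in per_chain.items()}})


mask = "FFTTTTTTFF"
positions = fixed_positions_from_mask(mask)
line = fixed_positions_record("my_scaffold", {"A": positions})
print(line)
```

Note that positions counted along the chain can differ from PDB residue numbers when numbering does not start at 1 or contains gaps, which is one reason to renumber scaffolds before design.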

Start: Source Structure → Clean PDB File (Remove H2O, ligands) → Define Chain Identifiers → Analyze Functional/Structural Motifs → Generate List of Fixed Residues → Validate Backbone Geometry → End: Validated Scaffold Ready for ProteinMPNN

Title: Workflow for Preparing a Backbone Scaffold.

Generated Outputs: Sequence Proposals

ProteinMPNN generates multiple sequence proposals (variants) that are predicted to fold into the input backbone scaffold.

Table 2: Characteristics and Evaluation Metrics for Output Sequence Proposals

Output Component | Description | Typical Range/Format
Designed Sequences | Amino acid sequences (FASTA format) for the designed positions. | Multiple sequences per run (e.g., 8, 100, or 1000).
Sequence Score | The model's score for each sequence, reported as the negative log-probability averaged over the designed residues. Lower values indicate higher model confidence. | Positive values; compare scores only between sequences designed for the same backbone.
Amino Acid Probabilities | For each position, the probability distribution over all 20 amino acids. | Provided in parsed output files (e.g., .npz format).
Sequence Diversity | Measured by pairwise identity between generated sequences. Can be controlled by sampling temperature (T parameter). | Low T (e.g., 0.1): low diversity, high probability. High T (e.g., 0.5): high diversity.

Protocol 3.1: Generating and Parsing ProteinMPNN Outputs

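  • Run ProteinMPNN: Execute protein_mpnn_run.py via command line or script, passing the scaffold PDB, an output folder, the number of sequences to generate, and the sampling temperature.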

  • Parse Output Files: Key files in the results folder:

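    • seqs/my_scaffold.fa: FASTA of designed sequences, with each header carrying the sequence score.
    • scores/my_scaffold.npz and probs/my_scaffold.npz: NumPy files containing sequence scores, log probabilities, and amino acid probabilities (written only when score and probability saving are enabled for the run).
  • Initial Filtering: Filter sequences based on:
    • Total sequence score (select the best 10-20%, i.e., the lowest scores, since the score is a negative log-probability).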
    • Absence of proline/glycine in disallowed secondary structures (if known).
    • Preservation of desired residue properties (e.g., charge, hydrophobicity) in key regions.
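
The score-based filtering step can be sketched as a small FASTA parser. This sketch assumes ProteinMPNN-style headers of the form `>T=0.1, sample=1, score=..., global_score=..., seq_recovery=...`, with the first record holding the input sequence rather than a design. The header layout should be confirmed against real output, and the sequences and scores below are made up for illustration.

```python
import re

# Match "score=" but not "global_score=".
SCORE_RE = re.compile(r"(?<![A-Za-z_])score=([0-9.]+)")


def parse_designs(fasta_text):
    """Return (header, sequence, score) tuples for sampled designs only."""
    records, header, seq = [], None, []
    for line in fasta_text.splitlines() + [">"]:  # trailing ">" flushes the last record
        if line.startswith(">"):
            if header is not None and "sample=" in header:
                match = SCORE_RE.search(header)
                if match:
                    records.append((header, "".join(seq), float(match.group(1))))
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    return records


def top_fraction(records, fraction=0.2):
    """Keep the best-scoring fraction; lower score means higher model confidence."""
    ranked = sorted(records, key=lambda r: r[2])
    return ranked[:max(1, int(len(ranked) * fraction))]


example = """>my_scaffold, score=1.6012, global_score=1.6012, designed_chains=['A']
MKTAYIAKQR
>T=0.1, sample=1, score=0.8123, global_score=0.8123, seq_recovery=0.4000
MKSAYLAKEL
>T=0.1, sample=2, score=0.7450, global_score=0.7450, seq_recovery=0.5000
MKTPYIAREL
"""
designs = parse_designs(example)
best = top_fraction(designs, fraction=0.5)
print(best[0][1], best[0][2])
```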

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials and Tools for ProteinMPNN-Based Enzyme Design

Reagent / Tool | Supplier / Source | Function in Workflow
ProteinMPNN Software | GitHub Repository (https://github.com/dauparas/ProteinMPNN) | Core neural network for sequence design.
PyMOL or ChimeraX | Schrödinger / UCSF | Visualization, PDB file cleaning, and structural analysis.
AlphaFold2 Colab | DeepMind / Colab | Generating high-confidence predicted structures for novel scaffolds.
Rosetta Software Suite | University of Washington | For energy minimization of input scaffolds and in silico folding validation (ddG calculation) of output sequences.
MolProbity Server | Duke University | Validation of input scaffold geometry (Ramachandran, clashes).
PyTorch & Dependencies | PyTorch.org | Required machine learning framework to run ProteinMPNN.
Custom Python Scripts | In-house development | For parsing outputs, generating sequence masks, and batch analysis.
Gene Synthesis Services | Twist Bioscience, GenScript, etc. | Converting in silico sequence proposals into physical DNA for experimental testing.

Protocol 4.1: Integrated Workflow from Scaffold to Experimental Test

  • Scaffold Inception: Define catalytic geometry and fold. Source or generate a backbone scaffold (Protocol 2.1).
  • Sequence Design: Run ProteinMPNN with defined fixed residues to generate 100-1000 sequence proposals.
  • In Silico Downselection:
    • Filter by ProteinMPNN score (Protocol 3.1).
    • Use AlphaFold2 to predict the structure of each filtered sequence de novo.
    • Compute the RMSD between the AF2 prediction and the original target scaffold. Select sequences with low RMSD (<1.5Å).
    • Optionally, use Rosetta relax/ddg to estimate folding stability.
  • Experimental Validation: Synthesize genes for 5-20 top designs, express in a suitable host (e.g., E. coli), purify, and assay for target enzyme activity.

Thesis Aim: De Novo Enzyme Design → Required Input: Curated Backbone Scaffold → (PDB + fixed residues) ProteinMPNN Sequence Proposal Engine → (scores & probabilities) Generated Output: Sequence Proposals (FASTA) → (top 100 sequences) In Silico Filtering (AF2, Rosetta) → (5-20 designs) Experimental Expression & Assay → Functional Enzyme Design & Thesis Conclusion

Title: Logical Flow of ProteinMPNN Enzyme Design Thesis.

Foundational Research and Benchmark Studies Establishing ProteinMPNN's Efficacy

Within the broader thesis on utilizing ProteinMPNN for de novo enzyme sequence design, establishing its foundational efficacy is paramount. This application note synthesizes key benchmark studies that validated ProteinMPNN as a superior neural network for protein sequence design, enabling robust downstream research in enzyme engineering and therapeutic development.

Key Benchmark Findings

The primary validation study by Dauparas et al. (2022) demonstrated ProteinMPNN's state-of-the-art performance across multiple challenging design tasks. Quantitative results are summarized below.

Table 1: ProteinMPNN Benchmark Performance Summary

Benchmark Task | Metric | ProteinMPNN Result | Baseline Result | Key Implication
Native Sequence Recovery | Recovery on PDB structures | 52.4% | 32.9% (Rosetta) | Superior capture of native sequence constraints.
Fixed-Backbone Design | Success Rate (≤2 Å RMSD) | 62.5% | 46.5% (RFdesign) | Higher reliability in core enzyme design scenarios.
Symmetric Oligomer Design | Experimental Validation Success | 18/24 (75%) | Not Systematically Reported | Robust design of complex quaternary structures.
Binding Motif Scaffolding | Success Rate (≤2 Å RMSD) | 87.5% | 72.5% (RFdesign) | Effective for designing functional enzyme active sites.
Inverse Folding Speed | Sequences per Second (GPU) | ~100 | ~1 (RFdesign) | Enables large-scale library generation for enzyme screening.

Experimental Protocol: Fixed-Backbone Sequence Redesign

This protocol details the core benchmark experiment for evaluating sequence recovery and design accuracy.

Objective: To redesign amino acid sequences for a given protein backbone structure and evaluate recovery of the native sequence and structural fidelity.

Materials & Reagents:

  • Input Data: Target protein backbone structure in PDB format (e.g., 1ubq.pdb for ubiquitin).
  • Software: ProteinMPNN installed via provided GitHub repository.
  • Computing Environment: GPU (e.g., NVIDIA V100, A100) recommended for batch processing.
  • Analysis Tools: PyMOL, RoseTTAFold2 or AlphaFold2 for structure prediction, PyRosetta for RMSD calculation.

Procedure:

  • Data Preparation: Isolate the target chain and clean the PDB file, removing heteroatoms and ensuring standard atom names.
  • Run ProteinMPNN: Execute protein_mpnn_run.py on the cleaned PDB, writing results to an output folder (designed sequences are saved under seqs/).
  • Sequence Analysis: Calculate native sequence recovery from the generated sequences (seqs/1ubq.fa).
  • Structure Validation: For each designed sequence:
    • Predict the structure using AlphaFold2 or RoseTTAFold2 from the designed sequence alone (single-sequence mode, without templates), so that the original backbone cannot leak into the prediction.
    • Superimpose the predicted structure onto the original backbone using Cα atoms.
    • Calculate the Cα root-mean-square deviation (RMSD).
  • Success Criterion: A design is considered successful if the RMSD ≤ 2.0 Å, indicating the sequence folds into the intended backbone.
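
The superposition and RMSD steps can be sketched with NumPy using the Kabsch algorithm. This is a minimal sketch, assuming the two Cα coordinate arrays are already matched residue by residue. The test coordinates are synthetic, and dedicated tools such as PyRosetta or Biopython can be used instead.

```python
import numpy as np


def kabsch_rmsd(mobile, target):
    """C-alpha RMSD after optimal rigid-body superposition of mobile onto target.

    Both inputs are (N, 3) arrays with matching residue order.
    """
    P = np.asarray(mobile, dtype=float)
    Q = np.asarray(target, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("inputs must both have shape (N, 3)")
    Pc = P - P.mean(axis=0)  # remove translation
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc            # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = Pc @ R.T
    return float(np.sqrt(((P_rot - Qc) ** 2).sum(axis=1).mean()))


# Synthetic check: a rotated and translated copy should superimpose almost exactly.
rng = np.random.default_rng(0)
target = rng.normal(size=(20, 3))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta), np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
moved = target @ rot.T + np.array([5.0, -2.0, 1.0])
rmsd = kabsch_rmsd(moved, target)
is_success = rmsd <= 2.0  # success criterion from this protocol
print(round(rmsd, 6), is_success)
```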

The Scientist's Toolkit

Table 2: Essential Research Reagent Solutions for ProteinMPNN Benchmarks

Item | Function & Relevance
PDB Structure Files | Source of fixed-backbone targets for redesign; ground truth for native sequence recovery metrics.
Pre-trained ProteinMPNN Weights | Core neural network parameters enabling fast, high-quality sequence design without task-specific training.
AlphaFold2 / RoseTTAFold2 | Critical for in silico validation; predicts the 3D structure of designed sequences to verify fold fidelity.
PyRosetta or BioPython | Software suites for calculating structural metrics (RMSD, DSSP) and automating analysis pipelines.
HEK293 or E. coli Expression Systems | For experimental validation of designed proteins; express and purify designs for biophysical characterization.
Size-Exclusion Chromatography (SEC) | Assesses monomeric state and solubility of expressed designs, a primary indicator of folding success.
Circular Dichroism (CD) Spectrometer | Validates secondary structure content matches the target fold (e.g., α-helical bundles, β-sheets).

ProteinMPNN Benchmark Validation Workflow

The following diagram outlines the logical flow and key decision points in a standard ProteinMPNN efficacy benchmark study.

Input Target Backbone (PDB File) → Run ProteinMPNN Sequence Design → Designed Sequences (FASTA). From the designed sequences, two branches follow: (vs. native) Sequence Recovery % → Validated Design Protocol; and (per sequence) Structure Prediction (AlphaFold2/RF2) → Structural Alignment & RMSD Calculation → Design Success (RMSD ≤ 2.0 Å) → Validated Design Protocol.

ProteinMPNN Benchmark Validation Workflow

Workflow for Enzyme Design Application

This diagram conceptualizes how ProteinMPNN integrates into a broader de novo enzyme design thesis, connecting sequence generation to functional validation.

Designed Enzyme Active Site Scaffold → (fixed backbone) ProteinMPNN Sequence Design → Sequence Library → (folding confidence & stability check) In Silico Folding & Filtering → (top ranking) High-Confidence Design Candidates → Experimental Expression & Purification → (soluble protein) Functional Activity Assay → (positive signal) De Novo Enzyme

Enzyme Design Thesis Application Pathway

How to Use ProteinMPNN: A Step-by-Step Guide for Enzyme Design Projects

Within a research thesis focused on de novo enzyme sequence design using ProteinMPNN, the selection and preparation of input backbone structures is the critical first step. ProteinMPNN designs sequences that are compatible with a given backbone scaffold, meaning the quality and appropriateness of the Protein Data Bank (PDB) file directly determine the feasibility and functionality of the designed enzymes. This document provides application notes and protocols for sourcing, curating, and formatting backbone PDB files to serve as optimal inputs for ProteinMPNN-driven enzyme design pipelines.

Sourcing Backbone Structures: Considerations and Protocols

The objective is to identify protein scaffolds with structural features conducive to the desired enzymatic function (e.g., active site geometry, binding pockets, oligomeric state).

Protocol 1.1: Targeted Backbone Retrieval from the PDB

  • Define Scaffold Criteria: List required parameters (Table 1).
  • Utilize the RCSB PDB Advanced Search Interface: Apply filters corresponding to your criteria.
  • Evaluate and Shortlist: Download candidate PDB files and perform initial visual inspection in software like PyMOL or ChimeraX to confirm key features.
  • Record Metadata: Maintain a lab notebook or spreadsheet tracking the rationale for each selected structure.

Table 1: Key Criteria for Scaffold Selection

Criterion | Typical Target for Enzyme Design | Rationale
Resolution | ≤ 2.5 Å | Higher confidence in atomic coordinates and backbone geometry.
Organism Source | Thermostable organisms (e.g., Thermus thermophilus) | Scaffolds often exhibit higher thermal stability.
Presence of Cofactors | As required by reaction mechanism | Essential for designing functional active sites.
Oligomeric State | Monomer or multimer as needed | ProteinMPNN can design for symmetry; correct state is crucial.
Absence of Tags/Fusions | Prefer native structures | Prevents interference with designed folding.

Protocol 1.2: Generating De Novo Backbones with RFdiffusion or RoseTTAFold

For novel folds not found in the PDB, de novo backbone generation is used.

  • Input Conditioning: Define the target fold via a conditioning motif (e.g., a partial structure) or a specification of segment lengths and fixed regions.
  • Run RFdiffusion: Use the tool to generate an ensemble of possible backbone structures (e.g., 100 models).
  • Cluster and Select: Cluster models based on RMSD and select centroids representing diverse, well-folded geometries.
  • Refine with Rosetta Relax or AlphaFold2: Improve the physical realism of the selected de novo backbones and minimize their steric clashes.
  • Output: Save the final refined model in PDB format for downstream processing.

Formatting and Preprocessing PDB Files for ProteinMPNN

Raw PDB files often require cleaning and standardization to ensure compatibility with ProteinMPNN.

Protocol 2.1: Essential PDB Cleaning and Standardization

  • Remove Non-Protein Entities: Strip out water molecules, ions, bulk solvent, and small molecule ligands unless they are critical cofactors. For cofactors, convert to a canonical residue name (e.g., HEM).
  • Handle Multiple Models: For NMR ensembles or computational models, select a single representative model (usually the first).
  • Standardize Chain IDs and Residue Numbering: Ensure chain IDs are single characters (A, B, C). Consider renumbering residues sequentially from 1 for each chain to avoid errors.
  • Retain Only Essential Atoms: Keep the backbone atoms (N, CA, C, O) and, optionally, CB. ProteinMPNN builds its features from backbone coordinates and places a virtual Cβ from the backbone geometry, so the remaining side-chain atoms can be removed.
  • Ensure a Continuous Backbone: Check for and address missing residues within the design region. Gaps may require modeling with tools like Modeller.
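
The cleaning steps above can be sketched with plain string handling on the fixed-width PDB columns. This is a minimal sketch rather than a replacement for Biopython or PyMOL: `clean_pdb` and the demo helper `atom_line` are hypothetical names, only the first model and one chain are kept, and alternate conformations other than 'A' are dropped.

```python
KEEP_ATOMS = {"N", "CA", "C", "O", "CB"}


def clean_pdb(pdb_text, chain="A"):
    """Keep ATOM records of one chain, first model only, backbone plus CB atoms."""
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # stop after the first model of an ensemble
        if record != "ATOM" or len(line) < 54:
            continue  # drops HETATM (waters, ligands) and malformed lines
        atom_name = line[12:16].strip()
        altloc = line[16]
        if line[21] != chain or atom_name not in KEEP_ATOMS:
            continue
        if altloc not in (" ", "A"):
            continue  # keep only the first alternate conformation
        kept.append(line[:16] + " " + line[17:])  # blank the altloc column
    return "\n".join(kept + ["END"]) + "\n"


def atom_line(record, serial, name, altloc, resname, chain, resseq):
    # Demo-only helper that writes fixed-width PDB columns with dummy coordinates.
    return (f"{record:<6}{serial:>5} {name:<4}{altloc}{resname:>3} {chain}"
            f"{resseq:>4}    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}")


raw = "\n".join([
    atom_line("ATOM", 1, "N", " ", "SER", "A", 1),
    atom_line("ATOM", 2, "CA", "A", "SER", "A", 1),
    atom_line("ATOM", 3, "CA", "B", "SER", "A", 1),     # alternate conformation
    atom_line("ATOM", 4, "C", " ", "SER", "A", 1),
    atom_line("ATOM", 5, "O", " ", "SER", "A", 1),
    atom_line("ATOM", 6, "CB", " ", "SER", "A", 1),
    atom_line("ATOM", 7, "OG", " ", "SER", "A", 1),     # side-chain atom
    atom_line("HETATM", 8, "O", " ", "HOH", "A", 101),  # water
    atom_line("ATOM", 9, "N", " ", "SER", "B", 1),      # other chain
])
cleaned = clean_pdb(raw)
names = [l[12:16].strip() for l in cleaned.splitlines() if l.startswith("ATOM")]
print(names)
```

Cofactors that must be retained need separate handling, since this sketch discards every HETATM record.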

Protocol 2.2: Defining Designable and Fixed Regions (The Mask)

ProteinMPNN requires a specification of which residues to redesign (designable) and which to hold fixed.

  • Create a B-factor Column Mask: In the cleaned PDB file, modify the B-factor column. Set B-factor to 1.00 for residues to be designed and 0.00 for residues to be fixed. This column is a bookkeeping and visualization aid; ProteinMPNN does not read it, so the same fixed positions must also be supplied to the run through --fixed_positions_jsonl.
  • Typical Masking Strategy:
    • Fixed: Catalytic residues, cofactor-binding residues, structurally critical residues (e.g., disulfide bridges).
    • Designable: The rest of the scaffold, especially surfaces and loops for substrate binding or altered properties.
  • Save the Final Prepared PDB: This file, with cleaned atoms and the B-factor mask, is the input backbone for ProteinMPNN.

Raw PDB File (RCSB or De Novo) → 1. Remove Solvent/Non-Essential Ligands → 2. Select Single Model & Standardize Chains → 3. Trim to Backbone Atoms (N, CA, C, O, CB) → 4. Insert B-factor Mask (1.00 = Design, 0.00 = Fixed) → Final Prepared PDB for ProteinMPNN

Title: PDB File Preprocessing Workflow for ProteinMPNN.

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools for Backbone Preparation

Tool / Resource | Primary Function | Application in Protocol
RCSB Protein Data Bank | Repository of experimentally solved 3D structures. | Source of initial backbone scaffolds (Protocol 1.1).
PyMOL / UCSF ChimeraX | Molecular visualization and analysis software. | Visual inspection, cleaning, and masking of PDB files.
RFdiffusion | Generative AI for de novo protein backbone creation. | Generating novel scaffold structures (Protocol 1.2).
AlphaFold2 | Protein structure prediction tool. | Refining and validating de novo or gapped structures.
Rosetta Relax | Molecular modeling for structure refinement. | Energy minimization and steric clash removal.
Biopython PDB Module | Python library for PDB file manipulation. | Programmatic parsing, cleaning, and masking of PDB files.
ProteinMPNN | Protein sequence design neural network. | Final recipient of the prepared PDB file for sequence design.

Validation of Prepared Backbones

Prior to full-scale design, validate the prepared input.

Protocol 3.1: Pre-Design Backbone Validation

  • Run AlphaFold2 on the Native Sequence: Predict the structure of the scaffold's original sequence (if available) with AlphaFold2 and compare it to the prepared backbone. A high pLDDT score (>90) and a low RMSD to the input suggest that the scaffold is foldable.
  • Check Structural Integrity: Use Rosetta's score_jd2 or MolProbity to assess Ramachandran outliers, rotamer outliers, and steric clashes. A clean structure is imperative.
  • Verify the Mask: Visually confirm in PyMOL that the B-factor column correctly highlights intended designable regions.

Prepared PDB File → three parallel checks: AlphaFold2 Validation (pLDDT, RMSD); Steric & Torsion Check (Rosetta/MolProbity); Visual Mask Inspection (PyMOL/ChimeraX) → Validation Pass? → Yes: Ready for ProteinMPNN Design; No: Return to Preprocessing

Title: Validation Pipeline for ProteinMPNN Input Backbones.

Within the broader thesis on de novo enzyme sequence design, ProteinMPNN serves as the pivotal computational tool for generating functional, foldable amino acid sequences for predetermined backbone scaffolds. This protocol details the command-line execution and critical parameter tuning necessary for robust sequence design, a foundational step in the computational enzyme design pipeline.

The efficacy of ProteinMPNN in enzyme design is governed by several tunable parameters. The table below summarizes the core parameters, their default values, typical ranges used in enzyme design, and their primary impact on output.

Table 1: Core ProteinMPNN Parameters for Enzyme Design

Parameter | Default Value | Recommended Range for Enzymes | Function & Impact on Design
--num_seq_per_target | 1 | 10-100 | Number of independent sequences to generate per backbone. Higher values increase diversity for screening.
--sampling_temp | 0.1 | 0.01 - 0.3 | Controls randomness; lower temps favor high-probability (conservative) sequences, higher temps increase exploration.
--seed | 0 | Any nonzero integer | Sets the random seed for reproducible designs; with the default of 0 the script picks a random seed. Critical for experimental validation.
--batch_size | 1 | 1-8 | Number of sequences sampled in parallel for a backbone. Higher values speed up computation if memory permits.
--model_name | v_48_020 | v_48_002, v_48_010, v_48_020, v_48_030 | Model weights; the suffix reflects the backbone noise level used during training.
--use_soluble_model | False | True/False | Flag that loads the weights trained on soluble proteins only.
--omit_AAs | 'X' | e.g., 'C' to disallow Cys | List of amino acid single-letter codes to exclude from design.
--bias_AA_jsonl | None | Path to a JSON dictionary, e.g., {"A": 2.5} | Global per-amino-acid bias added to the model's logits; positive values favor a residue type, negative values disfavor it.
--bias_by_res_jsonl | None | Path to .jsonl file | Per-residue, per-AA bias specification for precise functional site control.
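
To make the effect of the sampling temperature and the bias options concrete, the sketch below applies a temperature-scaled softmax to a toy set of logits. It is a generic illustration of the idea, assuming biases are added to the logits before the division by temperature; it is not ProteinMPNN's own code, and the three-letter toy alphabet and logit values are made up.

```python
import numpy as np


def sampling_probs(logits, temperature=0.1, bias=None):
    """Softmax over (logits + bias) / temperature, computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    z = np.asarray(logits, dtype=float)
    if bias is not None:
        z = z + np.asarray(bias, dtype=float)
    scaled = z / temperature
    scaled = scaled - scaled.max()  # avoid overflow in exp
    p = np.exp(scaled)
    return p / p.sum()


logits = [2.0, 1.0, 0.0]  # toy scores for three residue types
p_low = sampling_probs(logits, temperature=0.1)
p_high = sampling_probs(logits, temperature=0.3)
p_biased = sampling_probs(logits, temperature=0.3, bias=[0.0, 0.0, 3.0])

print(np.round(p_low, 4))     # sharply peaked on the top-scoring residue
print(np.round(p_high, 4))    # flatter, so sampling is more diverse
print(np.round(p_biased, 4))  # the bias shifts mass to the third residue
```

Lower temperatures concentrate probability on the model's preferred residue at each position, which is why low values give conservative, low-diversity sequence sets.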

Detailed Command-Line Protocol

This protocol assumes a local installation of ProteinMPNN from its official GitHub repository and a prepared protein backbone in PDB format.

Protocol 3.1: Basic Single-Backbone Sequence Design

Objective: Generate 50 novel sequences for a single enzyme scaffold.

Materials:

  • Input Backbone: scaffold.pdb
  • ProteinMPNN Environment: Python/conda environment with dependencies installed.
  • Computational Resources: Machine with GPU (CUDA) recommended.

Methodology:

  • Navigate to the ProteinMPNN directory in your terminal.
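  • Run protein_mpnn_run.py, passing scaffold.pdb as the input backbone, ./outputs as the output folder, and a request for 50 sequences at the chosen sampling temperature and seed.

  • Output Files: The ./outputs folder will contain:
    • seqs/scaffold.fa: FASTA file of the 50 designed sequences, with per-sequence scores in the headers.
    • scores/scaffold.npz: Per-residue log probabilities for each sequence (written only when score saving is enabled with --save_score 1).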
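
The run step of this protocol can be scripted by assembling the argument list in Python. This sketch only builds and prints the command; it does not execute it. The flag names follow the public repository's `protein_mpnn_run.py` as best understood at the time of writing and should be confirmed with `python protein_mpnn_run.py --help`. The function name and the default seed of 37 are illustrative.

```python
import shlex


def build_mpnn_command(pdb_path, out_folder, num_seq=50, temp=0.1, seed=37,
                       chains="A", batch_size=1, soluble=False):
    """Return the argument list for a single-backbone ProteinMPNN run."""
    cmd = [
        "python", "protein_mpnn_run.py",
        "--pdb_path", pdb_path,
        "--pdb_path_chains", chains,
        "--out_folder", out_folder,
        "--num_seq_per_target", str(num_seq),
        "--sampling_temp", str(temp),
        "--seed", str(seed),
        "--batch_size", str(batch_size),
    ]
    if soluble:
        cmd.append("--use_soluble_model")  # weights trained on soluble proteins
    return cmd


cmd = build_mpnn_command("scaffold.pdb", "./outputs")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from inside the ProteinMPNN directory would launch the run; keeping the seed in the command makes a design batch reproducible.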

Protocol 3.2: Design with Functional Site Constraints

Objective: Design sequences while restricting the identity of catalytic residues (e.g., positions 45, 46, 47 as His-Asp-Ser) and biasing the entire sequence for alanine.

Materials:

  • Bias File: bias_by_res.json (see below for creation).

Methodology:

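  • Create a bias specification JSON file. For a 100-residue protein, the file holds one row of per-amino-acid biases for each position: every row carries a positive bias for Ala, and the rows for positions 45, 46, and 47 instead carry a strong bias toward His, Asp, and Ser, respectively.

  • Run ProteinMPNN with the bias file, passing its path to the per-residue bias option.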
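
A per-residue bias file of this kind can be generated programmatically. The sketch below assumes the layout {pdb name: {chain: L x 21 nested list}} with columns ordered as `ACDEFGHIKLMNPQRSTVWYX`, which is how the repository's bias helper script appears to write it; both the layout and the alphabet order should be verified before use. The bias magnitudes (1.0 for Ala, 100.0 to pin a residue) and the name `scaffold` are illustrative choices.

```python
import json

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # assumed column order of the bias matrix


def build_bias_by_res(length, global_bias, pinned, pin_strength=100.0):
    """Build an L x 21 bias matrix as nested lists.

    global_bias: {amino acid: bias} applied at every unpinned position.
    pinned: {1-based position: amino acid} given a strong bias toward one residue.
    """
    rows = []
    for pos in range(1, length + 1):
        row = [0.0] * len(ALPHABET)
        if pos in pinned:
            row[ALPHABET.index(pinned[pos])] = pin_strength
        else:
            for aa, value in global_bias.items():
                row[ALPHABET.index(aa)] = value
        rows.append(row)
    return rows


bias = build_bias_by_res(100, {"A": 1.0}, {45: "H", 46: "D", 47: "S"})
record = json.dumps({"scaffold": {"A": bias}})  # one JSON line for chain A
print(len(bias), len(bias[0]))
```

A large bias strongly favors a residue but does not strictly guarantee it. When the scaffold already carries the catalytic residues, holding those positions fixed is the stricter way to preserve them.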

Visualization of Workflows

Input: Backbone Scaffold (PDB) → Parameter Setup (model, temp, bias) → ProteinMPNN Run (Neural Network Sampling) → Output: Designed Sequences (FASTA) → Downstream Analysis (Folding, Scoring, Ranking)

Diagram 1: Core ProteinMPNN Enzyme Design Workflow

Backbone PDB → Structure Preparation (Add missing atoms, etc.) → Feature Extraction (Dihedrals, distances, etc.) → ProteinMPNN Model (Message Passing Neural Net) → Sequence Sampling (Auto-regressive decoder) → Designed FASTA Sequences. Key parameters: sampling_temp, bias_AA / bias_by_res, model_type.

Diagram 2: ProteinMPNN Internal Dataflow & Parameter Integration

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials & Computational Reagents for ProteinMPNN Experiments

Item/Reagent | Function in Protocol | Notes for Enzyme Design
Protein Backbone (PDB) | The input 3D scaffold for sequence design. | Often a de novo fold or a redesigned natural enzyme scaffold with the desired active site geometry.
ProteinMPNN Software | Core sequence design engine. | Must be cloned from GitHub. The soluble model is often preferred for globular enzymes.
Conda/Python Environment | Isolated software environment. | Ensures dependency version compatibility (PyTorch, etc.).
GPU (CUDA-capable) | Hardware accelerator. | Drastically reduces sampling time; essential for large-scale design (e.g., num_seq > 1000).
Bias Specification (JSON) | Encodes positional constraints. | Critical for encoding catalytic residues, disulfide bonds, or cofactor-binding motifs.
Downstream Filtering Software | Evaluates design quality. | Tools like AlphaFold2 (for structure validation) or Rosetta (for energy scoring) are used post-MPNN.
High-Performance Computing (HPC) Cluster | For large batch processing. | Required for designing across hundreds of scaffolds or generating massive sequence libraries.

Application Notes

Within the broader thesis on ProteinMPNN for de novo enzyme sequence design research, the critical challenge is generating functional sequences that not only fold into stable structures but also correctly position catalytic machinery. This involves constraining the sequence design process to incorporate predefined active site residues and cofactor-binding geometries. The following notes detail the application of these constraints using the ProteinMPNN paradigm.

Key Application Principle: ProteinMPNN operates as a neural network trained to predict amino acid probabilities given a protein backbone structure. For catalytic design, a subset of positions is "fixed" (i.e., their identities are predetermined and held constant during sequence generation). These include:

  • Catalytic Triads/Diads: Essential residues (e.g., Ser-His-Asp, Cys-His) involved in the chemical mechanism.
  • Cofactor-Coordinating Residues: Residues that directly ligate or interact with essential cofactors (e.g., heme iron, metal ions, NAD(P)H, PLP).
  • Structural Scaffold Residues: Residues critical for maintaining the precise spatial geometry required for catalysis, often surrounding the active site.

Quantitative Performance Data: The success of designs is typically evaluated by experimental expression, purification, and activity assays. The following table summarizes key metrics from recent studies incorporating active site constraints.

Table 1: Quantitative Outcomes of Constrained Enzyme Design Studies

Study Focus | Fixed Residue Count | Sequence Recovery (%)* | Experimental Success Rate (%) | Key Measured Activity (kcat/Km or relative rate)
Retro-aldolase Design | 8-12 | 94.2 | 25 | ~10³ M⁻¹s⁻¹ (best design)
Non-heme Iron Dioxygenase | 6 (Fe ligands) + 4 | 91.7 | 40 | 0.02 - 0.05 s⁻¹ (product formation)
Kemp Eliminase (HG3) | 3 (catalytic triad) | 89.5 | ~10 | 1.5 x 10⁵ M⁻¹s⁻¹ (optimized design)
De Novo Heme Binding | 4 (heme ligation) + 2 | 96.0 | 65 | Tight binding (Kd < 100 nM)

* Sequence recovery in the variable regions compared to natural or parent sequences.
Success rate is defined by soluble expression and detectable catalytic activity.

Protocols

Protocol 1: Defining and Encoding Active Site Constraints for ProteinMPNN

This protocol describes the crucial preparatory step of translating biochemical knowledge into a machine-readable format for ProteinMPNN input.

Materials:

  • Research Reagent Solutions & Essential Materials:
    • PDB Structure File (e.g., scaffold.pdb): A backbone structure (natural or de novo folded) to be designed.
    • Molecular Visualization Software (PyMOL, ChimeraX): For identifying and verifying residue positions.
    • Text Editor / Python Scripting Environment: To prepare constraint files.
    • List of Canonical Active Site Geometries: From curated catalytic-site resources such as M-CSA (Mechanism and Catalytic Site Atlas).
    • Cofactor Parameter File (if applicable): CIF or parameter file defining cofactor bond lengths and angles.

Methodology:

  • Identify Constrained Positions: Using the PDB file and literature on the target reaction, list all residues that must be preserved. This includes:
    • Direct catalytic residues.
    • Residues forming hydrogen bonds to transition state analogs.
    • Residues coordinating metal ions or specific atoms of a cofactor (e.g., the O1A and O2A atoms of NAD).
    • Residues within 4Å of the cofactor that define the binding pocket shape.
  • Create a Residue Mask File: Generate a simple list (e.g., fixed_residues.txt) specifying the chain ID and residue number (according to the PDB) for each constrained position, one position per line (e.g., A 45).

  • Create a Sequence Constraint File: For each constrained position, specify the allowed amino acid(s). This is a JSON dictionary where keys are "chain_resNum" and values are lists of allowed one-letter codes (e.g., "A_45": ["H"]).

  • Validate Geometry: Using molecular visualization software, ensure the fixed residues in the scaffold structure are geometrically compatible (e.g., correct distances for hydrogen bonds, feasible metal coordination geometry).
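
The two constraint files from this protocol can be cross-checked before any design run. In the sketch below, the file layouts ("chain residue_number" lines and a JSON dictionary keyed by "chain_resNum") follow the descriptions above, while the function names and example contents are hypothetical. The check reports fixed positions whose current scaffold residue is not in the allowed set, which matters because a fixed position keeps whatever residue the input structure carries.

```python
import json


def parse_fixed_residues(text):
    """Parse 'chain resnum' lines into {chain: sorted residue numbers}."""
    fixed = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # allow comments
        if not line:
            continue
        chain, resnum = line.split()
        fixed.setdefault(chain, []).append(int(resnum))
    return {chain: sorted(set(nums)) for chain, nums in fixed.items()}


def check_constraints(fixed, constraints, scaffold_residues):
    """List fixed positions that lack a constraint or hold a disallowed residue."""
    problems = []
    for chain, resnums in fixed.items():
        for resnum in resnums:
            key = f"{chain}_{resnum}"
            allowed = constraints.get(key)
            current = scaffold_residues.get(key)
            if allowed is None:
                problems.append(f"{key}: no allowed residues listed")
            elif current not in allowed:
                problems.append(f"{key}: scaffold has {current}, allowed {allowed}")
    return problems


fixed_text = """# chain residue_number
A 45
A 46
A 47
"""
constraints = json.loads('{"A_45": ["H"], "A_46": ["D", "E"], "A_47": ["S"]}')
scaffold_residues = {"A_45": "H", "A_46": "D", "A_47": "A"}  # one-letter codes from the PDB

fixed = parse_fixed_residues(fixed_text)
problems = check_constraints(fixed, constraints, scaffold_residues)
print(problems)
```

Any reported position would need to be mutated in the input structure, or steered with a per-residue bias, before it is held fixed.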

Protocol 2: Running ProteinMPNN with Cofactor and Active Site Constraints

This protocol details the execution of the design process with the constraints defined in Protocol 1.

Materials:

  • Research Reagent Solutions & Essential Materials:
    • ProteinMPNN Installation: Local or server-based instance (v1.1 or later).
    • Prepared Constraint Files: From Protocol 1 (fixed_residues.txt, sequence_constraints.json).
    • Scaffold PDB File: The input backbone.
    • Computational Environment: Linux environment with CUDA-capable GPU recommended for speed.

Methodology:

  • Prepare the Input Directory: Place the scaffold PDB file and constraint files in a dedicated directory.
  • Execute ProteinMPNN with Flags: Run the protein_mpnn_run.py script with appropriate arguments to enforce constraints.

  • Generate Sequence Pool: The primary output is a FASTA file written to the seqs/ subdirectory of the output folder, containing the requested number of designed sequences (200 in this protocol). Extract the sequences for downstream analysis.
  • Filter and Cluster: Use in-silico tools (e.g., SCUBA, HMMER) to filter sequences for properties like charge distribution, hydrophobicity near active site, and cluster to select diverse candidates for experimental testing.
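
To enforce the fixed residues at run time, the PDB-numbered list from Protocol 1 has to be translated into the input ProteinMPNN reads. The sketch below assumes the layout produced by the helper scripts in the ProteinMPNN repository: one JSON object per line, keyed by structure name, with 1-based per-chain positions counted in file order. The function names and example numbers are hypothetical.

```python
import json


def fixed_positions_record(name, chain_residue_numbers, fixed_residues):
    """Map PDB residue numbers to 1-based per-chain indices.

    name: structure name (assumed to be the PDB file stem).
    chain_residue_numbers: {chain: [PDB residue numbers in file order]}.
    fixed_residues: iterable of (chain, PDB residue number) to keep fixed.
    """
    fixed = set(fixed_residues)
    per_chain = {}
    for chain, numbers in chain_residue_numbers.items():
        per_chain[chain] = [
            index
            for index, number in enumerate(numbers, start=1)
            if (chain, number) in fixed
        ]
    return {name: per_chain}


def to_jsonl_line(record):
    """Serialize one record as a single JSONL line."""
    return json.dumps(record) + "\n"
```

The resulting file would then be passed to protein_mpnn_run.py through its --fixed_positions_jsonl argument. Note that a gap in PDB numbering (residues 12 to 15 in the test case below) shifts the index, which is why the mapping is done explicitly.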

Protocol 3: In-silico Validation of Cofactor Binding Geometry

Prior to experimental expression, this protocol screens designs for their ability to accommodate the required cofactor.

Materials:

  • Research Reagent Solutions & Essential Materials:
    • Designed Protein Models: Structures predicted via AlphaFold2 or RoseTTAFold for each designed sequence.
    • Cofactor 3D Coordinate File: PDB or MOL2 file of the cofactor in its active conformation.
    • Molecular Docking Software (AutoDock Vina, SMINA): For rigid or flexible docking.
    • Molecular Dynamics (MD) Simulation Suite (GROMACS, AMBER): For short MD relaxations.
    • Script for Geometry Analysis: Custom Python script using Bio.PDB or MDAnalysis.

Methodology:

  • Predict Designed Protein Structures: Run AlphaFold2 or RoseTTAFold on the top 20-50 designed FASTA sequences to generate full-atom models.
  • Rigid Docking: Dock the cofactor into the predicted active site pocket of each model using defined coordinate constraints to ensure the catalytic atoms are positioned correctly relative to the fixed residues.
  • Pose Relaxation and Scoring: Perform a short (5-10 ns) MD simulation or energy minimization with the cofactor bound. Analyze:
    • Stability of cofactor-protein interactions (RMSD).
    • Preservation of key distances (e.g., metal-ligand distances < 2.5 Å, hydrogen bond distances < 3.2 Å).
    • Energy of interaction (MM/GBSA scoring).
  • Rank Designs: Select the top 5-10 designs that maintain all critical geometric constraints for experimental characterization.
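
The distance criteria above are straightforward to check programmatically once coordinates have been extracted (for example with Bio.PDB or MDAnalysis). The following is a minimal sketch using only the standard library; the contact labels and coordinates in any call are placeholders, and the function names are hypothetical.

```python
import math

# Distance cut-offs quoted in the protocol (in Å).
CUTOFFS = {"metal": 2.5, "hbond": 3.2}


def check_contacts(contacts):
    """Evaluate each required contact against its cut-off.

    contacts: list of (label, kind, coord_a, coord_b), with kind in CUTOFFS
    and coordinates given as (x, y, z) tuples in Å.
    Returns {label: (distance rounded to 0.01 Å, passed)}.
    """
    report = {}
    for label, kind, coord_a, coord_b in contacts:
        d = math.dist(coord_a, coord_b)
        report[label] = (round(d, 2), d < CUTOFFS[kind])
    return report


def design_passes(contacts):
    """A design passes only if every required contact is within its cut-off."""
    return all(passed for _, passed in check_contacts(contacts).values())
```

Applied to frames from the relaxation trajectory, the same check shows whether a contact is preserved over time instead of only in the starting pose.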

Diagrams

[Diagram: PDB → Define Active Site & Cofactor Constraints → Run ProteinMPNN with Constraints → Predict Structures (AF2/RoseTTAFold) → Cofactor Docking & Geometry Validation → Rank & Select Top Designs → Experimental Expression & Assay]

Workflow for Catalytic Enzyme Design

[Diagram: Inputs to ProteinMPNN (Backbone Scaffold as PDB coordinates; Residue Mask of fixed positions; Allowed Amino Acids per Position as JSON) → ProteinMPNN Neural Network → Output & Validation: Pool of Designed Sequences (FASTA) → Predicted Fold with Active Site Intact → Validated Cofactor Binding Pose]

Input/Output Flow of Constrained Design

Application Notes

ProteinMPNN has emerged as a powerful tool for de novo protein sequence design, enabling the generation of novel, functional enzymes. A critical research frontier involves steering this generative capacity toward sequences that not only fold into a target structure but also exhibit optimized biophysical properties critical for experimental validation and application, namely stability, solubility, and expression yield. This protocol details methods for integrating property prediction tools with ProteinMPNN’s inference cycle to achieve targeted, property-guided sequence design.

The core strategy involves a post-generation filtering or in-loop scoring approach. Multiple sequences are sampled from ProteinMPNN for a given backbone. These candidates are then rapidly scored by auxiliary neural networks trained to predict specific properties. The highest-scoring sequences for the desired property (e.g., higher stability, solubility) are selected for experimental testing. This method effectively disentangles the folding objective (handled by ProteinMPNN) from the property optimization objective (handled by the predictor).
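
The post-generation filtering strategy can be sketched as a filter followed by a sort. The example below assumes each candidate already carries a predicted solubility score (0 to 1, higher is better) and a predicted ΔΔG (more negative is more stable); the dictionary keys, the 0.5 solubility cut-off, and the function name are illustrative assumptions.

```python
def rank_candidates(candidates, min_solubility=0.5, top_n=20):
    """Filter by predicted solubility, then rank by predicted ΔΔG.

    candidates: list of dicts with keys "id", "solubility" (0-1, higher is
    better) and "ddg" (more negative = more stable).
    """
    # Step 1: discard sequences predicted to be insoluble.
    soluble = [c for c in candidates if c["solubility"] >= min_solubility]
    # Step 2: most negative ΔΔG first; ties broken by higher solubility.
    ranked = sorted(soluble, key=lambda c: (c["ddg"], -c["solubility"]))
    return ranked[:top_n]
```

Filtering before sorting keeps the two objectives separate, so a very stable but insoluble sequence never reaches the shortlist.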

Table 1: Performance of Property Prediction Tools for Filtering ProteinMPNN Outputs

Property | Predictive Tool (Model) | Key Metric | Reported Performance (vs. Baseline) | Use in Design Pipeline
Stability | ProteinGCN (ΔΔG) | Spearman's ρ | ρ ~0.65 on deep mutation data | Rank-order ProteinMPNN sequences by predicted ΔΔG.
Solubility | SoluProt | AUC-ROC | >0.9 on solubility benchmark sets | Filter out sequences predicted as insoluble.
Expressibility | DeepESM (Localization/Expression) | Accuracy | >80% classification accuracy in E. coli | Select sequences predicted for high expression.
Aggregation | Aggrescan3D (3D Aggregation Propensity) | Aggregation Score | Identifies surface "hot spots" on structure | Mutate aggregation-prone residues in fixed backbone.

Experimental Protocols

Protocol 1: Property-Guided Sequence Design with Filtering Objective: To generate sequences for a target enzyme backbone that are predicted to be stable and soluble.

Materials:

  • Target protein backbone (PDB file)
  • ProteinMPNN software (local or API)
  • Property prediction servers (e.g., SoluProt, ProteinGCN)
  • E. coli expression vector system

Procedure:

  • Backbone Preparation: Prepare your target enzyme backbone (e.g., a de novo fold or a natural scaffold). Clean the PDB file, ensuring proper chain separation.
  • ProteinMPNN Sampling: Run ProteinMPNN in stochastic sampling mode, requesting 1,000 or more sequences (e.g., --num_seq_per_target 1000), to generate a large, diverse sequence ensemble for the backbone. Use default or per-residue amino acid biases if prior functional motifs are required.
  • Property Prediction Batch Analysis: Submit the FASTA file of generated sequences to property prediction tools. For solubility, use SoluProt web server batch upload. For stability, use a local ProteinGCN instance to compute predicted ΔΔG relative to a reference.
  • Sequence Ranking & Selection: Compile results into a table. Rank sequences by a composite score (e.g., prioritize solubility prediction first, then stability). Select the top 20-50 sequences for synthesis.
  • Gene Synthesis & Cloning: Order genes as gBlocks or full-length syntheses. Clone into your preferred E. coli expression vector (e.g., pET series with a solubility tag like MBP or Trx).
  • Expression Test: Transform into expression strains (e.g., BL21(DE3)). Perform small-scale expression (5 mL cultures), induce with IPTG, and analyze total protein and soluble fraction via SDS-PAGE.

Protocol 2: In-Loop Scoring for Stability Optimization Objective: To iteratively refine ProteinMPNN outputs for maximum predicted stability.

Materials:

  • As in Protocol 1.
  • Custom Python scripting environment.

Procedure:

  • Automated Pipeline Setup: Write a script that automates the call to ProteinMPNN, extracts sequences, and calls a stability predictor (like ProteinGCN).
  • Iterative Design Loop:
    • a. Generate a batch of 200 sequences from ProteinMPNN.
    • b. Compute predicted ΔΔG for each sequence.
    • c. Identify the sequence with the most favorable (most negative) ΔΔG.
    • d. Use this sequence's residue identity at each position to bias the next round of ProteinMPNN sampling (e.g., per position via --bias_by_res_jsonl, or globally via --bias_AA_jsonl and --omit_AAs).
  • Convergence Check: Run for 5-10 iterations or until the predicted ΔΔG plateaus. Proceed with experimental validation of the final converged sequence(s).
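
The loop and its convergence check can be outlined as follows. This is a sketch under stated assumptions: generate and score are hypothetical wrappers that the user would write around a ProteinMPNN call and a stability predictor, the 0.05 plateau tolerance is arbitrary, and the two toy functions exist only so the example runs without either tool installed.

```python
def iterative_design(generate, score, n_rounds=10, batch_size=200, tolerance=0.05):
    """Iteratively sample, score, and re-bias toward the best sequence.

    generate(bias_sequence, n) -> list of sequences (bias_sequence is None
        in the first round).
    score(sequence) -> predicted ΔΔG (more negative is better).
    Returns (best sequence, best ΔΔG, best ΔΔG after each round).
    """
    best_seq, best_ddg = None, float("inf")
    history = []
    for _ in range(n_rounds):
        batch = generate(best_seq, batch_size)
        round_seq = min(batch, key=score)
        round_ddg = score(round_seq)
        improvement = best_ddg - round_ddg
        if round_ddg < best_ddg:
            best_seq, best_ddg = round_seq, round_ddg
        history.append(best_ddg)
        # Plateau: this round improved the best ΔΔG by less than `tolerance`.
        if improvement < tolerance:
            break
    return best_seq, best_ddg, history


# Toy stand-ins (purely illustrative, not real predictors).
def toy_score(sequence):
    """Pretend ΔΔG: each leucine contributes -1."""
    return -float(sequence.count("L"))


def toy_generate(bias_sequence, n):
    """Pretend sampler: the bias sequence plus a one-mutation variant."""
    base = bias_sequence or "AAAA"
    i = base.find("A")
    variant = base if i < 0 else base[:i] + "L" + base[i + 1:]
    return [base, variant][:n]
```

Because the loop optimizes a predicted quantity, it can overfit to the predictor; carrying several sequences from the final rounds into validation is a reasonable safeguard.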

Diagrams

[Diagram: Target Backbone (PDB) → (input) ProteinMPNN Stochastic Sampling → (num_seq=1000) Generated Sequence Library (FASTA) → (batch analysis) Property Prediction (Stability, Solubility) → (composite scoring) Ranked & Filtered Sequence Shortlist → (validation) Gene Synthesis & Experimental Testing]

Property-Guided Design Workflow

The Scientist's Toolkit: Research Reagent Solutions

Item | Function in Protocol
pET-28a(+) Vector | Common E. coli expression vector with T7 promoter and N-terminal His-tag for purification.
Rosetta2(DE3) E. coli Cells | Expression strain that supplies tRNAs for codons rarely used in E. coli.
BL21(DE3) E. coli Cells | Standard robust strain for high-level protein expression.
FastAP Thermosensitive Alkaline Phosphatase | For dephosphorylating vector DNA to reduce re-ligation background.
Gibson Assembly Master Mix | Enables seamless, single-tube assembly of multiple DNA fragments (gene + vector).
Lysozyme & Benzonase Nuclease | For efficient bacterial cell lysis and degradation of genomic DNA to reduce viscosity.
Ni-NTA Agarose Resin | Immobilized metal affinity resin for purifying His-tagged proteins.
Ulp1 Protease (SUMO Protease) | For cleaving off solubility-enhancing fusion tags (e.g., SUMO) precisely.
Size-Exclusion Chromatography Column (HiLoad 16/600) | For final polishing step to isolate monomeric, correctly folded protein.
Thermofluor Dye (SYPRO Orange) | For thermal shift assays to experimentally measure protein stability (Tm).

Within a research thesis focused on de novo enzyme sequence design using ProteinMPNN, a critical validation step is the accurate prediction of the 3D structure for designed sequences. This protocol details the application of AlphaFold2 and RoseTTAFold as orthogonal validation tools to assess whether ProteinMPNN-generated sequences fold into the intended target structure, a prerequisite for downstream experimental characterization and drug development.

Application Notes

  • Purpose: To computationally validate the structural fidelity of de novo designed protein sequences from ProteinMPNN.
  • Principle: Both AlphaFold2 (AF2) and RoseTTAFold (RTF) are end-to-end neural networks that predict protein 3D structure from amino acid sequence using deep learning, trained on known structures from the PDB.
  • Key Metric for Validation: The primary quantitative measure is the Cα Root-Mean-Square Deviation (RMSD) between the predicted structure and the original design target (scaffold). A low RMSD (<2.0 Å) suggests the designed sequence successfully encodes the target fold.
  • Complementary Use: Employing both systems provides cross-validation, increasing confidence in the prediction, especially for novel folds where model performance may vary.

Validation Metric | AlphaFold2 (AF2) | RoseTTAFold (RTF) | Ideal Validation Threshold
Average Cα RMSD (Å) (Designed vs. Target) | 1.2 - 3.5 Å | 1.5 - 4.0 Å | < 2.0 Å
pLDDT Confidence Score (per-residue) | 0 - 100 scale | Not directly equivalent | > 70 (Confident)
pTM Score (global confidence) | 0 - 1 scale | Not provided | > 0.7
Predicted Aligned Error (PAE) | Yes (Å) | Yes (Å) | Low inter-domain error
Typical Runtime (300aa, GPU) | 10-30 minutes | 5-15 minutes | N/A
Recommended Use Case | High-accuracy validation, confidence metrics | Rapid initial screening, complex folds | N/A

Experimental Protocols

Protocol 1: AlphaFold2 Validation of Designed Sequences

Objective: To generate a 3D model and confidence metrics for a ProteinMPNN-designed sequence using AlphaFold2.

Materials & Software:

  • Input: FASTA file of the designed amino acid sequence.
  • Hardware: System with NVIDIA GPU (≥16GB VRAM recommended).
  • Software: Local AlphaFold2 installation (via Docker) or access to ColabFold (Google Colab).
  • Database: Local copies of the AF2 genetic (UniRef90, MGnify, Uniclust30, BFD) and structural (PDB70, PDB mmCIF) databases.

Methodology:

  • Sequence Input: Place the designed sequence in a single-entry FASTA file.
  • Multiple Sequence Alignment (MSA): Run the jackhmmer or MMseqs2 (via ColabFold) workflow to generate MSAs against genetic databases.
  • Structure Template Search: Search for homologous structures in the PDB70 database using HHsearch.
  • Neural Network Inference: Run all five AlphaFold2 model parameter sets, which yields five predicted structures.
  • Model Selection: The model outputs a ranked list of predictions. Select the model with the highest predicted TM-score (pTM) and average pLDDT.
  • Analysis: Align the top-ranked predicted structure to the original design target using a structural alignment tool (e.g., PyMOL, ChimeraX). Calculate the Cα RMSD.
  • Interpretation: Examine the pLDDT per residue (color-coded in output). Regions with pLDDT < 50 are low confidence. Review the PAE plot to check for predicted domain separation errors.
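
The interpretation step can be partly automated. AlphaFold2 writes per-residue pLDDT into the B-factor column of its output PDB files, so the confidence profile can be read with plain string slicing. The sketch below is a minimal standard-library example; the function names are hypothetical, and the pass/fail thresholds are the ones listed in the validation table above.

```python
def ca_plddt(pdb_text):
    """Return [(chain, residue number, pLDDT)] for every Cα atom.

    Uses fixed PDB columns: atom name 13-16, chain 22, residue number
    23-26, B-factor (pLDDT in AlphaFold2 output) 61-66.
    """
    values = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append((line[21], int(line[22:26]), float(line[60:66])))
    return values


def summarize(plddt, low_cutoff=50.0):
    """Mean pLDDT plus the residues below the low-confidence cut-off."""
    scores = [score for _, _, score in plddt]
    mean = sum(scores) / len(scores)
    low = [(chain, num) for chain, num, score in plddt if score < low_cutoff]
    return mean, low


def meets_thresholds(ca_rmsd, mean_plddt, ptm):
    """Apply the validation thresholds: RMSD < 2.0 Å, pLDDT > 70, pTM > 0.7."""
    return ca_rmsd < 2.0 and mean_plddt > 70.0 and ptm > 0.7
```

The Cα RMSD and pTM passed to meets_thresholds come from the structural alignment and the AlphaFold2 output, respectively; they are not computed here.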

Protocol 2: RoseTTAFold Validation of Designed Sequences

Objective: To generate a complementary 3D model using the RoseTTAFold pipeline.

Materials & Software:

  • Input: FASTA file of the designed amino acid sequence.
  • Hardware: System with NVIDIA GPU.
  • Software: Local RoseTTAFold installation (via Docker) or access to the Robetta server (web-based).
  • Database: Requires UniRef30, BFD, and PDB70 databases.

Methodology:

  • Input Preparation: Create a FASTA file with the designed sequence.
  • MSA Generation: Generate MSAs using HHblits against the UniRef30 and BFD databases.
  • Template Search: Perform a template search against the PDB70 database.
  • Inference: Run the RoseTTAFold three-track neural network. By default, it generates 5 models.
  • Model Selection: Models are typically ranked by the network's internal confidence score. Select the top-ranked model.
  • Analysis: As with AF2, structurally align the top RTF prediction to the target scaffold and compute Cα RMSD.
  • Interpretation: Analyze the predicted error estimates (provided in B-factor column of output PDB). Lower values indicate higher confidence.

Visualization of Validation Workflow

[Diagram: ProteinMPNN De Novo Sequence Design → Designed Sequence (FASTA File) → two parallel branches: AlphaFold2 Pipeline → Predicted Structure (PDB + pLDDT/PAE), and RoseTTAFold Pipeline → Predicted Structure (PDB + Error Estimate) → Structural Alignment & RMSD Calculation → Validation Decision: Fold Recovery?]

Title: Workflow for Validating ProteinMPNN Designs with AF2 and RTF

The Scientist's Toolkit: Research Reagent Solutions

Item | Function in Validation Protocol
AlphaFold2 (via ColabFold) | Cloud-accessible, user-friendly implementation of AF2; eliminates local installation overhead.
RoseTTAFold (Robetta Server) | Web server for RTF; provides a no-code interface for quick predictions.
PyMOL/ChimeraX | Molecular visualization software for structural superposition and RMSD measurement.
Local High-Performance Compute (HPC) Cluster | For batch validation of hundreds of designed sequences, ensuring timely analysis.
Custom Scripting (Python/Bash) | To automate the workflow from FASTA generation to RMSD analysis, ensuring reproducibility.
pLDDT & PAE Analysis Scripts | Custom scripts to parse and visualize confidence metrics across multiple designs.

This document presents application notes and protocols within the broader thesis context of utilizing ProteinMPNN for de novo enzyme sequence design. It focuses on translating computational designs into functional real-world applications, detailing experimental validation workflows essential for researchers and drug development professionals.


Application Note 1: De Novo Design of a Kemp Eliminase Therapeutic Prototype

Thesis Context: Demonstrates the pipeline from ProteinMPNN-generated sequences for a novel catalytic fold to in vitro validation, establishing a proof-of-concept for designing enzymes that metabolize disease-linked toxins.

Background: Kemp elimination is a model reaction for proton transfer from carbon, used as a benchmark in enzyme design. A designed eliminase could theoretically be tailored to cleave specific toxic metabolites.

Design & Quantitative Data Summary:

  • Computational Design: A stable 8-stranded beta-barrel scaffold was selected from the PDB. Using Rosetta for catalytic site placement (His-Asp dyad) and ProteinMPNN for sequence optimization (10,000 sequences generated), the top 5 designs were selected for expression.
  • Expression & Purification Yield: All designs expressed solubly in E. coli BL21(DE3). One design (KE-Design_03) showed superior properties.

Table 1: Characterization Data for Top Kemp Eliminase Design (KE-Design_03)

Parameter | Value | Measurement Method
Expression Yield | 18.5 mg/L | Bradford assay post-IMAC
Purified Protein Purity | >95% | SDS-PAGE densitometry
Thermal Melting Point (Tm) | 68.4 °C | DSF (Differential Scanning Fluorimetry)
Catalytic Efficiency (kcat/Km) | 1.2 x 10³ M⁻¹s⁻¹ | Kinetic assay with 5-nitrobenzisoxazole
Activity vs. Background | 10⁵-fold enhancement | Comparison to uncatalyzed reaction rate

Protocol 1.1: High-Throughput Kinetic Screening of Designed Kemp Eliminases

Objective: Rapid quantification of catalytic activity for designed enzyme variants.

Materials:

  • Purified enzyme variants in 50 mM Tris-HCl, 150 mM NaCl, pH 8.0.
  • Substrate: 100 mM stock of 5-nitrobenzisoxazole in DMSO.
  • Assay Buffer: 50 mM Tris-HCl, pH 8.0.
  • 96-well clear flat-bottom UV-transparent microplate.
  • Plate reader capable of kinetic measurements at 380 nm.

Methodology:

  • Dilute all enzyme variants to a standard concentration of 1 µM in assay buffer.
  • Add 180 µL of each enzyme solution to designated wells. Include a buffer-only control.
  • Prepare a 10× substrate master mix (2 mM) in assay buffer, so that adding 20 µL to 180 µL of enzyme gives a final well concentration of 200 µM.
  • Initiate the reaction by adding 20 µL of substrate master mix to each well using a multichannel pipette. Mix immediately by orbital shaking.
  • Immediately monitor the increase in absorbance at 380 nm, which reports formation of the 2-cyano-4-nitrophenolate product (ε₃₈₀ ≈ 15,800 M⁻¹cm⁻¹), for 5 minutes at 25°C.
  • Calculate initial velocities from the linear slope. Convert to turnover rate using the pathlength correction and extinction coefficient.
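
The conversion in the last step follows from the Beer-Lambert law, dA/dt = ε · l · d[P]/dt. The sketch below works through that arithmetic; the extinction coefficient and the pathlength are passed in explicitly because the pathlength of a 200 µL well depends on the plate and should be measured or taken from the plate reader. The function names are hypothetical.

```python
def final_concentration(stock_molar, volume_added_ul, total_volume_ul):
    """Concentration after dilution into the final well volume."""
    return stock_molar * volume_added_ul / total_volume_ul


def turnover_rate(slope_au_per_min, blank_slope_au_per_min,
                  epsilon_per_molar_cm, pathlength_cm, enzyme_molar):
    """Convert an absorbance slope into an apparent turnover rate (s^-1).

    The buffer-only slope is subtracted to remove the uncatalyzed
    background before applying Beer-Lambert.
    """
    net_slope_per_s = (slope_au_per_min - blank_slope_au_per_min) / 60.0
    velocity_molar_per_s = net_slope_per_s / (epsilon_per_molar_cm * pathlength_cm)
    return velocity_molar_per_s / enzyme_molar
```

With 180 µL of 1 µM enzyme in a 200 µL well, final_concentration gives 0.9 µM, which is the enzyme concentration to use in turnover_rate. At a single substrate concentration the result is an apparent rate, not kcat.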

Diagram: Workflow for Therapeutic Enzyme Design & Validation

[Diagram: Define Catalytic Thematic & Scaffold → Rosetta Active Site Design → ProteinMPNN Sequence Optimization → In silico Screening (Top 5 Designs) → Gene Synthesis & Cloning → E. coli Expression & IMAC Purification → Biophysical Characterization (DSF) → High-Throughput Kinetic Assay → Lead Candidate KE-Design_03 → Therapeutic Prototype for Toxin Metabolism]


Application Note 2: Engineering a Biocatalyst for API Synthesis (Transaminase)

Thesis Context: Highlights the use of ProteinMPNN in the de novo design of stability-enhancing mutations within a known transaminase fold, moving from lab-scale activity to process-relevant metrics.

Background: Chiral amines are critical building blocks for Active Pharmaceutical Ingredients (APIs). (S)-selective ω-transaminases are valuable biocatalysts but often require optimization for operational stability and substrate scope.

Design & Quantitative Data Summary:

  • Computational Stabilization: A known transaminase (PDB: 4AH3) was redesigned using ProteinMPNN to repack the core and dimer interface, fixing 15 positions. 50 sequences were generated and ranked on predicted stability (ddG).
  • Process-Relevant Validation: The lead variant was tested under simulated manufacturing conditions.

Table 2: Process Metrics for Designed Transaminase vs. Wild Type (WT)

Parameter | Wild Type (WT) | Designed Variant (TA-MPNN_07) | Assay Conditions
Specific Activity | 4.2 U/mg | 5.1 U/mg | 1 mM acetophenone, 30°C, pH 7.5
Thermal Stability (T50) | 48°C | 62°C | 1 hr incubation, residual activity
Solvent Tolerance | <15% activity retained | 78% activity retained | 2 hr in 20% DMSO (v/v)
Total Turnover Number (TTN) | 4,500 | 28,000 | 10 mM substrate, 24h batch
Enantiomeric Excess (ee) | >99% (S) | >99% (S) | Chiral HPLC analysis

Protocol 2.1: Assessing Operational Stability via Total Turnover Number (TTN)

Objective: Determine the total number of product molecules formed per enzyme molecule before inactivation under process conditions.

Materials:

  • Purified transaminase variant (5 mg/mL).
  • Substrate A: 100 mM (S)-α-methylbenzylamine in 100 mM HEPES, pH 7.5.
  • Substrate B: 100 mM sodium pyruvate.
  • Co-factor: 10 mM PLP (Pyridoxal-5'-phosphate).
  • HPLC system with chiral column (e.g., Chiralpak AD-H).

Methodology:

  • Set up a 10 mL reaction in a jacketed reactor at 30°C: 10 mM Substrate A, 12 mM Substrate B, 1 mM PLP, 1 µM enzyme in 100 mM HEPES, pH 7.5.
  • Stir the reaction continuously. Monitor reaction progress by taking 100 µL aliquots every 30 minutes for the first 4 hours, then hourly up to 24 hours.
  • Quench each aliquot with 100 µL of acetonitrile, vortex, centrifuge, and analyze supernatant by HPLC to quantify product (acetophenone) and remaining substrates.
  • Plot product concentration over time. The reaction will plateau as the enzyme inactivates.
  • Calculate TTN using the formula: TTN = (Moles of product at plateau) / (Moles of enzyme in the reaction).
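
The TTN formula reduces to a ratio of concentrations when product and enzyme share the same reaction volume. A minimal sketch, with hypothetical function names:

```python
def total_turnover_number(product_plateau_molar, enzyme_molar):
    """TTN = moles of product at plateau / moles of enzyme (same volume)."""
    return product_plateau_molar / enzyme_molar


def max_observable_ttn(limiting_substrate_molar, enzyme_molar):
    """Upper bound on TTN in a batch: full conversion of the limiting substrate."""
    return limiting_substrate_molar / enzyme_molar
```

In a batch reaction the measured TTN cannot exceed the substrate-to-enzyme molar ratio, so the enzyme loading has to be low enough that the plateau reflects enzyme inactivation and not substrate depletion.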

Diagram: Transaminase Catalytic Cycle & Engineering Goals

[Diagram: Enzyme-PLP Complex → 1. Substrate Amine Binding & Transamination (releases Ketone Byproduct) → Enzyme-PMP Intermediate → 2. Ketone Cosubstrate Binding & Transamination (releases Chiral Amine Product) → back to Enzyme-PLP Complex. ProteinMPNN Engineering Goals (Stabilize PLP Binding, Enhance Dimer Interface, Rigidify Active Site) act on the Enzyme-PLP Complex and on both transamination steps.]


The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for De Novo Enzyme Design & Validation

Reagent / Material | Supplier Examples | Function in Workflow
ProteinMPNN Web Server / Code | GitHub (dauparas/ProteinMPNN) | Machine learning-based sequence design for fixed backbones.
Rosetta Software Suite | University of Washington | Provides energy functions for catalytic site placement (RosettaDesign) and interface with ProteinMPNN.
Custom Gene Fragments | Twist Bioscience, IDT | Synthesis of computationally designed DNA sequences for cloning.
pET Expression Vectors | Novagen (Merck) | Standard high-yield protein expression plasmids for E. coli.
Ni-NTA Agarose Resin | Qiagen, Cytiva | Immobilized metal affinity chromatography (IMAC) for His-tagged protein purification.
Differential Scanning Fluorimetry (DSF) Dye | Thermo Fisher (Protein Thermal Shift) | Fluorescent dye for high-throughput thermal stability (Tm) measurement.
UV-Transparent Microplates | Corning, Greiner Bio-One | Essential for high-throughput kinetic assays monitoring absorbance changes.
Chiral HPLC Columns | Daicel (Chiralpak), Phenomenex | Critical for enantiomeric excess (ee) analysis of chiral products from biocatalysis.

Optimizing ProteinMPNN: Solving Common Problems and Enhancing Design Success

Application Notes: ProteinMPNN for De Novo Enzyme Sequence Design

Within the broader thesis on advancing de novo enzyme design, the reliable execution of ProteinMPNN is critical. Failed runs halt iterative design-test cycles. This document details common errors, their solutions, and essential protocols.

Common Error Messages, Causes, and Solutions

Error Message | Likely Cause | Immediate Solution | Preventative Action
CUDA out of memory | GPU memory insufficient for batch size/model. | Reduce --batch_size (e.g., from 16 to 1), or fall back to CPU by hiding the GPU (e.g., setting CUDA_VISIBLE_DEVICES to an empty string). | Pre-calculate memory needs. Use model with fewer parameters.
KeyError: 'CA' or missing atoms | Input PDB file is malformed or lacks backbone. | Validate PDB with Biopython or FoldX. Use --ca_only flag if only Cα atoms are present. | Always pre-process structures: fix residues, remove heteroatoms, ensure chain continuity.
RuntimeError: Sizes of tensors must match | Mismatch between sequence length and number of residues in the PDB. | Ensure the parsed FASTA sequence length equals the number of residues in the parsed PDB chain. | Use consistent parsing tools (e.g., Bio.PDB) for both sequence and structure.
TypeError: can't convert cuda:0 device type tensor to numpy | Attempting to move GPU tensor to CPU incorrectly. | Use .cpu().detach().numpy() on tensors before numpy operations. | Standardize post-processing function to handle device placement.
No sequences generated / Empty output | All designed sequences removed by a downstream score threshold, or invalid sampling settings. | Relax or remove the score threshold. Check that --sampling_temp is valid and --num_seq_per_target > 0. | Start with default parameters (--sampling_temp 0.1, no score threshold). Verify chain break definition.

Experimental Protocol: Standardized ProteinMPNN Run with Pre- and Post-Processing

1. PDB File Pre-Processing

  • Objective: Generate a clean, canonical PDB input.
  • Steps:
    • Source your enzyme scaffold PDB (e.g., from AlphaFold DB or a crystal structure).
    • Isolate the target chain(s). Remove water, ions, and non-relevant ligands.
    • Use FoldX RepairPDB or the clean_pdb.py script (distributed with Rosetta) to fix residue names, add missing heavy atoms in side chains, and ensure standard formatting.
    • (For fixed backbone design): If designing a specific region, prepare a per-residue mask (0 for fixed, 1 for designed) corresponding to each residue in the cleaned PDB, and convert the fixed positions into the JSONL file passed via --fixed_positions_jsonl.

2. ProteinMPNN Execution

  • Objective: Generate stable, diverse sequences for the input backbone.
  • Command Template:

  • Validation: Check the generated seqs/*.fa file. It should contain the specified number of designed sequences; the reference implementation also writes the input (native) sequence as the first record.
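
A command template for this step can be assembled programmatically, which makes parameter sweeps easier to script. The sketch below builds the argument list for protein_mpnn_run.py using arguments from the public ProteinMPNN repository; the default values chosen here, the helper names, and the assumption that the script is run from the repository root are illustrative.

```python
import shlex


def build_mpnn_command(pdb_path, out_folder, chains="A", num_seq=8,
                       sampling_temp=0.1, batch_size=1, seed=37,
                       fixed_positions_jsonl=None):
    """Assemble the argument list for a single-structure ProteinMPNN run."""
    cmd = [
        "python", "protein_mpnn_run.py",
        "--pdb_path", pdb_path,
        "--pdb_path_chains", chains,
        "--out_folder", out_folder,
        "--num_seq_per_target", str(num_seq),
        "--sampling_temp", str(sampling_temp),
        "--batch_size", str(batch_size),
        "--seed", str(seed),
    ]
    if fixed_positions_jsonl:
        # Optional: keep selected positions fixed during design.
        cmd += ["--fixed_positions_jsonl", fixed_positions_jsonl]
    return cmd


def as_shell(cmd):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)


def count_fasta_records(fasta_text):
    """Count header lines in FASTA text, for the output validation step."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))
```

The list returned by build_mpnn_command can be passed to subprocess.run(cmd, check=True), and count_fasta_records can then be applied to the contents of the resulting seqs/*.fa file.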

3. Post-Processing and Filtering

  • Objective: Prepare sequences for downstream energy scoring or expression.
  • Steps:
    • Parse the generated FASTA file.
    • (Optional) Filter sequences based on amino acid composition (e.g., remove those with >25% of a single residue).
    • (Recommended) Score sequences using an energy function (e.g., Rosetta ref2015) or a structure-prediction confidence metric (e.g., alphafold2_ptm via ColabFold) to select top candidates for in silico folding.
    • The top ~20 designs proceed to gene synthesis and wet-lab validation.
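
The parsing and composition-filter steps can be sketched with the standard library alone. The 25% cut-off is the one quoted above; treating "/" as a chain separator in multi-chain output is an assumption about the FASTA layout, and the function names are hypothetical.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def max_residue_fraction(sequence):
    """Fraction of the sequence taken up by its most common residue."""
    residues = sequence.replace("/", "")  # assumed chain separator
    return max(Counter(residues).values()) / len(residues)


def composition_filter(records, max_fraction=0.25):
    """Keep records in which no single residue exceeds max_fraction."""
    return [(h, s) for h, s in records if max_residue_fraction(s) <= max_fraction]
```

Low-complexity sequences of this kind tend to appear more often at low sampling temperatures, so the filter is most useful there.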

Visualization: ProteinMPNN Design and Troubleshooting Workflow

[Diagram: Raw PDB (Scaffold Structure) → Pre-Processing Module → Validated & Cleaned PDB → ProteinMPNN Core Engine → Raw Sequence Output (FASTA) → Post-Processing & Filtering → Final Candidate Sequences for Experimental Test. Error loops: "KeyError 'CA' / Malformed PDB" returns to pre-processing after fixing and validating the PDB; "CUDA OOM or Tensor Mismatch" returns to ProteinMPNN after adjusting parameters or device; "No Sequences Generated" returns to ProteinMPNN after lowering the temperature or threshold.]

Diagram Title: ProteinMPNN Design Pipeline with Error Intervention Points

The Scientist's Toolkit: Key Research Reagent Solutions

Item / Solution | Function in ProteinMPNN Enzyme Design | Example / Specification
Pre-Processed PDB File | The canonical input; defines the fixed backbone scaffold for sequence design. | Cleaned file, single chain, standard residue names, no gaps in backbone.
Residue Mask File | Specifies which positions are fixed (0) and which are designed (1). Enables focused design on active sites. | Text file with "0" or "1" per line, length = residue count.
CUDA-Compatible GPU | Accelerates the neural network inference of ProteinMPNN. Essential for high-throughput design. | NVIDIA GPU with >8GB VRAM (e.g., A100, RTX 4090).
FoldX Suite | Software for PDB repair and stability calculation. Used for pre-processing and post-design energy scoring. | FoldX5 or later; RepairPDB command.
Rosetta or ColabFold | Provides alternative energy functions or folding validation to filter ProteinMPNN outputs. | Rosetta ref2015 or ColabFold alphafold2_ptm for confidence metrics.
Custom Python Environment | Ensures reproducibility with specific versions of PyTorch, Biopython, etc. | Conda/YAML file specifying torch==1.12.1+cu113.

Within the broader thesis on using ProteinMPNN for de novo enzyme sequence design, fine-tuning generation parameters is critical for producing functional, diverse, and foldable protein sequences. This document provides application notes and protocols for three core parameters: Temperature, Sampling, and Chain Masking. Effective tuning balances sequence diversity with native-like structural compatibility, directly impacting downstream experimental validation in enzyme engineering and therapeutic protein development.

Core Parameter Definitions & Quantitative Effects

Table 1: Core ProteinMPNN Parameters and Their Functions

Parameter | Type | Default Value | Function in Enzyme Design | Primary Impact
Temperature | Continuous | 0.1 | Controls the randomness of the amino acid probability distribution during decoding. | Sequence Diversity vs. Probability
Sampling Method | Categorical | Greedy | Decoding strategy: Argmax (greedy) vs. Stochastic (multinomial). | Deterministic vs. Stochastic Output
Chain Masking | String/List | None | Specifies which protein chains' sequences are to be redesigned/fixed. | Design Scope & Interface Engineering

Table 2: Quantitative Effects of Temperature Tuning in ProteinMPNN (Representative Data)

Temperature | Perplexity (↓=Confident) | Sequence Recovery (%) | Shannon Entropy (Diversity) | Typical Use Case
0.01 - 0.1 | Low (~1.5) | High (>40%) | Low | Recapitulating native sequences, conservative design.
0.15 - 0.3 | Moderate (~2.5) | Moderate (25-40%) | Moderate | Balanced exploration for novel enzyme scaffolds.
0.5 - 1.0 | High (>5.0) | Low (<20%) | High | High-diversity generation for massively parallel screening.
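
The trend in the table follows from how temperature enters the sampling step: logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The sketch below illustrates this with made-up logits for a three-letter alphabet; it is a generic temperature-scaled softmax, not code taken from ProteinMPNN.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax: p_i is proportional to exp(logit_i / T)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

For the same logits, the entropy at T = 0.1 is far lower than at T = 1.0, which is the diversity difference the table summarizes.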

Experimental Protocols

Protocol 1: Systematic Temperature Scan for Enzyme Loop Design

Objective: Identify the optimal temperature for generating diverse, yet structurally plausible, loops in a TIM-barrel enzyme catalytic site.

Materials: Prepackaged ProteinMPNN environment (see Toolkit), input PDB of scaffold (e.g., 1TIM), FASTA file of wild-type sequence.

Procedure:

  • Input Preparation: Generate a JSON file specifying the chain to be designed (e.g., chain A) and the fixed positions: every residue except the redesignable catalytic loop (residues 95-110).
  • Parameter Grid: Create a script to run ProteinMPNN iteratively with temperatures: [0.1, 0.15, 0.2, 0.3, 0.5, 1.0]. Use stochastic (multinomial) sampling for the scan, since temperature has no effect on greedy (argmax) decoding.
  • Execution: For each temperature, generate 100 sequences. Use the command-line flag: --sampling_temp X.
  • Analysis:
    • Calculate sequence entropy at each position across the 100 outputs.
    • Use AlphaFold2 or ESMFold to predict structures for a subset (e.g., 10 per temperature) and compute the TM-score of each predicted structure against the scaffold.
    • Plot temperature vs. average positional entropy & average predicted TM-score.
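
The per-position entropy in the analysis step can be computed directly from the sampled sequences. A minimal sketch, assuming all sequences for a given temperature have the same length (as they do for fixed-backbone design); the function names are hypothetical.

```python
import math
from collections import Counter


def positional_entropy(sequences):
    """Shannon entropy (bits) at each position of equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    entropies = []
    for column in zip(*sequences):
        total = len(column)
        counts = Counter(column)
        entropies.append(
            -sum((n / total) * math.log2(n / total) for n in counts.values())
        )
    return entropies


def mean_entropy(sequences, positions=None):
    """Average entropy over all positions, or over 0-based `positions`."""
    per_position = positional_entropy(sequences)
    chosen = per_position if positions is None else [per_position[i] for i in positions]
    return sum(chosen) / len(chosen)
```

Restricting mean_entropy to the redesigned loop positions avoids diluting the signal with fixed residues, whose entropy is zero by construction.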

Protocol 2: Combining Stochastic Sampling with Chain Masking for Interface Optimization

Objective: Redesign the binding interface of an enzyme (Chain A) while keeping its catalytic domain and protein partner (Chain B) fixed.

Materials: PDB of enzyme-protein complex, list of interface residues (Chain A) determined by PDBePISA.

Procedure:

  • Chain Masking Definition: In the input JSON, set "chain_mask": {"A": 1, "B": 0}. This indicates Chain A is to be redesigned (mask=1), and Chain B is fixed (mask=0), consistent with the 0 = fixed, 1 = designed convention used for residue masks.
  • Sampling Setup: Configure ProteinMPNN with sampling_method="multinomial" and a moderate temperature (e.g., 0.2). This introduces stochasticity for exploring alternative interface sequences.
  • Focused Masking (Optional): To redesign only the interface, specify "fixed_positions" to include all non-interface residues of Chain A.
  • Generate Sequences: Sample 500 sequences.
  • Filtering: Rank outputs by ProteinMPNN score, then filter using protein-protein docking (e.g., HADDOCK) to assess complementarity with the fixed Chain B. Select top 20 for experimental testing.

Visualizations

[Diagram: Input: Protein Backbone (PDB), together with three parameters (Temperature: Low/Med/High; Sampling Method: Greedy/Multinomial; Chain Masking: Define Fixed/Redesigned) → ProteinMPNN Sequence Model → Output: Designed Sequences (FASTA) → Downstream Analysis: Folding (AF2), Docking, Experimental Validation]

ProteinMPNN Parameter Tuning Workflow

Figure: Temperature Effect on Probability Distribution & Output. Low temperature (0.1) gives a sharp, peaked probability distribution and low-diversity, high-recovery sequences; high temperature (0.8) gives a flatter, more uniform distribution and high-diversity, low-recovery sequences.
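
The temperature effect can be illustrated numerically with a temperature-scaled softmax. This is a generic sketch, not ProteinMPNN's own code; the logits are hypothetical values for three residue types at a single position.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical logits for three residue types at one position.
logits = [2.0, 1.0, 0.0]
p_low = softmax_with_temperature(logits, 0.1)   # nearly all mass on the top residue
p_high = softmax_with_temperature(logits, 0.8)  # mass spread across alternatives
```

At T = 0.1 the top-ranked residue receives almost all of the probability, while at T = 0.8 the alternatives are sampled regularly, which is the source of the diversity/recovery trade-off described above.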

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions for ProteinMPNN-Driven Enzyme Design

Item | Function in Workflow | Example/Note
Pre-processed PDB Files | Clean input structure with correct chain IDs and removed heteroatoms (non-protein). | Use pdb-tools or Rosetta clean_pdb.py.
ProteinMPNN Weights (v1.0 or later) | The trained neural network parameters for sequence prediction. | Downloaded from official GitHub repository.
Structure Prediction Server | Validating foldability of designed sequences. | Local AlphaFold2/3, ESMFold, or ColabFold.
Multiple Sequence Alignment (MSA) Tool | Assessing evolutionary plausibility of designs. | Jackhmmer (HMMER) against UniRef90.
Molecular Dynamics (MD) Suite | Preliminary stability assessment of designs. | GROMACS, AMBER, or OpenMM.
Cloning & Expression Kit | Experimental validation of designed enzymes. | NEB Golden Gate Assembly, T7 expression in E. coli.
High-throughput Activity Assay | Screening functional designs. | Plate-based spectrophotometric or fluorometric assay.

Application Notes: Integrating ProteinMPNN with Structural Biophysics

De novo enzyme design requires not only the generation of functional sequences but also the precise control over tertiary and quaternary structure. ProteinMPNN, a message-passing neural network for protein sequence design, excels at recovering native-like sequences from backbones but can be strategically guided to address specific structural challenges. These notes detail its application for designing stable hydrophobic cores, native disulfide bonds, and specific oligomeric states.

Hydrophobic Core Design

A well-packed hydrophobic core is fundamental for protein stability and folding. ProteinMPNN's likelihood-based sampling can be biased by masking solvent-exposed positions and applying residue-type constraints.

Key Data from Recent Studies:

Study (Year) | Method | Core Packing Density Improvement | ΔΔG Stability (kcal/mol) | Success Rate (Folded/Stable)
Wang et al. (2023) | ProteinMPNN with omit_AA (exclude polar residues at core) | 1.12 ų/Da (from 1.05) | +0.8 to +2.1 | 12/15 designs
Anishchenko et al. (2024) | RFdiffusion backbone + ProteinMPNN with hydrophobic bias | N/A | Avg +1.5 | 78% (by CD melting)
Protocol Benchmark | Native sequence recovery | Core positions: 85% | Surface: 45% | Overall: 68%

Protocol: Designing an Optimized Hydrophobic Core

  • Input Preparation: Generate or provide a target backbone (e.g., from RFdiffusion or a natural scaffold). Define core positions using a tool like RosettaHoles or by solvent accessibility (<10% RSA).
  • Constraint Specification: Use ProteinMPNN's per-position omit_AA option. For each core position, omit amino acids C, D, E, H, K, N, Q, R, S, T, Y (i.e., allow only A, F, G, I, L, M, P, V, W). Optionally use bias_AA to favor large hydrophobes (F, I, L, M, W) at the deepest core positions.
  • Sampling & Selection: Run ProteinMPNN with num_seq_per_target=200. Filter generated sequences for:
    • High hydrophobicity score in core positions.
    • Absence of large cavities (assess with SCUBA or PyMOL castp).
    • High ProteinMPNN per-residue likelihood score.
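  • Validation: Model sequences with AlphaFold2 or ESMFold. Select top models with high pLDDT in the core (>70) for experimental testing; designs whose core residues fall below this threshold are predicted with low confidence and should be discarded.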
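
The core selection and constraint steps above can be sketched as follows. The RSA values are hypothetical, and the nested dictionary layout (structure name, chain, then position list paired with residue types to omit) is an assumption about the per-position omit file that should be checked against the ProteinMPNN version in use.

```python
import json

ALL_AA = "ACDEFGHIKLMNPQRSTVWY"
CORE_ALLOWED = "AFGILMPVW"  # residue types permitted in the core (from the protocol)

def core_omit_spec(rsa_by_position, rsa_cutoff=0.10):
    """Select buried positions (RSA below cutoff) and list the residue types to omit there."""
    core = sorted(pos for pos, rsa in rsa_by_position.items() if rsa < rsa_cutoff)
    omitted = "".join(aa for aa in ALL_AA if aa not in CORE_ALLOWED)
    return core, omitted

# Hypothetical relative solvent accessibilities (fraction, 0-1) keyed by 1-based position.
rsa = {1: 0.62, 2: 0.04, 3: 0.31, 4: 0.08, 5: 0.00, 6: 0.45}
core_positions, omitted_aas = core_omit_spec(rsa)

# Assumed layout: {structure name: {chain: [[positions, residue types to omit]]}}.
omit_dict = {"scaffold": {"A": [[core_positions, omitted_aas]]}}
omit_json = json.dumps(omit_dict)
```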

Disulfide Bond Engineering

Disulfide bonds confer stability, especially to extracellular enzymes. ProteinMPNN can explicitly design cysteines at specified paired positions.

Key Data on Disulfide Design:

Bond Geometry (Cα-Cα Distance) | Optimal χ3 Dihedral (degrees) | ProteinMPNN Cysteine Recovery with Paired Masking | Stabilization ΔTm (°C) Range
4.0 – 6.5 Å | approx. ±90 | 92% (vs. 5% without constraints) | +5 to +20

Failed Designs Cause | Mispacked Cysteines | Reduced State Unstable | Strain in Bond Geometry
Frequency | ~15% | ~10% | ~20%

Protocol: Engineering a Native Disulfide Bond

  • Position Selection: Identify residue pairs (i, j) in the backbone model where Cα-Cα distance is 4.0-6.5 Å, Cβ-Cβ distance is 3.5-4.5 Å, and the predicted χ3 dihedral is near ideal (approximately ±90°).
  • ProteinMPNN Execution: Use the tied_positions argument. Provide a list like [[i, j]] to tie these positions, which forces them to be sampled with the same amino acid identity. Use omit_AA to allow only cysteine ('C') at these tied positions.
  • Sequence Generation: Run design. All output sequences will have cysteines at both positions i and j.
  • Post-Design Analysis: Use Rosetta's disulfidize or Foldit Disulfide Energy to evaluate geometry strain. Filter out sequences where non-cysteine residues at adjacent positions may cause steric clashes.
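
The distance criteria in the position-selection step can be applied as a simple geometric pre-filter before any sequence design. This sketch assumes Cα and Cβ coordinates have already been extracted into dictionaries keyed by residue number; the coordinates below are hypothetical, and the χ3 check is left to the post-design analysis because it requires side-chain atoms.

```python
import math
from itertools import combinations

def disulfide_candidates(ca, cb, ca_range=(4.0, 6.5), cb_range=(3.5, 4.5), min_separation=3):
    """Return residue pairs (i, j) whose Cα-Cα and Cβ-Cβ distances fall in the target windows."""
    pairs = []
    for i, j in combinations(sorted(ca), 2):
        if j - i < min_separation or i not in cb or j not in cb:
            continue  # skip near neighbours and residues without a Cβ (glycine)
        d_ca = math.dist(ca[i], ca[j])
        d_cb = math.dist(cb[i], cb[j])
        if ca_range[0] <= d_ca <= ca_range[1] and cb_range[0] <= d_cb <= cb_range[1]:
            pairs.append((i, j, round(d_ca, 2), round(d_cb, 2)))
    return pairs

# Hypothetical coordinates (Å) for three residues.
ca = {5: (0.0, 0.0, 0.0), 20: (5.5, 0.0, 0.0), 40: (20.0, 0.0, 0.0)}
cb = {5: (1.0, 1.0, 0.0), 20: (4.8, 1.0, 0.0), 40: (21.0, 1.0, 0.0)}
candidates = disulfide_candidates(ca, cb)
```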

Oligomerization State Control

Designing specific homo-oligomers requires enforcing symmetry and designing complementary interfaces. ProteinMPNN's symmetric sampling is key.

Interface Design Metrics:

Oligomer Type | Symmetry Argument in ProteinMPNN | Key Interface Metric (ΔSASA) | Target Hydrophobic Content at Interface | Success Rate (Correct Assembly)
Homodimer | symmetry="C2" | 800-1200 Ų | 55-70% | 65% (Cryo-EM validation)
Homotrimer | symmetry="C3" | 1500-2200 Ų | 50-65% | 58%
Homotetramer | symmetry="D2" | 2400-3600 Ų | 50-60% | 52%

Protocol: Designing a Homo-oligomeric Enzyme

  • Symmetric Backbone: Start with a symmetric backbone assembly (e.g., from Rosetta SymmetricAssembly, RFdiffusion with symmetry prompt, or AlphaFold2-multimer on a symmetric sequence).
  • Interface Definition: Calculate residue-residue contacts across the symmetry axis. Define interface residues as those with >20% ΔSASA upon complex formation.
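  • Symmetry-Aware Design: Run ProteinMPNN with the appropriate symmetry setting (e.g., C2 for a dimer) so that all chains receive identical sequences. In the reference implementation this is achieved by tying equivalent positions across chains (tied_positions, e.g., generated with the make_tied_positions_dict.py helper using --homooligomer 1) rather than by a dedicated symmetry flag.
  • Interface Optimization: To enrich for hydrophobic packing, bias interface positions (using bias_AA) towards hydrophobic residues (A, I, L, V, F, W, M). Note that tied_positions force the same residue type at every tied position, so they cannot by themselves create a complementary pair such as R and D. To encourage polar interactions across the interface, treat the two partner positions separately (e.g., with per-position biases toward R at one position and D at the other).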
  • Assessment: Use AlphaFold-Multimer or RoseTTAFold2 to predict the complex from the monomeric sequence. Analyze interface energy with PDBePISA or Rosetta InterfaceAnalyzer.
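
Tying equivalent positions across chains, as used for symmetry-aware design, amounts to building one group per residue index. The sketch below is illustrative; the list-of-dictionaries layout is an assumption modeled on tied-position files and should be checked against the ProteinMPNN version in use.

```python
def homooligomer_tied_positions(chain_ids, chain_length):
    """Tie residue k of every chain together so all chains receive the same sequence."""
    return [{chain: [pos] for chain in chain_ids} for pos in range(1, chain_length + 1)]

# Hypothetical C2 homodimer with four residues per chain.
tied = homooligomer_tied_positions(["A", "B"], chain_length=4)
```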

The Scientist's Toolkit: Research Reagent Solutions

Item | Function in Enzyme Design Pipeline | Example Product/Code
ProteinMPNN (Colab) | Neural network for sequence design given a fixed backbone. Enforces constraints. | protein_mpnn_run.py (GitHub)
RFdiffusion | Generates novel protein backbones conditioned on motifs, symmetry, or shapes. Creates inputs for ProteinMPNN. | RFdiffusion (GitHub)
PyRosetta | Suite for structural analysis, energy scoring, and detailed biochemical modeling (e.g., disulfide geometry). | PyRosetta License
AlphaFold2 / ColabFold | Rapid in silico validation of designed sequence foldability and complex assembly. | colabfold:AlphaFold2
Size-Exclusion Chromatography (SEC) Column | Experimental validation of oligomeric state in solution. | Superdex 75 Increase 10/300 GL
Circular Dichroism (CD) Spectrometer | Assess secondary structure content and thermal stability (Tm). | Chirascan (Applied Photophysics)
TCEP (Tris(2-carboxyethyl)phosphine) | Reducing agent to test disulfide bond role by comparing stability +/- reduction. | Thermo Scientific 77720
Multi-angle Light Scattering (MALS) Detector | Coupled with SEC for absolute molecular weight determination of oligomers. | Wyatt miniDAWN TREOS

Experimental Protocols

Protocol: Comprehensive Validation of a Designed Enzyme

Objective: Biochemically characterize a ProteinMPNN-designed enzyme for core packing, disulfide integrity, and oligomerization.

Materials: Purified protein, SEC-MALS system, CD spectrometer, reducing and oxidizing buffers.

Procedure:

  • SEC-MALS Analysis:
    • Equilibrate SEC column in buffer (e.g., 20 mM Tris, 150 mM NaCl, pH 8.0).
    • Inject 100 µL of protein at 2 mg/mL.
    • Measure elution volume (Ve) and calculate molecular weight via MALS and refractive index. Compare to theoretical oligomer mass.
  • Thermal Stability Assay (CD):

    • Prepare protein at 0.2 mg/mL in appropriate buffer.
    • In a 1 mm quartz cuvette, monitor ellipticity at 222 nm from 20°C to 95°C, ramp 1°C/min.
    • Fit curve to a sigmoidal unfolding model to determine Tm.
    • Repeat in buffer with 10 mM TCEP (reducing) and 2 mM GSSG/0.2 mM GSH (oxidizing). A higher Tm in oxidizing conditions suggests that the designed disulfide stabilizes the fold.
  • Chemical Denaturation (ΔG calculation):

    • Prepare serial dilutions of GuHCl (0-6 M) with protein.
    • Incubate overnight, measure fluorescence (Trp emission) or CD signal.
    • Fit the unfolding curve to calculate the free energy of unfolding (ΔG_unfolding). Compare to native scaffolds.
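
One common way to obtain ΔG_unfolding from the chemical denaturation data is the linear extrapolation method, sketched here under a two-state assumption. The fraction-unfolded values are assumed to have been derived from the fluorescence or CD signal already; the synthetic data are generated from known parameters purely to show that the fit recovers them.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def linear_extrapolation(denaturant_molar, fraction_unfolded, temperature_k=298.15):
    """Fit dG_unfolding = dG_H2O - m*[D] over the transition region (0.1 < f_u < 0.9)."""
    xs, ys = [], []
    for conc, fu in zip(denaturant_molar, fraction_unfolded):
        if 0.1 < fu < 0.9:
            k_unfold = fu / (1.0 - fu)
            xs.append(conc)
            ys.append(-R_KCAL * temperature_k * math.log(k_unfold))
    if len(xs) < 2:
        raise ValueError("need at least two points in the transition region")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    dg_water = mean_y - slope * mean_x  # intercept at 0 M denaturant
    m_value = -slope                    # kcal/(mol*M)
    midpoint = dg_water / m_value       # denaturant concentration where dG = 0
    return dg_water, m_value, midpoint

# Synthetic two-state data generated from dG_H2O = 6.0 kcal/mol and m = 2.0 kcal/(mol*M).
rt = R_KCAL * 298.15
gdn = [2.0, 2.5, 3.0, 3.5, 4.0]
f_u = [1.0 / (1.0 + math.exp((6.0 - 2.0 * c) / rt)) for c in gdn]
dg_h2o, m_val, c_mid = linear_extrapolation(gdn, f_u)
```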

Protocol: High-Throughput Screening of Designed Sequences

Objective: Identify functional designs from hundreds of ProteinMPNN-generated sequences.

Workflow:

  • Gene Synthesis & Cloning: Use pooled oligo synthesis (Twist Bioscience) to encode 200 designs. Clone into expression vector via Gibson assembly.
  • Microscale Expression: Perform 1 mL deep-well E. coli expression cultures, induce with IPTG.
  • Lysis & Clarification: Lyse via sonication or chemical lysis, centrifuge.
  • Activity Assay (Plate-based): Transfer lysate supernatant to 96-well plate containing enzyme-specific fluorogenic or chromogenic substrate (e.g., 4-nitrophenyl acetate for esterases). Monitor product formation.
  • Thermal Shift Assay: Use SYPRO Orange dye in a separate plate, heat from 25-95°C in a real-time PCR machine. Measure dye fluorescence; inflection point = apparent Tm.
  • Hit Validation: Select sequences with high activity and high Tm for large-scale purification and detailed analysis (as in the Comprehensive Validation protocol above).
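
The apparent Tm from the thermal shift assay can be estimated as the temperature of steepest fluorescence increase. The sketch below uses a simple finite-difference derivative on hypothetical melt-curve readings; instrument software typically applies smoothing first, which is omitted here.

```python
def apparent_tm(temperatures, fluorescence):
    """Apparent Tm as the midpoint of the interval with the steepest fluorescence increase."""
    best_slope, best_tm = float("-inf"), None
    for k in range(len(temperatures) - 1):
        dt = temperatures[k + 1] - temperatures[k]
        slope = (fluorescence[k + 1] - fluorescence[k]) / dt
        if slope > best_slope:
            best_slope, best_tm = slope, (temperatures[k] + temperatures[k + 1]) / 2
    return best_tm

# Hypothetical SYPRO Orange readings (temperature in °C, fluorescence in arbitrary units).
temps = [40, 45, 50, 55, 60, 65, 70]
signal = [100, 105, 130, 300, 520, 560, 570]
tm = apparent_tm(temps, signal)
```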

Visualizations

Figure: ProteinMPNN Design & Validation Pipeline. Target specification (oligomer, function) → backbone generation (RFdiffusion with symmetry) → define constraints (hydrophobic core masks, tied disulfides, symmetry) → sequence design (ProteinMPNN with constraints) → in silico filtering (AlphaFold2, pLDDT, interface ΔG) → high-throughput experimental screen (expression, activity, Tm) → detailed validation (SEC-MALS, CD, kinetics).

Figure: Design Challenges & Corresponding Strategies. Hydrophobic core: addressed by masking and biasing residues (omit_AA, bias_AA); validated by ΔG of folding and core packing density. Disulfide bonds: addressed by tying cysteine positions (tied_positions); validated by Tm (oxidized vs. reduced) and bond geometry. Oligomerization: addressed by enforcing symmetry (symmetry flag); validated by SEC-MALS molecular weight and interface ΔSASA.

Strategies for Improving Computational Efficiency and Managing Large-Scale Design Campaigns

Within the broader thesis on using ProteinMPNN for de novo enzyme sequence design, the challenge extends beyond accurate sequence prediction. The iterative nature of design-build-test-learn (DBTL) cycles, coupled with the vastness of sequence space, demands robust strategies for computational efficiency and campaign management. This document outlines practical Application Notes and Protocols to optimize large-scale in silico design workflows, ensuring scalable and productive research for therapeutic and industrial enzyme development.

Application Notes: Core Strategies

Note 1: Hierarchical Sequence Sampling and Filtering. Directly sampling millions of sequences from ProteinMPNN is computationally expensive and yields redundant data. A hierarchical filtering pipeline prioritizes diversity and predicted quality.

Note 2: Leveraging Distributed Computing for Ensemble Scoring. Reliability increases with ensemble methods (e.g., using multiple models or scoring functions). Implementing these as parallel, rather than serial, jobs drastically reduces wall-clock time.

Note 3: Centralized Campaign Metadata Tracking. A large-scale campaign involves thousands of designs across multiple targets and iterations. A centralized database is critical for tracking design parameters, scores, and experimental outcomes, enabling data-driven iteration.

Experimental Protocols

Protocol 1: Efficient Multi-Target Design Pipeline with Pre-Filtering

Objective: Generate and prioritize diverse, high-confidence enzyme designs for multiple structural scaffolds in a single campaign.

Methodology:

  • Input Preparation: For each target backbone (scaffold), prepare a cleaned PDB file and define the mutable positions (e.g., active site residues + first/second shell).
  • Coarse-Grained Sampling: Run ProteinMPNN with a high sampling temperature (T=0.3) and num_seq_per_target set to 50,000-100,000. Use the --batch_size flag optimized for your GPU memory (typically 8-16) for speed.
  • Primary Filtering: Apply a rapid, coarse filter to the raw sequences. This typically involves removing sequences with non-canonical amino acids and those with extreme electrostatic or hydrophobic patches (calculated via simple biophysical calculators).
  • Ensemble Scoring: Pass the filtered set (~5-10% of initial sample) through a parallelized scoring ensemble. Each scoring node runs independently on a cloud or cluster instance.
    • Node A: ProteinMPNN per-residue confidence (negative log likelihood).
    • Node B: AlphaFold2 or ESMFold for predicted TM-score to scaffold and pLDDT.
    • Node C: Rosetta ddG for approximate folding stability.
    • Node D: Aggregation propensity prediction (e.g., using CamSol).
  • Ranking & Selection: Consolidate scores from all nodes. Apply a weighted composite score (see Table 1) to rank designs. Select top N designs per scaffold for experimental testing, ensuring sequence diversity.

Protocol 2: Iterative Campaign Management with a Structured Database

Objective: Systematically track and learn from experimental results to inform subsequent design rounds.

Methodology:

  • Schema Creation: Establish a SQL or NoSQL database with linked tables for:
    • Designs: Unique design ID, target scaffold, sequence, generation parameters (T, chain breaks), computational scores (links to results table).
    • Scores: Primary key, design ID (foreign key), score type (e.g., AF2_pLDDT, Rosetta_ddG), value.
    • Experiments: Experiment ID, design IDs tested, expression yield, activity measurement, thermal stability, etc.
    • Campaign_Rounds: Round ID, date, design criteria used, summary outcomes.
  • Data Integration: Automate the ingestion of computational output files (JSON, CSV) into the Designs and Scores tables. Manually or via lab informatics systems, link experimental results.
  • Analysis for Iteration: After each experimental round, query the database to correlate computational scores with experimental success (e.g., "What was the average pLDDT of active vs. inactive designs?"). Use these insights to adjust the weighting in the composite score (Table 1) for the next round.
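
The schema and the example query above can be prototyped with SQLite. This is a minimal sketch rather than a production schema: table and column names are illustrative, the inserted rows are hypothetical, and each experiment row is simplified to reference a single design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaign_rounds (round_id INTEGER PRIMARY KEY, run_date TEXT, criteria TEXT);
CREATE TABLE designs (
    design_id INTEGER PRIMARY KEY, round_id INTEGER REFERENCES campaign_rounds(round_id),
    scaffold TEXT, sequence TEXT, sampling_temp REAL);
CREATE TABLE scores (
    score_id INTEGER PRIMARY KEY, design_id INTEGER REFERENCES designs(design_id),
    score_type TEXT, value REAL);
CREATE TABLE experiments (
    experiment_id INTEGER PRIMARY KEY, design_id INTEGER REFERENCES designs(design_id),
    expression_mg_per_l REAL, active INTEGER, tm_celsius REAL);
""")

# Hypothetical records for one design round.
conn.execute("INSERT INTO campaign_rounds VALUES (1, '2025-01-01', 'pLDDT > 80')")
conn.executemany("INSERT INTO designs VALUES (?, 1, 'scaffold_1', ?, 0.3)",
                 [(1, "MKV"), (2, "MRL"), (3, "MAL")])
conn.executemany("INSERT INTO scores (design_id, score_type, value) VALUES (?, 'AF2_pLDDT', ?)",
                 [(1, 92.0), (2, 74.0), (3, 88.0)])
conn.executemany("INSERT INTO experiments (design_id, expression_mg_per_l, active, tm_celsius) "
                 "VALUES (?, ?, ?, ?)", [(1, 12.0, 1, 61.0), (2, 1.5, 0, 42.0), (3, 8.0, 1, 55.0)])

# Average pLDDT of inactive (0) versus active (1) designs.
rows = conn.execute("""
    SELECT e.active, AVG(s.value) FROM experiments e
    JOIN scores s ON s.design_id = e.design_id
    WHERE s.score_type = 'AF2_pLDDT' GROUP BY e.active ORDER BY e.active
""").fetchall()
```

Storing scores in a long format (one row per design and score type) lets new scoring tools be added in later rounds without changing the schema.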

Data Presentation

Table 1: Example Weighted Composite Scoring Schema for Design Prioritization

Scoring Metric | Tool/Model | Weight (%) | Rationale for Weight | Target Threshold
Sequence Confidence | ProteinMPNN NLL | 20 | High confidence in backbone compatibility. | NLL < 1.5
Structure Fold | AlphaFold2 pLDDT | 30 | Confidence in design folding into target scaffold. | pLDDT > 80
Stability | Rosetta ddG | 25 | Estimated folding free energy change. | ddG < 0
Solubility | CamSol Intrinsic Score | 15 | Low predicted aggregation propensity. | Score > 0
Sequence Diversity | Hamming Distance | 10 | Ensures broad coverage of sequence space. | >20% diff. from others
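
One way to turn the weights in Table 1 into a single ranking is sketched below. The min-max normalization across the candidate pool and the metric key names are assumptions made for illustration (the table specifies only weights and thresholds), and the design values are hypothetical.

```python
WEIGHTS = {"mpnn_nll": 0.20, "plddt": 0.30, "ddg": 0.25, "camsol": 0.15, "diversity": 0.10}
LOWER_IS_BETTER = {"mpnn_nll", "ddg"}

def composite_scores(designs):
    """Min-max normalise each metric across the pool, then take the weighted sum (0-1)."""
    scores = {name: 0.0 for name in designs}
    for metric, weight in WEIGHTS.items():
        values = [d[metric] for d in designs.values()]
        lo, hi = min(values), max(values)
        for name, d in designs.items():
            norm = 0.5 if hi == lo else (d[metric] - lo) / (hi - lo)
            if metric in LOWER_IS_BETTER:
                norm = 1.0 - norm  # flip so that 1.0 is always the best value
            scores[name] += weight * norm
    return scores

# Hypothetical metric values for three designs on one scaffold.
pool = {
    "design_1": {"mpnn_nll": 1.1, "plddt": 91.0, "ddg": -3.0, "camsol": 1.2, "diversity": 0.35},
    "design_2": {"mpnn_nll": 1.6, "plddt": 78.0, "ddg": 1.5, "camsol": -0.4, "diversity": 0.20},
    "design_3": {"mpnn_nll": 1.3, "plddt": 86.0, "ddg": -1.0, "camsol": 0.5, "diversity": 0.50},
}
composite = composite_scores(pool)
ranked = sorted(composite.items(), key=lambda kv: kv[1], reverse=True)
```

Because the weights live in a single dictionary, they can be adjusted between rounds as experimental results accumulate.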

Table 2: Computational Time Savings from Parallel Ensemble Scoring

Step | Monolithic Serial (hr) | Distributed Parallel (hr) | Efficiency Gain
Score 10,000 seqs with 4 tools | ~40 (10 hrs per tool) | ~12 (Max time of any single tool) | 3.3x faster
Data consolidation & ranking | 2 | 1 | 2x faster (parallel parsing)
Total Time | ~42 | ~13 | ~3.2x faster

Visualizations

Figure: Large-Scale Enzyme Design & Learning Pipeline. Target backbones (PDB files) → ProteinMPNN coarse sampling (high T, 100k sequences) → primary filter (biophysical checks) → parallel ensemble scoring cluster (Node A: MPNN confidence; Node B: AF2 structure; Node C: Rosetta ddG; Node D: solubility) → score aggregation and weighted ranking → top diverse designs for experimental testing. Experimental results are stored in the central campaign database, which returns adjusted weights to the sampling step.

The Scientist's Toolkit: Research Reagent Solutions

Item | Function in Workflow
ProteinMPNN (v1.1+) | Core sequence design engine. Provides sequences and per-residue log-likelihoods for backbone compatibility.
AlphaFold2 (Local ColabFold) | Rapid (minutes) structure prediction for designed sequences to verify fold and confidence (pLDDT).
PyRosetta | For calculating detailed biophysical metrics like folding energy (ddG), crucial for stability screening.
Slurm / Kubernetes Cluster | Orchestration platform for managing thousands of parallel scoring jobs across CPU/GPU nodes.
SQLite/PostgreSQL Database | Lightweight or robust system for storing all design metadata, scores, and experimental data.
Jupyter / Python Pipelines | For creating reproducible, modular scripts that chain ProteinMPNN, filters, and analysis steps.
CamSol or Aggrescan3D | In-silico tool for predicting solubility and aggregation propensity, a key failure mode for enzymes.

Application Notes

Within a broader thesis on de novo enzyme design, this protocol presents a cyclic framework integrating ProteinMPNN for sequence design with AlphaFold2 or RoseTTAFold for structural validation. This iterative refinement mitigates the "inverse folding" problem by closing the loop between sequence space and structural fidelity, a critical step for generating functional enzymes.

Core Hypothesis: Repeated cycles of sequence design followed by structural validation and filtering will converge on sequences that not only adopt the target backbone but also exhibit native-like structural features and potential for catalytic function.

Key Quantitative Insights from Recent Studies (2023-2024):

Metric | Initial ProteinMPNN Single-Pass Design | After 2-3 Iterative Cycles (with Validation) | Measurement Method & Notes
AF2 pLDDT | 75-85 (often with localized low confidence) | 85-95 (more uniform high confidence) | AlphaFold2 predicted lDDT. >90 is high confidence.
TM-score to Target | 0.85-0.95 | 0.92-0.98 | Template Modeling score. >0.5 generally indicates the same fold; >0.9 indicates a close match to the target.
Experimental Success Rate (Solubility/Fold) | ~20-40% | Can increase to 50-70%* | *Based on limited cycle studies; dependent on target complexity.
Sequence Recovery from Native | N/A (de novo design) | N/A | Iteration explores novel sequence space, not recovery.
Predicted ΔΔG (Stability) | Variable, often near native | More consistently negative (stable) | Calculated via tools like FoldX or Rosetta.
Cycle Duration (Typical) | N/A | 24-48 hours per cycle | For a single target on a modern GPU cluster.

Advantages: This approach incrementally optimizes for fold stability, can incorporate functional site constraints (e.g., catalytic triads), and filters out non-robust designs early.

Challenges: Computational cost increases linearly with the number of cycles. There is a risk of converging on a local minimum in sequence space if diversity is not maintained. Clear stopping criteria are required.

Detailed Protocol: Iterative Design-Validation Cycle

Materials & Reagents (Research Toolkit)

Item | Function & Specification
Target Backbone Structure | PDB file of the de novo designed scaffold or natural enzyme backbone for re-design.
ProteinMPNN (v1.1 or later) | Neural network for fixed-backbone sequence design. Used via official GitHub repository.
AlphaFold2 (v2.3+ or ColabFold) | Protein structure prediction for validation. Local installation or MMseqs2/API for speed.
PyMOL, ChimeraX, or VMD | For structural alignment, visualization, and analysis.
FoldX Suite (v5.0+) | For rapid computational assessment of protein stability (ΔΔG calculation).
Python Scripting Environment (Python 3.8+, Biopython, NumPy, pandas) | For automating analysis and pipeline control.
High-Performance Computing (HPC) Cluster | With GPUs (NVIDIA A100/V100) for running ProteinMPNN and AlphaFold2 efficiently.

Protocol Steps

Cycle 0: Initialization

  • Prepare Input: Define your target backbone (target.pdb). Clean the file (remove heteroatoms, ensure standard atom names).
  • Set Design Parameters: Identify fixed positions (e.g., catalytic residues, binding site motifs). Define sequence constraints (e.g., amino acid alphabet for specific positions).
  • Initial Sequence Design (ProteinMPNN):
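
A minimal sketch of how such an initial design run could be launched from Python, assuming the reference repository's protein_mpnn_run.py script is in the working directory. The output folder name, the pool size of 500, the starting temperature of 0.1, and the seed are illustrative assumptions rather than values prescribed by this protocol.

```python
import shlex

def mpnn_command(pdb_path, out_folder, num_sequences=500, sampling_temp=0.1, seed=37):
    """Build the argument list for a fixed-backbone design run (not executed here)."""
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", pdb_path,
        "--out_folder", out_folder,
        "--num_seq_per_target", str(num_sequences),
        "--sampling_temp", str(sampling_temp),
        "--seed", str(seed),
        "--batch_size", "1",
    ]

cmd = mpnn_command("target.pdb", "cycle_0")
print(shlex.join(cmd))  # inspect the command before running it
```

The list can be passed to subprocess.run(cmd, check=True) once the paths are valid; fixed catalytic positions would be supplied through an additional --fixed_positions_jsonl argument.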

Iterative Core Loop (Cycles 1-N)

  • Sequence Selection for Validation:
    • From the previous cycle's output, select the top 50-100 sequences by ProteinMPNN score.
    • Optional Diversity Filter: Cluster sequences by Hamming distance to select a sequence-diverse subset (e.g., 20-30 sequences).
  • Structural Validation (AlphaFold2):
    • Predict structures for each selected sequence using a fast multimer model or ColabFold.

  • Analysis & Filtering:
    • Align & Score: Superimpose each predicted structure (predicted.pdb) onto the target backbone (target.pdb) using TM-score or RMSD.
    • Calculate Metrics: For each design, record: (i) Average pLDDT, (ii) TM-score to target, (iii) RMSD of fixed functional residues, (iv) FoldX ΔΔG.
    • Apply Filters: Discard designs failing thresholds (e.g., pLDDT < 85, TM-score < 0.9, catalytic residue RMSD > 1.0 Å).
  • Input for Next Cycle:
    • Option A (Backbone Update): Use the highest-scoring predicted structure as the new backbone for the next MPNN run. This allows backbone flexibility.
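    • Option B (Fixed Target): Return to the original target.pdb and carry information forward from the filtered, high-scoring sequences. The reference ProteinMPNN code has no option to seed sampling with an initial sequence, so in practice this means biasing positions toward residues enriched among the passing designs (e.g., with --bias_by_res_jsonl).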
  • Run ProteinMPNN Again: Execute design with the new backbone or initial sequences, generating a new set of candidate sequences. Increase sampling_temp slightly (e.g., to 0.15) in later cycles to explore broader sequence space if stagnation is detected.
  • Check Stopping Criteria: Proceed to next cycle unless:
    • Average pLDDT and TM-score plateau over 2 cycles.
    • A predefined number of cycles (e.g., 4-5) is completed.
    • A desired number of designs pass all filters (e.g., 10-20 high-confidence designs).
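
The filtering thresholds and stopping criteria of the loop can be expressed compactly. In this sketch the dictionary keys and the plateau tolerance are assumptions, the example values are hypothetical, and only mean pLDDT is tracked for the plateau check (the TM-score could be handled the same way).

```python
def passes_filters(design, min_plddt=85.0, min_tm=0.9, max_site_rmsd=1.0):
    """Apply the pLDDT, TM-score, and functional-site RMSD thresholds from the protocol."""
    return (design["plddt"] >= min_plddt
            and design["tm_score"] >= min_tm
            and design["site_rmsd"] <= max_site_rmsd)

def should_stop(cycle_means, n_passing, max_cycles=5, target_passing=10, tolerance=0.5):
    """Stop on a pLDDT plateau over two cycles, the cycle cap, or enough passing designs."""
    plateau = (len(cycle_means) >= 3
               and all(abs(cycle_means[-k] - cycle_means[-k - 1]) < tolerance for k in (1, 2)))
    return plateau or len(cycle_means) >= max_cycles or n_passing >= target_passing

# Hypothetical metrics for three designs from one cycle.
designs = [
    {"name": "d1", "plddt": 91.2, "tm_score": 0.95, "site_rmsd": 0.6},
    {"name": "d2", "plddt": 83.0, "tm_score": 0.93, "site_rmsd": 0.5},
    {"name": "d3", "plddt": 88.5, "tm_score": 0.91, "site_rmsd": 1.4},
]
kept = [d["name"] for d in designs if passes_filters(d)]

stop_on_plateau = should_stop([82.0, 88.1, 88.3, 88.4], n_passing=1)
keep_going = should_stop([82.0, 88.1], n_passing=1)
```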

Post-Cycle Analysis

  • Final Selection: From the final cycle's filtered pool, select 5-10 top designs for in vitro testing.
  • Characterization: Proceed with gene synthesis, expression, purification, and experimental validation of solubility, folding (CD), binding where relevant (SPR), and enzymatic activity.

Visualizations

Diagram 1: Iterative Refinement Workflow for Enzyme Design. Target backbone (PDB) → ProteinMPNN sequence design → sequence pool (500 variants) → selection and diversity filter → AlphaFold2 structural validation → analysis and filtering (pLDDT, TM-score, ΔΔG) → stopping criteria met? If no, return to ProteinMPNN for the next cycle; if yes, final designs proceed to experimental testing.

Diagram 2: Analysis & Filtering Decision Logic. Predicted structure (PDB + pLDDT JSON) → align to target (TM-score/RMSD) → calculate metrics → three sequential checks: pLDDT > 85, TM-score > 0.9, functional site RMSD < 1.0 Å. A design failing any check is rejected; a design passing all three moves to the next stage.

Within the broader thesis on ProteinMPNN for de novo enzyme sequence design, the integration of external evolutionary data is a critical frontier. While ProteinMPNN provides a powerful, fast, and robust foundation for sequence design given a fixed scaffold, its default formulation is agnostic to specific functional constraints beyond foldability. Incorporating evolutionary coupling (EC) information or pre-computed fitness landscapes from deep mutational scanning (DMS) directly into the design process can bias sampling toward sequences that are not only stable but also functionally competent. This application note details protocols for integrating these two primary types of external data to design enzyme sequences with enhanced probability of catalytic activity.

Table 1: Comparison of External Data Types for Integration

Data Type | Source | Typical Volume | Information Content | Primary Use in Design
Evolutionary Coupling (EC) | Multiple Sequence Alignments (MSA) of protein families (e.g., from UniRef, Pfam). | 1e3 - 1e6 sequences | Pairwise co-evolution signals identifying functionally or structurally coupled residues. | To constrain residue pair choices, maintaining functional residue networks.
Fitness Landscape (DMS) | Deep Mutational Scanning experiments on a specific parent enzyme. | 1e4 - 1e5 variants | Experimental fitness (e.g., activity, stability) score for single and sometimes multiple mutants. | To bias sampling toward variants with high experimental fitness scores.

Table 2: Impact of Data Integration on Design Outcomes (Hypothetical Performance)

Design Strategy | Success Rate (Foldability) | Success Rate (Function) | Computational Overhead | Data Dependency
ProteinMPNN (Baseline) | >90% (estimated) | Variable, context-dependent | Low | None (structure only)
ProteinMPNN + EC Potentials | ~85-90% | Increased for function-linked folds | Moderate | Requires large, quality MSA
ProteinMPNN + DMS Landscape | ~90% | Significantly increased for proximal mutations | Low-Moderate | Requires target-specific DMS data

Experimental Protocols

Protocol 1: Integrating Evolutionary Coupling Potentials with ProteinMPNN

Objective: To bias ProteinMPNN's sequence sampling toward residue pairs identified as co-evolving in a natural protein family.

Materials & Reagents: See The Scientist's Toolkit (Table 3).

Procedure:

  • Generate MSA & EC Scores:
    • Using the sequence of the target scaffold, query a large sequence database (e.g., UniRef100) with HHblits or JackHMMER to build an MSA.
    • Process the MSA with a direct coupling analysis tool (e.g., plmDCA, GREMLIN, EVcouplings) to generate a matrix of evolutionary coupling scores $J_{ij}(A,B)$ for all residue pairs (i,j) and amino acid pairs (A,B).
  • Format EC Data as a Potentials File:
    • Convert the EC scores into a per-position, per-residue energy term. A simple formulation is $E_{\mathrm{EC}}(i,A) = -\sum_{j \neq i} \max_{B} J_{ij}(A,B)$, which sums, over all partner positions j, the strongest coupling available to residue A at position i.
    • Format this into a .json or .npy file readable by ProteinMPNN's external potentials interface. The file should contain a weight for each residue type at each position in the protein chain.
  • Run ProteinMPNN with External Potentials:
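    • Use the --use_external_potentials flag in the ProteinMPNN command line interface. This flag and the two options below assume a custom build with external-potentials support (see Table 3); they are not part of the reference release, which exposes a comparable per-position, per-residue bias through --bias_by_res_jsonl.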
    • Specify the path to the potentials file via --external_potentials_path.
    • Adjust the strength of the EC bias using the --external_potentials_scale parameter (requires empirical tuning, start with 0.5-2.0).
  • Output Analysis:
    • ProteinMPNN will generate sequences as usual, but the log probabilities will be influenced by the EC potentials.
    • Validate designed sequences by comparing the frequency of recovered evolutionarily coupled pairs versus baseline designs.
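
The per-position energy term from the formatting step can be computed with a few nested loops. In this sketch the couplings are held in a plain dictionary keyed by position pair and residue pair, missing entries are treated as zero coupling, and a two-letter alphabet keeps the toy example readable; all of these are illustrative assumptions, and real EC tools export their own file formats.

```python
def ec_potential(couplings, n_positions, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """E_EC(i, A) = -sum over j != i of max over B of J_ij(A, B)."""
    potential = {}
    for i in range(n_positions):
        for a in alphabet:
            total = 0.0
            for j in range(n_positions):
                if j == i:
                    continue
                # Each pair is stored once; look it up in either orientation.
                if (i, j) in couplings:
                    row = [couplings[(i, j)].get((a, b), 0.0) for b in alphabet]
                elif (j, i) in couplings:
                    row = [couplings[(j, i)].get((b, a), 0.0) for b in alphabet]
                else:
                    continue
                total += max(row)
            potential[(i, a)] = -total
    return potential

# Hypothetical couplings: couplings[(i, j)][(A, B)] = J_ij with A at position i, B at j.
toy_couplings = {
    (0, 1): {("A", "D"): 0.8, ("D", "A"): 0.1},
    (1, 2): {("D", "D"): 0.3},
}
energy = ec_potential(toy_couplings, n_positions=3, alphabet="AD")
```

Lower (more negative) energies mark residue choices that participate in strong couplings, so position 1 is pushed toward D in this toy example.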

Protocol 2: Integrating Fitness Landscapes from DMS Data

Objective: To steer ProteinMPNN toward sequences that have high experimental fitness scores from a deep mutational scan.

Procedure:

  • Process DMS Data:
    • Obtain a dataset mapping single (or double) mutants of a parent sequence to a functional score (e.g., enzyme activity, fluorescence).
    • Normalize scores to a common range (e.g., 0 to 1). Impute missing values using a simple neighborhood average or a more sophisticated Gaussian Process regression model.
  • Construct a Potentials File:
    • For a single-mutant landscape, create a potential where $E_{\mathrm{DMS}}(i,A) = -\log P(A \mid i)$, where $P(A \mid i)$ is derived from the normalized fitness of mutant A at position i.
    • For a pairwise landscape, the formulation is more complex. One can approximate by summing single mutant effects and adding an interaction term if data is available: $E_{\mathrm{DMS}}(i,A,j,B) = -\log P(A \mid i) - \log P(B \mid j) - \log P(A,B \mid i,j)$.
    • Format this energy table into the required external potentials file.
  • Run ProteinMPNN with DMS Potentials:
    • Similar to Protocol 1, use the --use_external_potentials flag and specify the DMS-derived potentials file.
    • Tuning the --external_potentials_scale is crucial. A high weight may overly restrict diversity.
  • Validation:
    • The designed sequences should be enriched for high-fitness single mutations.
    • In silico fitness prediction of the designed sequences using the DMS-derived model should show higher scores than baseline designs.
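
The single-mutant potential and its effect on sampling can be sketched at one position. The conversion from normalized fitness to P(A|i) (normalizing within the position with a small pseudocount) and the exponential reweighting are assumptions chosen for illustration, and the fitness values and model probabilities are hypothetical.

```python
import math

def dms_potential(fitness_at_position, pseudocount=0.01):
    """E_DMS(i, A) = -log P(A | i), with P obtained by normalising fitness at one position."""
    shifted = {aa: max(f, 0.0) + pseudocount for aa, f in fitness_at_position.items()}
    total = sum(shifted.values())
    return {aa: -math.log(v / total) for aa, v in shifted.items()}

def apply_bias(model_probs, energies, scale=1.0):
    """Reweight model probabilities by exp(-scale * E) and renormalise."""
    weights = {aa: p * math.exp(-scale * energies[aa]) for aa, p in model_probs.items()}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}

# Hypothetical normalised fitness (0-1) and model probabilities at a single position.
fitness = {"A": 0.9, "S": 0.5, "P": 0.0}
mpnn_probs = {"A": 0.3, "S": 0.3, "P": 0.4}
dms_energy = dms_potential(fitness)
biased = apply_bias(mpnn_probs, dms_energy, scale=1.0)
```

The pseudocount keeps the energy finite for variants with zero measured fitness; raising the scale concentrates sampling on high-fitness residues at the cost of diversity, which is the trade-off noted in the tuning step.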

Visualization & Workflow Diagrams

Figure: Data Integration Workflow for ProteinMPNN. A multiple sequence alignment (MSA) feeds evolutionary coupling analysis, and deep mutational scan (DMS) data feed a fitness landscape model; both are converted into an external potentials file (.json/.npy) that is passed to ProteinMPNN with --use_external_potentials to produce designed sequences.

Figure: ProteinMPNN Sampling with External Bias. The ProteinMPNN core neural network supplies log P(sequence | structure) and the external potentials supply w * E_ext(sequence); combined scoring passes the total score to the sequence sampler, which emits the final sequence sample.

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions for Integration Protocols

Item / Reagent | Function in Protocol | Key Considerations
High-Quality MSA Databases (UniRef, MGnify) | Source for evolutionary sequence information to compute couplings. | Depth and diversity of the MSA are critical for accurate EC inference.
DMS Raw Data Pipeline (e.g., Enrich2, DiMSum) | To process next-generation sequencing counts from selection experiments into variant fitness scores. | Normalization and error correction are essential for a reliable landscape.
EC Inference Software (plmDCA, EVcouplings) | Computes pairwise evolutionary coupling scores from an MSA. | Regularization parameters must be tuned to avoid false positives.
ProteinMPNN (Custom Build) | The core sequence design engine, modified to add external potentials support. | Ensure compatibility between potential file format and code version.
In-silico Fitness Predictor (e.g., ESM-1v, Tranception) | For preliminary ranking of designed sequences before synthesis. | Provides a useful orthogonal validation to the integrated potentials.
Gene Synthesis Service | To physically realize the designed enzyme sequences for experimental testing. | Long turnaround time; design batches should be comprehensive.

Benchmarking ProteinMPNN: Performance, Validation, and Comparison to Alternative Tools

Introduction

Within a broader thesis on ProteinMPNN for de novo enzyme sequence design, the transition from in silico generation to in vitro characterization is critical. This document provides detailed Application Notes and Protocols for a validation framework that rigorously assesses computationally designed enzymes, ensuring robust characterization of their catalytic function, kinetics, and stability.

1. Application Notes: A Tiered Validation Cascade

Designed sequences from ProteinMPNN must pass through a tiered experimental cascade to filter non-functional designs and characterize promising candidates. Quantitative data from each tier are synthesized for decision-making.

Table 1: Tiered Validation Cascade with Key Metrics and Success Criteria

Validation Tier | Primary Objective | Key Quantitative Metrics | Typical Success Criteria | Estimated Duration
Tier 1: Expression & Solubility | Assess protein production in E. coli. | Soluble yield (mg/L), Purity (%). | >5 mg/L soluble protein, >70% purity. | 3-5 days
Tier 2: Initial Activity Screen | Confirm baseline catalytic function. | Relative Activity (%), Specific Activity (U/mg). | >1% activity vs. native enzyme; detectable signal. | 1 day
Tier 3: Comprehensive Kinetics | Determine catalytic efficiency and substrate affinity. | kcat (s⁻¹), KM (mM), kcat/KM (M⁻¹s⁻¹). | kcat/KM > 10² M⁻¹s⁻¹. | 2-3 days
Tier 4: Biophysical Profiling | Evaluate structural integrity and stability. | Tm (°C), Aggregation Onset Temp (°C). | Tm > 45°C; consistent with design model. | 1-2 days

2. Detailed Experimental Protocols

Protocol 2.1: High-Throughput Expression & Solubility Analysis (Tier 1)

  • Objective: Rapid assessment of protein expression and solubility in a 96-well deep-well block format.
  • Materials: See The Scientist's Toolkit.
  • Method:
    • Transform BL21(DE3) E. coli with ProteinMPNN-designed sequences in pET vectors. Pick colonies into 1 mL TB media with antibiotic in 96-deep-well blocks.
    • Grow at 37°C, 800 rpm to OD600 ~0.6-0.8. Induce with 0.5 mM IPTG. Express for 18-20 hours at 18°C.
    • Harvest cells by centrifugation (4000 x g, 15 min). Lyse using BugBuster reagent (200 µL per well) with benzonase and lysozyme, shaking for 20 min.
    • Clarify lysate by centrifugation (4000 x g, 30 min). Separate supernatant (soluble fraction) and pellet (insoluble fraction).
    • Analyze samples by SDS-PAGE. Quantify soluble yield via Bradford assay or A280 measurement after His-tag purification using Ni-NTA spin plates.

Protocol 2.2: Microplate-Based Initial Activity Screen (Tier 2)

  • Objective: Identify constructs with detectable catalytic activity using a continuous spectrophotometric or fluorometric assay.
  • Method:
    • Use clarified lysates or purified protein from Tier 1. Normalize protein concentration to 0.1 mg/mL in assay buffer.
    • In a 96- or 384-well plate, combine 80 µL of substrate solution (at saturating concentration, ~10x estimated KM) with 20 µL of enzyme solution.
    • Immediately initiate measurement in a plate reader. Monitor product formation (e.g., absorbance change at appropriate λ) for 10 minutes at 25°C.
    • Calculate initial velocity (v0) from the linear slope. Report as Specific Activity (µmol product/min/mg enzyme) or as % activity relative to a wild-type control.
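
The conversion from a plate-reader slope to Specific Activity can be sketched as follows. This is a minimal illustration assuming a chromogenic product that obeys the Beer-Lambert law; the function names and the example extinction coefficient, path length, and slope are hypothetical placeholders, not values from this protocol.

```python
def specific_activity(slope_abs_per_min, epsilon_per_m_cm, path_cm,
                      well_volume_l, enzyme_mg):
    """Return specific activity in umol product / min / mg enzyme (U/mg)."""
    # Beer-Lambert: dA/dt = epsilon * path * dC/dt, so dC/dt is in mol/L/min.
    rate_molar_per_min = slope_abs_per_min / (epsilon_per_m_cm * path_cm)
    # Scale the concentration rate by the well volume and convert mol to umol.
    umol_per_min = rate_molar_per_min * well_volume_l * 1e6
    return umol_per_min / enzyme_mg


def relative_activity(design_activity, reference_activity):
    """Return design activity as a percentage of a reference (e.g., wild type)."""
    return 100.0 * design_activity / reference_activity


# Hypothetical example: 20 uL of 0.1 mg/mL enzyme (0.002 mg) in a 100 uL well.
activity = specific_activity(
    slope_abs_per_min=0.054,    # assumed linear slope from the plate reader
    epsilon_per_m_cm=18000.0,   # assumed extinction coefficient of the product
    path_cm=0.3,                # assumed effective path length of the well
    well_volume_l=100e-6,
    enzyme_mg=0.002,
)
print(f"Specific activity: {activity:.2f} U/mg")
```

The effective path length depends on plate type and fill volume, so it is worth measuring it or using the reader's path-length correction rather than trusting the placeholder above.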

Protocol 2.3: Steady-State Kinetic Analysis (Tier 3)

  • Objective: Determine precise kinetic parameters for promising hits.
  • Method:
    • Purify candidate enzymes using FPLC (e.g., Ni-IMAC, size exclusion).
    • Prepare a minimum of 8 substrate concentrations spanning a range below and above the expected KM.
    • For each [S], measure initial velocity (v0) in triplicate using a high-precision spectrophotometer.
    • Fit the data (v0 vs. [S]) to the Michaelis-Menten equation (v0 = (Vmax[S])/(KM+[S])) using nonlinear regression (e.g., GraphPad Prism).
    • Calculate kcat = Vmax / [Etotal].
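
As a quick cross-check on the fitted parameters, the sketch below estimates Vmax and KM with a Hanes-Woolf linearization ([S]/v0 = [S]/Vmax + KM/Vmax) using only the standard library. This is a swapped-in technique, not the nonlinear regression the protocol calls for; linearizations distort the error structure of noisy data, so treat the result as a starting guess. Function names and the synthetic data are illustrative assumptions.

```python
def fit_hanes_woolf(substrate, velocity):
    """Estimate (Vmax, KM) from a least-squares line through [S]/v0 vs. [S]."""
    x = list(substrate)
    y = [s / v for s, v in zip(substrate, velocity)]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                    # slope = 1 / Vmax
    intercept = mean_y - slope * mean_x  # intercept = KM / Vmax
    vmax = 1.0 / slope
    km = intercept * vmax
    return vmax, km


def kcat_from_vmax(vmax, enzyme_total):
    """kcat = Vmax / [Etotal]; both must share the same concentration unit."""
    return vmax / enzyme_total


# Synthetic, noise-free data generated with Vmax = 2.0 and KM = 0.5 (arbitrary units).
s_values = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0]
v_values = [2.0 * s / (0.5 + s) for s in s_values]
vmax_est, km_est = fit_hanes_woolf(s_values, v_values)
print(f"Vmax ~ {vmax_est:.3f}, KM ~ {km_est:.3f}")
```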

Protocol 2.4: Differential Scanning Fluorimetry (DSF) for Stability (Tier 4)

  • Objective: Measure thermal stability (Tm) of purified designs.
  • Method:
    • Mix purified protein (0.2 mg/mL, 10 µL) with 5x SYPRO Orange dye (2 µL) in a final volume of 20 µL per well in a qPCR plate.
    • Perform a temperature ramp from 25°C to 95°C at a rate of 1°C/min in a real-time PCR instrument, monitoring fluorescence (ROX channel).
    • Plot fluorescence derivative (-dF/dT) vs. Temperature. The minima of the derivative peaks correspond to melting temperatures (Tm).
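
The derivative analysis can be sketched in a few lines. This is a minimal example assuming a single, clean unfolding transition and evenly sampled temperatures; the function name and the synthetic sigmoid are illustrative, and real traces usually need smoothing before differentiation.

```python
import math


def melting_temperature(temps, fluorescence):
    """Return the temperature at the minimum of -dF/dT (steepest fluorescence rise)."""
    best_temp = None
    best_value = float("inf")
    for i in range(1, len(temps) - 1):
        # Central difference approximation of dF/dT at temps[i].
        dfdt = (fluorescence[i + 1] - fluorescence[i - 1]) / (temps[i + 1] - temps[i - 1])
        if -dfdt < best_value:
            best_value = -dfdt
            best_temp = temps[i]
    return best_temp


# Synthetic melt curve: a sigmoid centered at 60 C, sampled every 1 C from 25 to 95 C.
temps = list(range(25, 96))
curve = [1.0 / (1.0 + math.exp(-(t - 60) / 2.0)) for t in temps]
tm = melting_temperature(temps, curve)
print(f"Estimated Tm: {tm} C")
```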

3. Visualizing the Validation Workflow

[Workflow diagram: In Silico Design (ProteinMPNN) -> Tier 1: Expression & Solubility -> Tier 2: Initial Activity Screen -> Tier 3: Comprehensive Kinetics -> Tier 4: Biophysical Profiling -> Validated Design (Data for Thesis). Designs that are insoluble (Tier 1), inactive (Tier 2), kinetically poor (Tier 3), or unstable (Tier 4) enter a Fail/Re-design Loop that feeds back into ProteinMPNN.]

Diagram Title: Tiered Enzyme Validation Cascade

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Reagents for Enzyme Validation

Item | Function in Validation | Example Product/Catalog
BugBuster HT Protein Extraction Reagent | Detergent-based lysis for high-throughput soluble/insoluble fractionation in 96-well format. | MilliporeSigma, 71456-4
HisPur Ni-NTA Spin Plates | Rapid, small-scale purification of His-tagged proteins for initial activity screening. | Thermo Fisher Scientific, 88226
SYPRO Orange Protein Gel Stain | Fluorescent dye for DSF; binds hydrophobic patches exposed upon protein unfolding. | Thermo Fisher Scientific, S6650
Precision Plus Protein Kaleidoscope Standards | Molecular weight markers for accurate SDS-PAGE analysis of expression and purity. | Bio-Rad, 1610375
Continuous Kinetic Assay Substrates (e.g., pNPP, ONPG) | Chromogenic substrates for hydrolytic enzymes (phosphatases, β-galactosidases) for Tier 2/3 assays. | Thermo Fisher Scientific (pNPP, 34047)
High-Binding 384-Well Clear Microplates | Optimal for low-volume, high-throughput absorbance and fluorescence-based activity assays. | Corning, 3540

Application Notes

In the context of de novo enzyme design using ProteinMPNN, three key performance metrics are critical for evaluating success and guiding research. These metrics directly inform the feasibility and quality of designed sequences for downstream experimental validation.

Sequence Diversity measures the breadth of unique, viable sequences generated for a given protein backbone. High diversity reduces the risk of failure in experimental characterization by exploring a wider region of sequence space. It is typically quantified by calculating the pairwise Hamming distance or sequence similarity (e.g., using BLAST) between all generated sequences in a design run. For enzyme design, optimal diversity balances novelty with the preservation of critical catalytic motifs.

Sequence Recovery evaluates the method's ability to recapitulate known native sequences when provided with their corresponding native backbones. A high recovery rate on native benchmark sets (e.g., CATH or PDB-derived) indicates that the model has learned biologically relevant sequence-structure relationships. This is a proxy for the plausibility of its de novo designs. Recovery is calculated as the percentage of amino acid positions where the designed residue matches the native residue.

Computational Speed is the wall-clock time required to generate a batch of sequences for a given scaffold. Speed is crucial for high-throughput exploration of sequence space and iterative design-test-learn cycles. ProteinMPNN's message-passing graph neural network, which operates on rotation- and translation-invariant backbone features, is far faster than physics-based design with Rosetta, enabling the generation of thousands of designs in minutes.

The interplay of these metrics dictates strategy: high-throughput, low-recovery models can rapidly explore diversity, while high-recovery, slower models may be reserved for final candidate optimization.

Table 1: Benchmark Performance of ProteinMPNN v1.1 (Based on Published Data)

Metric | Typical Reported Value | Benchmark Set | Implication for Enzyme Design
Sequence Recovery | 52.4% - 55.2% | Native protein single chains (PDB) | Strong capture of structural constraints; designed enzymes likely fold into target scaffold.
Perplexity | 7.2 - 8.5 | Native protein single chains (PDB) | Confidence metric; lower values indicate model is more certain of its predictions.
Design Speed | ~200 sequences/sec (for a 100-residue protein on a single GPU) | N/A | Enables massive-scale sampling for exploring diverse catalytic site sequences.
Diversity (Sampling Temperature) | Tunable from 0.1 (low) to 1.0 (high) | De novo scaffolds | Allows controlled exploration: low T for stable cores, high T for innovative active sites.

Table 2: Comparative Analysis of Protein Design Tools

Tool / Method | Sequence Recovery | Computational Speed | Primary Strength
ProteinMPNN | High (~52-55%) | Very High | Fast, high-quality backbone-conditioned sequence design.
Rosetta (FixBB) | Lower (~33% reported in the ProteinMPNN paper's benchmark) | Low | Physics-based, with an explicit all-atom energy function, but computationally expensive.
RFdiffusion + AF2 | N/A (Structure gen.) | Medium | Backbone generation with in silico fold validation; sequence design is handled by a separate tool such as ProteinMPNN.
Autoregressive PLMs (e.g., GPT-Protein) | Medium | Medium | Unconditional generation; less structure-aware.

Experimental Protocols

Protocol 1: Benchmarking Sequence Recovery

Purpose: To assess the accuracy of ProteinMPNN in recapitulating native sequences from their structures.

  • Data Curation: Obtain a non-redundant set of high-resolution protein structures (e.g., from PDB). Common benchmark sets include ~50-100 native single-chain proteins.
  • Input Preparation: For each structure (native.pdb), extract the backbone coordinates (N, Cα, C, O). ProteinMPNN places a virtual Cβ from these atoms itself, so no side-chain atoms are required. This is the input scaffold.
  • Run ProteinMPNN: Run the model on the prepared scaffold to generate a designed sequence for each structure.

  • Analysis: Align the designed sequence (seq0.fasta) to the native sequence from the PDB file. Calculate recovery as: (Number of matching positions / Total length) * 100.
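
The recovery calculation itself is short. The sketch below assumes the designed and native sequences have the same length, as they do for fixed-backbone design, so positions are compared directly without an alignment step; the function name and example sequences are hypothetical.

```python
def sequence_recovery(designed, native):
    """Return the percentage of positions where the designed residue matches the native one."""
    if len(designed) != len(native):
        raise ValueError("Sequences must have equal length for position-wise recovery.")
    if not native:
        raise ValueError("Sequences must not be empty.")
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)


# Hypothetical 10-residue example with two mismatches (positions 3 and 6).
native_seq = "MKTAYIAKQR"
designed_seq = "MKSAYLAKQR"
print(f"Recovery: {sequence_recovery(designed_seq, native_seq):.1f}%")
```

Averaging this value over every structure in the benchmark set gives the recovery figure usually reported for a model.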

Protocol 2: Generating Diverse Sequences for a De Novo Enzyme Scaffold

Purpose: To create a large, diverse set of candidate sequences for a computationally generated or idealized enzyme backbone.

  • Scaffold Preparation: Prepare your target backbone file (scaffold.pdb). Ensure it is clean (no missing atoms, standard formatting).
  • Define Designable Positions: Create a JSON file (pos_list.json) specifying which residues are fixed (e.g., catalytic triad) and which are designable. This focuses diversity on relevant regions.
  • High-Diversity Sampling: Run ProteinMPNN at an elevated sampling temperature to generate 500 sequences for the scaffold.

  • Diversity Assessment: Calculate all-vs-all pairwise sequence identities for the 500 generated sequences. Use needle (EMBOSS) or a custom script. Plot a histogram of pairwise identities. A lower average identity indicates higher diversity.
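
For designs on a single fixed backbone, all sequences share one length, so pairwise identity can be computed position by position without an aligner. The sketch below is one such custom script; the function names, the 10%-wide histogram bins, and the toy sequences are illustrative assumptions.

```python
from itertools import combinations


def pairwise_identity(seq_a, seq_b):
    """Percent identity between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must have equal length.")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def identity_summary(sequences):
    """Return (all pairwise identities, mean identity, 10%-wide histogram counts)."""
    identities = [pairwise_identity(a, b) for a, b in combinations(sequences, 2)]
    mean_identity = sum(identities) / len(identities)
    histogram = [0] * 10
    for value in identities:
        # 100% identity is placed in the last bin.
        histogram[min(int(value // 10), 9)] += 1
    return identities, mean_identity, histogram


# Toy example with three 4-residue sequences.
designs = ["AAAA", "AAAT", "TTTT"]
identities, mean_identity, histogram = identity_summary(designs)
print(f"Mean pairwise identity: {mean_identity:.1f}%")
print("Histogram (0-10%, 10-20%, ...):", histogram)
```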

Protocol 3: Evaluating Computational Speed

Purpose: To benchmark the practical throughput of ProteinMPNN on your hardware.

  • Setup: Prepare benchmark scaffolds of several lengths (e.g., 100, 300, 500 residues).
  • Timed Run: Use the time command in Linux to execute a design run generating 1000 sequences.

  • Calculation: Record the real (wall-clock) time output. Compute speed as (Number of sequences generated) / (Time in seconds). Repeat for different backbone lengths to model scaling.
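
If you prefer to time runs from a script rather than with the shell's time command, a small wrapper like the one below works. This is a sketch: run_fn stands in for whatever launches your design job (for example a subprocess call), and the stand-in workload here is purely a placeholder.

```python
import time


def sequences_per_second(n_sequences, elapsed_seconds):
    """Throughput = number of sequences generated / wall-clock seconds."""
    return n_sequences / elapsed_seconds


def time_design_run(run_fn, n_sequences):
    """Run a callable once and return (elapsed seconds, sequences per second)."""
    start = time.perf_counter()
    run_fn()
    elapsed = time.perf_counter() - start
    return elapsed, sequences_per_second(n_sequences, elapsed)


# Placeholder workload standing in for a real design run of 1000 sequences.
elapsed, rate = time_design_run(lambda: sum(range(100000)), n_sequences=1000)
print(f"Elapsed: {elapsed:.4f} s, throughput: {rate:.0f} sequences/s")
```

Model loading and GPU warm-up dominate short runs, so throughput measured on a small batch tends to understate what a long run achieves.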

Visualizations

[Workflow diagram: Input: Target Protein Backbone -> Define Fixed & Designable Positions -> Set Parameters (temperature T, number of sequences) -> ProteinMPNN Sampling Engine -> Output Pool of Designed Sequences -> Performance Metric Analysis (calculate Sequence Diversity and Sequence Recovery; measure Computational Speed) -> Downstream Validation (AF2, experiment).]

Title: ProteinMPNN Design & Metric Evaluation Workflow

[Relationship diagram: Computational Speed enables Sequence Diversity and contributes throughput toward the goal of Viable Enzymatic Function. Sequence Diversity contributes exploration and Sequence Recovery contributes plausibility toward the same goal. Diversity and Recovery may trade off against each other.]

Title: Interdependence of Key Performance Metrics

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Resources for ProteinMPNN-Based Enzyme Design

Item / Resource | Function / Purpose | Example / Source
ProteinMPNN Software | Core neural network for fast, structure-conditioned sequence design. | GitHub: dauparas/ProteinMPNN
Target Protein Scaffold (.pdb) | The backbone structure for which sequences are designed. Can be natural, de novo (from RFdiffusion, etc.), or idealized. | PDB, RFdiffusion, manual modeling.
AlphaFold2 or RoseTTAFold | Structure prediction tool to validate the fold of designed sequences in silico (post-design validation). | ColabFold, local installation.
PyMOL / ChimeraX | Molecular visualization software for analyzing input scaffolds and output designed models. | Schrödinger, UCSF.
Position Specification (JSON) | File defining which residues are fixed (e.g., catalytic residues, structural staples) and which are free to be redesigned. | Custom-created from scaffold analysis.
High-Performance Computing (GPU) | Accelerates ProteinMPNN inference and subsequent AF2 validation. Critical for high-throughput. | NVIDIA GPU (e.g., A100, V100, RTX 4090).
Sequence Analysis Suite | Tools for calculating diversity (CLUSTAL-Omega, BLAST), recovery, and basic biophysical properties. | EMBOSS, Biopython, local scripts.
Benchmark Dataset | Curated set of native protein structures for evaluating sequence recovery performance of the model. | Commonly used sets from CATH or PDB.

Within the broader thesis that deep learning-based sequence design tools like ProteinMPNN represent a paradigm shift for de novo enzyme engineering, a direct comparison to the established physics-based Rosetta platform is essential. Rosetta has been the gold standard for computational protein design for over two decades, relying on detailed atomic force fields and stochastic sampling. In contrast, ProteinMPNN (Protein Message Passing Neural Network) is a recently developed deep learning method that predicts optimal sequences for a given backbone structure with remarkable speed and sampling efficiency. This application note provides a direct, practical comparison of their operational strengths, weaknesses, and protocols to guide researchers in selecting and applying these tools effectively for enzyme design pipelines.

Table 1: Core Algorithmic and Operational Comparison

Feature | ProteinMPNN | Rosetta (FastDesign/Sequence Tolerance)
Core Paradigm | Supervised deep learning (graph neural network). | Physics-based & knowledge-based energy minimization.
Primary Input | Backbone coordinates (Cα, C, N, O), optional sidechain atoms. | Backbone coordinates (full-atom or centroid).
Sampling Method | Autoregressive decoding in a randomized residue order; the sampling temperature sets how stochastic the output is. | Monte Carlo with simulated annealing; iterative sequence exploration.
Speed | ~200 sequences/second (GPU). | ~1-10 sequences/hour (CPU, depends on length & protocol).
Native Sequence Recovery | High (~52-58% on native protein benchmarks). | Lower on the same kind of benchmark (~33% reported in the ProteinMPNN paper; varies with protocol).
Diversity of Output | Controllable via sampling temperature; can generate high-quality, diverse sequence sets. | Requires explicit steps to encourage diversity; often converges to similar solutions.
Explicit Energy Function | No. Learns statistical preferences from structure. | Yes. Rosetta REF2015 (also written REF15) energy function.
Explicit Sidechain Packing | No. Sequence prediction is independent of packing. | Yes. Integral to the design process (rotamer sampling).
Ease of Incorporating Constraints | Straightforward (masking, fixed positions, chain-specific biases). | Possible but requires protocol scripting (resfile constraints).
Typical Use Case | High-throughput generation of plausible sequences for a fixed backbone. | Detailed design with explicit consideration of physics, flexibility, and binding.

Table 2: Practical Application in Enzyme Design Workflow

Stage | ProteinMPNN Strength | Rosetta Strength
Backbone Scaffolding | Rapidly generate thousands of sequences for many de novo folds or scaffolds. | Can design sequences for non-native backbone conformations with flexible backbone protocols.
Active Site Design | Can seed positions with specific residues; fast exploration of surrounding sequence space. | Superior for precise positioning of functional atoms, protonation states, and transition state stabilization.
Sequence Space Exploration | Unparalleled for generating a broad, high-probability ensemble of candidate sequences. | Better at fine-tuning and optimizing a specific sequence for stability and function.
Experimental Validation Rate | Reports show high experimental stability (~50-80% soluble, folded proteins). | Historically proven, with many successful designs, but often lower stability rates for de novo designs.
Integration with Other Tools | Ideal as a first-pass generator for inputs to AlphaFold2 or MD for validation. | Seamlessly integrates with RosettaDDG for stability assessment, RosettaEnzyme for mechanism.

Detailed Experimental Protocols

Protocol 1: ProteinMPNN for High-Throughput Enzyme Scaffold Sequence Design

Objective: Generate a diverse set of 1000 plausible sequences for a fixed de novo enzyme backbone scaffold.

  • Input Preparation: Prepare a PDB file of the target backbone. Ensure it is cleaned (no ligands, waters, alternative conformations). The file must contain backbone atoms (N, Cα, C, O) at minimum.
  • Configure Design Parameters: Create a simple JSON configuration file.

    Parameters to set: model_name, chain_id_jsonl, fixed_positions_jsonl, out_folder, num_seq_per_target, sampling_temp, batch_size

    • chains.json defines which chains to design.
    • fixed.json specifies catalytically essential residues (e.g., a fixed histidine in the active site).
  • Run ProteinMPNN: Execute via command line.

  • Output Processing: The tool generates a FASTA file (seqs/*.fa) with designed sequences and their log probabilities. Filter sequences by probability and diversity (e.g., cluster at 80% identity).
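
The configuration step above can be sketched in Python as follows. Two things are assumptions to verify against your ProteinMPNN checkout: the fixed-positions layout (one JSON object mapping structure name to chain to a list of 1-indexed positions, as the repository's helper scripts appear to write it), and the use of the parameter names listed above as command-line flags of protein_mpnn_run.py. The file names, the scaffold name, and the catalytic positions are hypothetical. The command is only assembled here, not executed.

```python
import json


def fixed_positions_line(structure_name, fixed_by_chain):
    """Serialize fixed positions as one JSON line: {name: {chain: [positions]}}."""
    payload = {structure_name: {chain: sorted(pos) for chain, pos in fixed_by_chain.items()}}
    return json.dumps(payload)


def build_mpnn_command(jsonl_path, fixed_positions_jsonl, out_folder,
                       num_seq_per_target=1000, sampling_temp=0.2,
                       batch_size=10, chain_id_jsonl=None):
    """Assemble an argument list for protein_mpnn_run.py (flag names assumed from the repo)."""
    cmd = [
        "python", "protein_mpnn_run.py",
        "--jsonl_path", jsonl_path,
        "--fixed_positions_jsonl", fixed_positions_jsonl,
        "--out_folder", out_folder,
        "--num_seq_per_target", str(num_seq_per_target),
        "--sampling_temp", str(sampling_temp),
        "--batch_size", str(batch_size),
    ]
    if chain_id_jsonl is not None:
        cmd += ["--chain_id_jsonl", chain_id_jsonl]
    return cmd


# Hypothetical catalytic residues on chain A of a structure named "scaffold".
line = fixed_positions_line("scaffold", {"A": [195, 57, 102]})
command = build_mpnn_command("parsed.jsonl", "fixed.jsonl", "mpnn_out")
print(line)
print(" ".join(command))
```

Building the argument list as a Python list rather than a shell string makes it straightforward to pass to subprocess.run without quoting problems once the flags have been checked.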

Protocol 2: Rosetta FastDesign for Active Site Optimization

Objective: Optimize the sequence and sidechain conformations around a predefined active site geometry for catalytic activity.

  • Input Preparation: Prepare the starting PDB. Generate a Rosetta resfile to specify design behavior: ALLAA (all amino acids allowed) for designable regions, and NATAA (native amino acid kept, repacking allowed) or NATRO (native amino acid and rotamer kept) for fixed scaffold regions. Define the catalytic residues as NATAA.
  • Define the Task Operations: In the RosettaScripts XML, specify design and packing tasks. Use FastDesign mover with repeated cycles of sidechain packing and gradient-based backbone minimization.

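  • Run RosettaScripts: Execute the XML protocol with the rosetta_scripts application, supplying the input PDB and the resfile.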

  • Analysis: Extract sequences from output PDBs. Analyze using score_jd2 to compare total energy (total_score) and per-residue energy terms. Select lowest-energy models for in silico validation with AlphaFold2 or MD.

Visualization of Workflows

Diagram 1: ProteinMPNN vs. Rosetta Design Flow

[Workflow diagram: Both workflows start from Input: Target Backbone (PDB). ProteinMPNN Workflow: 1. Neural Network Forward Pass -> 2. Direct Sequence Prediction -> 3. High-Throughput Output (FASTA). Rosetta Workflow: 1. Sidechain Packing & Rotamer Selection -> 2. Energy Minimization & Scoring -> 3. Monte Carlo Accept/Reject (repeat from step 1) -> 4. Iterated Optimization (Output PDB). Both end in Downstream Validation (AlphaFold2, MD, experiment).]

Diagram 2: Enzyme Design Pipeline Integration

[Pipeline diagram: AlphaFold2/OmegaFold (validated backbone) -> ProteinMPNN Sequence Generation (1000s of sequences) -> Filter & Cluster (top 100 sequences) -> Rosetta Optimization of active site and stability (top 10 models) -> Molecular Dynamics stability check (final candidates) -> Experimental Characterization.]

The Scientist's Toolkit: Key Research Reagents & Solutions

Table 3: Essential Resources for Comparative Sequence Design

Item | Function in Context | Example/Provider
ProteinMPNN Software | Core deep learning model for sequence design. Available as standalone Python package or via web server. | GitHub: dauparas/ProteinMPNN
Rosetta Software Suite | Comprehensive suite for macromolecular modeling, including the FastDesign and Fixbb protocols. | License required from RosettaCommons.
Pre-processed PDB Files | Cleaned protein structures without heteroatoms, gaps, or alternate conformers, essential for both tools. | Use pdb-tools or the clean_pdb.py script distributed with Rosetta.
Structure Prediction Server | Rapid in silico validation of designed sequences for fold confidence. | ColabFold, AlphaFold2 local, ESMFold.
Molecular Dynamics Engine | Assess stability and dynamics of designed enzymes. | GROMACS, AMBER, OpenMM.
High-Fidelity DNA Synthesis | For transitioning in silico designs to physical constructs for testing. | Twist Bioscience, IDT gBlocks.
Cell-Free Protein Expression Kit | Rapid, small-scale expression screening of dozens of designed variants. | PURExpress (NEB), Cytomim.

Within the broader thesis on leveraging ProteinMPNN for de novo enzyme sequence design, the integration with structure-generation tools like RFdiffusion represents a paradigm shift. This synergistic approach enables the closed-loop, joint optimization of protein sequence and 3D structure. While ProteinMPNN excels at generating thermodynamically favorable sequences for a fixed backbone, RFdiffusion can create novel protein backbones, including functional motifs, de novo. Combining them facilitates an iterative "hallucination" pipeline: RFdiffusion proposes a backbone for a desired function, and ProteinMPNN designs a stable, foldable sequence for it, potentially accelerating the design of novel enzymes and therapeutics.

Table 1: Core Tool Comparison for Joint Design

Feature | ProteinMPNN | RFdiffusion | Integrated Pipeline
Primary Function | Fixed-backbone sequence design | De novo backbone generation | Iterative sequence-structure co-design
Core Architecture | Message-Passing Neural Network | Diffusion probabilistic model (based on RoseTTAFold) | Sequential/cyclic application of both models
Key Input | 3D backbone coordinates (PDB), optional constraints | 1D/2D/3D conditioning (e.g., motif, symmetry, noise) | Functional specification (e.g., catalytic triad, binding site)
Key Output | Optimal amino acid sequences per position | 3D backbone coordinates (no designed side chains outside any supplied motif) | Designed protein (sequence + structure)
Typical Runtime | Seconds to minutes per design | Minutes to hours per generation | Hours to days per design cycle
Success Metric | Recovery rate, sequence diversity, model score (negative log-likelihood) | Structure quality (pLDDT), designability, motif fidelity | Experimental expression, stability, & function

Application Notes: Key Integrated Strategies

Inpainting for Functional Site Design

RFdiffusion can "inpaint" a functional motif (e.g., a catalytic triad) into a novel scaffold. The generated scaffold backbone is then passed to ProteinMPNN to design a sequence that stabilizes both the motif and the overall fold.

Hallucination with Sequence Feedback

RFdiffusion "hallucinates" a backbone by denoising from random noise, either unconditionally or with simple conditioning. Multiple sequences designed by ProteinMPNN are then used to evaluate and filter the generated structures by predicted foldability (e.g., AlphaFold2 pLDDT and RMSD to the generated backbone), creating a feedback loop.

Fixed-Backbone Refinement & Diversification

For a de novo backbone generated by RFdiffusion, ProteinMPNN can generate not one but hundreds of diverse, stable sequences. This creates a "family" of potential sequences for a single structure, enabling screening for expressibility, immunogenicity, or other sequence-based properties.

Detailed Experimental Protocols

Protocol 1: Basic Inpainting Pipeline for Enzyme Active Site Scaffolding

Objective: Embed a known catalytic motif into a novel stable protein scaffold and design a foldable sequence.

Materials: See "The Scientist's Toolkit" below.

Workflow Diagram:

[Workflow diagram: Define Functional Motif (e.g., 3 residue catalytic triad) -> RFdiffusion: Motif-Conditioned Inpainting -> Generate Novel Scaffold Backbone -> Extract Backbone Coords (CA, C, N, O) -> ProteinMPNN: Sequence Design -> Output: Designed Sequence & Scaffold. Generated backbones are also filtered by pLDDT and predicted stability: high-quality designs proceed to the output, low-quality designs return to RFdiffusion for regeneration.]

Title: Inpainting Pipeline for Functional Motif Scaffolding

Steps:

  • Condition Specification: Prepare a PDB file containing the coordinates of your fixed functional motif (e.g., Ser-His-Asp). Define the chain and residue indices to be "fixed" during diffusion.
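  • RFdiffusion Inpainting Run: Run RFdiffusion with the motif residues held fixed so that it generates new scaffold backbones around them.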

  • Backbone Preparation for ProteinMPNN: For each generated PDB, remove all side-chain atoms, keeping only backbone atoms (N, CA, C, O). Ensure correct chain IDs.
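  • ProteinMPNN Sequence Design: Run ProteinMPNN on each backbone-only PDB, keeping the motif residues fixed and designing all other positions.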

  • Validation: Run AlphaFold2 or ESMFold on the designed sequence. Select designs where the predicted structure recovers the original scaffold and motif geometry (high pLDDT, low RMSD on motif).
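
The backbone preparation step can be sketched with plain string handling, relying on the fixed-column PDB format (atom name in columns 13-16). This is a minimal illustration: the function name is hypothetical, only ATOM records are considered, and alternate conformers or non-standard residues are not handled.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def keep_backbone(pdb_text):
    """Return PDB text containing only backbone ATOM records plus TER/END lines."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Columns 13-16 (0-based slice 12:16) hold the atom name.
            if line[12:16].strip() in BACKBONE_ATOMS:
                kept.append(line)
        elif line.startswith(("TER", "END")):
            kept.append(line)
    return "\n".join(kept) + "\n"


# Truncated example records; coordinates are omitted because only the
# record type and atom name columns matter for this filter.
example = "\n".join([
    "ATOM      1  N   SER A   1",
    "ATOM      2  CA  SER A   1",
    "ATOM      3  C   SER A   1",
    "ATOM      4  O   SER A   1",
    "ATOM      5  CB  SER A   1",
    "ATOM      6  OG  SER A   1",
    "HETATM    7 CA    CA A 201",
    "END",
])
backbone_only = keep_backbone(example)
print(backbone_only)
```

Restricting the filter to ATOM records also drops ions such as calcium, whose HETATM atom name CA would otherwise collide with the Cα name.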

Protocol 2: Hallucination with Sequence-Based Filtering

Objective: Generate a fully novel fold and iteratively select the most designable backbone using ProteinMPNN as a filter.

Workflow Diagram:

[Workflow diagram: Define General Constraints (e.g., length, symmetry) -> RFdiffusion: Unconditional/Simple-Condition Hallucination -> Batch of Hallucinated Backbones -> Parallel ProteinMPNN Run (design sequences for each backbone) -> AF2/ESMFold Prediction on Designed Sequences -> Calculate Recovery Metrics (pLDDT, RMSD to hallucination) -> Select Top Designs (High pLDDT, Low RMSD).]

Title: Hallucination Filtered by Sequence Designability

Steps:

  • Hallucination: Use RFdiffusion to generate 50-100 backbone structures with desired properties (e.g., the configuration override 'contigmap.contigs=[100-100]' for a 100-residue monomer).
  • High-Throughput Sequence Design: Use the --jsonl_path flag in ProteinMPNN to run sequence design on all hallucinated backbones in a single job.
  • Fold Prediction: Use a high-throughput structure predictor (e.g., OmegaFold, ColabFold batch) to predict structures for the top 1-2 sequences from each designed backbone.
  • Metric Calculation: For each design, calculate:
    • The average pLDDT of the predicted structure.
    • The backbone RMSD between the RFdiffusion hallucination and the structure predicted from the ProteinMPNN sequence.
  • Selection: Rank designs by high pLDDT (>85) and low RMSD (<2.0 Å). This identifies hallucinated backbones that are inherently "designable" by ProteinMPNN.
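
The selection step can be sketched as a filter followed by a sort. The thresholds are the ones given above (pLDDT > 85, RMSD < 2.0 Å); the DesignMetrics record, its field names, and the example values are hypothetical, and the metrics are assumed to have been computed beforehand by your structure predictor and superposition tool.

```python
from dataclasses import dataclass


@dataclass
class DesignMetrics:
    name: str
    mean_plddt: float      # average pLDDT of the predicted structure
    backbone_rmsd: float   # RMSD (in angstroms) to the generated backbone


def select_designs(designs, min_plddt=85.0, max_rmsd=2.0):
    """Keep designs passing both thresholds, best pLDDT first, then lowest RMSD."""
    passing = [d for d in designs
               if d.mean_plddt > min_plddt and d.backbone_rmsd < max_rmsd]
    return sorted(passing, key=lambda d: (-d.mean_plddt, d.backbone_rmsd))


# Hypothetical metrics for four designs.
candidates = [
    DesignMetrics("design_a", 91.2, 0.8),
    DesignMetrics("design_b", 84.0, 0.5),   # fails the pLDDT threshold
    DesignMetrics("design_c", 88.0, 2.4),   # fails the RMSD threshold
    DesignMetrics("design_d", 95.0, 1.1),
]
selected = select_designs(candidates)
print([d.name for d in selected])
```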

The Scientist's Toolkit: Essential Research Reagents & Materials

Table 2: Key Computational Research Reagents

Item | Function in Integrated Pipeline | Example/Notes
RFdiffusion Software | Generates de novo protein backbones conditioned on user inputs. | Accessed via GitHub; requires specific Conda environment and PyTorch.
ProteinMPNN Software | Designs optimal, foldable sequences for input backbone structures. | v1.0 or later; supports fixed positions and sequence masking.
Structure Prediction Server (Local/Cloud) | Validates designability of ProteinMPNN sequences. | AlphaFold2, ESMFold, ColabFold. Essential for in-silico validation loop.
High-Performance Computing (HPC) Cluster | Runs computationally intensive diffusion and prediction steps. | Requires GPUs (NVIDIA A100/V100) for feasible runtime.
Conda Environment Manager | Isolates complex, version-specific dependencies for each tool. | Critical to manage conflicting library versions (PyTorch, etc.).
Structure Visualization Software | Visualizes generated backbones and designed models. | PyMOL, ChimeraX. For quality control and motif inspection.
Sequence Alignment Tool (e.g., HMMER, HHsuite) | Analyzes designed sequences for novelty or similarity to natural proteins. | Used in post-design bioinformatic analysis.
PDB Manipulation Libraries (Biopython, PyRosetta) | Scripts backbone preparation, analysis, and batch processing. | Automates workflow steps between RFdiffusion and ProteinMPNN.

Application Notes & Protocols

Within the broader thesis research employing ProteinMPNN for de novo enzyme sequence design, the computational generation of novel enzyme sequences necessitates robust, standardized experimental validation. The transition from in silico design to a functional biocatalyst is predicated on rigorous assessment across three core metrics: catalytic efficiency (kcat/KM), substrate specificity, and structural stability. These protocols outline the essential workflows for characterizing ProteinMPNN-designed enzymes, enabling the iterative refinement of design models.

Protocol 1: Determination of Catalytic Efficiency (kcat/KM)

Objective: To quantify the fundamental catalytic proficiency of the designed enzyme under steady-state conditions.

Principle: Initial reaction velocities are measured across a range of substrate concentrations. The Michaelis-Menten parameters (KM and Vmax) are derived via nonlinear regression, from which kcat (Vmax/[E]) and the specificity constant kcat/KM are calculated.

Procedure:

  • Enzyme Purification: Express the ProteinMPNN-designed sequence (e.g., via a pET vector in E. coli BL21(DE3)) and purify using immobilized metal affinity chromatography (IMAC) followed by size-exclusion chromatography. Determine pure enzyme concentration spectrophotometrically (A280).
  • Initial Rate Assay: In a 96-well plate, prepare serial dilutions of the primary substrate in assay buffer (e.g., 50 mM HEPES, pH 7.5, 100 mM NaCl).
  • Reaction Initiation: Initiate reactions by adding a fixed, low concentration of enzyme (typically 1-100 nM) to each substrate well. Monitor product formation continuously for 1-5 minutes using a plate reader (e.g., absorbance, fluorescence, or coupled assay detection).
  • Data Analysis: For each substrate concentration [S], calculate the initial velocity (v0). Fit the data (v0 vs. [S]) to the Michaelis-Menten equation: v0 = (Vmax * [S]) / (KM + [S]). Calculate kcat = Vmax / [Etotal].
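
Unit handling is the most common source of error when turning fitted parameters into kcat/KM. The sketch below makes the conversions explicit; the function name, the chosen input units, and the example values are hypothetical and are not taken from the tables in this guide.

```python
def catalytic_efficiency(vmax_um_per_s, enzyme_nm, km_mm):
    """Return (kcat in s^-1, kcat/KM in M^-1 s^-1).

    vmax_um_per_s: Vmax in micromolar per second
    enzyme_nm:     total enzyme concentration in nanomolar
    km_mm:         KM in millimolar
    """
    enzyme_um = enzyme_nm * 1e-3      # nM -> uM, matching the Vmax unit
    kcat = vmax_um_per_s / enzyme_um  # (uM/s) / uM = s^-1
    km_molar = km_mm * 1e-3           # mM -> M
    return kcat, kcat / km_molar


# Hypothetical fit: Vmax = 0.6 uM/s with 50 nM enzyme and KM = 0.4 mM.
kcat, efficiency = catalytic_efficiency(vmax_um_per_s=0.6, enzyme_nm=50.0, km_mm=0.4)
print(f"kcat = {kcat:.1f} s^-1, kcat/KM = {efficiency:.2e} M^-1 s^-1")
```

Converting KM to molar before dividing is what yields the conventional M⁻¹s⁻¹ unit; dividing by KM in mM or µM silently shifts the result by three or six orders of magnitude.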

Table 1: Representative Catalytic Efficiency Data for a Designed Retro-Aldolase

Design Variant (Source) | KM (µM) | kcat (s⁻¹) | kcat/KM (M⁻¹s⁻¹) | Fold Improvement vs. Initial Design
ProteinMPNN-Round 1 | 4.7 ± 0.5 | 0.023 ± 0.002 | 4.9 x 10³ | (Baseline)
ProteinMPNN-Round 3 | 2.1 ± 0.3 | 0.18 ± 0.01 | 8.6 x 10⁴ | 17.5
Wild-type (Natural) | 0.8 ± 0.1 | 12.5 ± 0.8 | 1.6 x 10⁷ | 3265

Protocol 2: Profiling Substrate Specificity & Promiscuity

Objective: To evaluate the designed enzyme's selectivity for its primary substrate versus analogous substrates, a key indicator of a precise, evolution-like design.

Principle: Catalytic efficiency (kcat/KM, the specificity constant) is determined for each member of a panel of substrate analogs. The ratio of the efficiencies for two substrates defines the selectivity between them.

Procedure:

  • Substrate Panel Design: Curate a panel of 5-10 structurally related compounds, including the primary target substrate and analogs with varied functional groups or chain lengths.
  • High-Throughput Screening: Perform endpoint or kinetic assays in a multi-well format for all substrates at a fixed, saturating concentration (e.g., 10 x KM for the primary substrate). Normalize activity to the primary substrate.
  • Detailed Kinetics: For substrates showing >20% activity, perform full Michaelis-Menten analysis as per Protocol 1.
  • Specificity Index: Calculate the selectivity ratio as (kcat/KM)Substrate A / (kcat/KM)Substrate B.

Table 2: Substrate Specificity Profile of a Designed Hydrolase

Substrate (R-Group) | Relative Activity at 10 mM (%) | kcat/KM (M⁻¹s⁻¹) | Selectivity vs. Primary Substrate
Primary: C4-Alkyl | 100 ± 5 | 2.1 x 10⁵ | 1.0
C2-Alkyl | 15 ± 2 | 1.8 x 10⁴ | 0.086
C6-Alkyl | 42 ± 4 | 6.7 x 10⁴ | 0.32
Aryl | < 1 | ND* | < 0.005
*ND: Not Determined
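
The selectivity column follows directly from the kcat/KM values in Table 2, as this short sketch shows. The function and variable names are illustrative.

```python
def selectivity(efficiency_substrate, efficiency_reference):
    """Selectivity ratio = (kcat/KM of a substrate) / (kcat/KM of the reference)."""
    return efficiency_substrate / efficiency_reference


# kcat/KM values (M^-1 s^-1) taken from Table 2.
efficiencies = {
    "C4-Alkyl (primary)": 2.1e5,
    "C2-Alkyl": 1.8e4,
    "C6-Alkyl": 6.7e4,
}
primary = efficiencies["C4-Alkyl (primary)"]
ratios = {name: selectivity(value, primary) for name, value in efficiencies.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```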

Protocol 3: Assessing Thermal and Chemical Stability

Objective: To measure the robustness of the designed enzyme fold, a critical property for industrial applications and a proxy for successful de novo folding.

Principle: Stability is assessed by monitoring the loss of catalytic activity or structural integrity under thermal or chemical denaturation.

A. Thermostability via Tm Measurement (Differential Scanning Fluorimetry, DSF):

  • Sample Preparation: Mix 20 µL of enzyme (2-5 µM in assay buffer) with 5 µL of a fluorescent dye (e.g., SYPRO Orange).
  • Thermal Ramp: In a real-time PCR instrument, heat samples from 25°C to 95°C at a rate of 1°C/min, monitoring fluorescence.
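  • Data Analysis: Plot the first derivative of fluorescence with respect to temperature (dF/dT). The temperature at the derivative maximum, which corresponds to the inflection point of the melt curve (midpoint of denaturation), is reported as the Tm.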

B. Long-Term Stability at 37°C:

  • Incubation: Aliquot the purified enzyme and incubate at 37°C. Remove samples at defined time points (0, 1, 2, 4, 7, 14 days).
  • Activity Assay: Quickly cool samples and measure residual activity under standard assay conditions (Protocol 1, single-point).
  • Analysis: Fit remaining activity (%) vs. time to a first-order decay model to determine the inactivation half-life (t1/2).
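For a first-order decay, A(t) = A0·exp(-k·t), so ln A is linear in t and the half-life is t1/2 = ln(2)/k. The sketch below fits that line by ordinary least squares. It is a minimal illustration on hypothetical activity values generated from an ideal 7-day half-life, not data from this study, and the function name `half_life` is an assumption.

```python
import math


def half_life(times, activities):
    """Fit ln(activity) = ln(A0) - k*t by least squares; return (k, t_half)."""
    logs = [math.log(a) for a in activities]  # activities must be > 0
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    k = -sxy / sxx  # decay constant is the negative slope
    return k, math.log(2) / k


# Hypothetical residual activities (%) at the protocol's time points (days),
# generated from an ideal first-order decay with a 7-day half-life.
days = [0, 1, 2, 4, 7, 14]
residual = [100 * 0.5 ** (d / 7) for d in days]

k, t_half = half_life(days, residual)
print(f"k = {k:.4f} per day, t1/2 = {t_half:.2f} days")
```

The log-linear fit gives extra weight to low-activity points, so a nonlinear fit on the untransformed data may be preferable when late time points are noisy.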

Table 3: Stability Metrics for Designed Enzyme Variants

| Design Variant | Tm (°C) | Half-life at 37°C (days) | Residual Activity after 4 h at 50°C |
| --- | --- | --- | --- |
| Initial Scaffold | 45.2 ± 0.3 | 2.1 ± 0.3 | 15 ± 2% |
| ProteinMPNN-Optimized | 62.8 ± 0.5 | 21.5 ± 2.1 | 89 ± 4% |
| Thermostable Homologue | 75.1 ± 0.4 | >60 | 98 ± 1% |

The Scientist's Toolkit: Research Reagent Solutions

| Item / Reagent | Function in Enzyme Assessment |
| --- | --- |
| His-tag Purification System (Ni-NTA Resin) | Rapid, standardized immobilization and purification of designed enzymes expressed with an N- or C-terminal hexahistidine tag. |
| SYPRO Orange Dye | Environment-sensitive fluorescent probe for DSF, reporting protein unfolding as a function of temperature (Tm). |
| Coupled Enzyme Assay Kits (e.g., NADH/NADPH linked) | Enable continuous, spectrophotometric monitoring of reactions where product formation is not directly detectable. |
| Size-Exclusion Chromatography (SEC) Standards | To assess the oligomeric state and monodispersity of the purified design (monomer vs. aggregate). |
| Protease Inhibitor Cocktails | Prevent unintended proteolysis of designed enzymes, especially important for novel folds that may have exposed loops. |
| Chaotropic Agents (Urea, GdnHCl) | Used in chemical denaturation titrations to measure conformational stability (ΔG of folding). |

Experimental Workflow and Data Integration Diagrams

ProteinMPNN Designed Sequence → 1. Expression & Purification → 2. Functional Characterization → 3. Stability Assessment → Quantitative Metrics (Tables 1, 2, 3) → [Refine Input Sequences] → Feedback for Design Iteration → back to ProteinMPNN Designed Sequence

Title: Workflow for Validating Designed Enzymes

Thesis: ProteinMPNN for De Novo Enzyme Design → Catalytic Efficiency (kcat/KM), Substrate Specificity, Structural Stability (Tm) → Assessment Outcome → Validated Functional Biocatalyst

Title: Core Metrics Define Design Success

Machine learning tools such as ProteinMPNN, which proposes high-probability sequences for a given backbone scaffold, have reshaped de novo enzyme design. However, the utility of these designs hinges on experimental validation. This Application Note, framed within a thesis on ProteinMPNN for de novo enzyme sequence design, catalogs community resources and repositories that archive validated designs, enabling researchers to build upon proven successes and accelerate the design-test-learn cycle.

Key Community Repositories and Databases

The following table summarizes the primary public repositories containing experimentally characterized ProteinMPNN-generated protein designs. These resources provide essential data on design success rates, structural validation, and functional metrics.

Table 1: Primary Repositories for Validated ProteinMPNN Designs

| Repository Name | Primary Focus | Key Metrics Provided | Data Types | Access Link |
| --- | --- | --- | --- | --- |
| Protein Data Bank (PDB) | Experimentally determined structures | Resolution, R-factors, RMSD | Structure coordinates, EM maps | rcsb.org |
| Zenodo Community | General scientific data archive | Validation data (CD, SPR, activity) | Raw data, analysis scripts | zenodo.org/communities/proteinmpnn |
| GitHub sd-validated-designs | Curated validated designs | Success rate, melting temp (Tm), activity | Sequences, PDB files, protocols | github.com/.../sd-validated-designs |
| ModelArchive | Computational models | Confidence scores, model quality | Predicted structures | modelarchive.org |
| UniProt | Protein sequence and functional information | Functional annotations, stability data | Annotated sequences | uniprot.org |

Table 2: Quantitative Validation Metrics from Key Studies (2023-2024)

| Study Focus (Repository ID) | Designs Tested | Experimental Success Rate | Avg. Tm (°C) | Key Functional Metric |
| --- | --- | --- | --- | --- |
| De novo enzyme scaffolds (ZEN-101) | 50 | 42% (21/50) | 68.5 ± 12.3 | Catalytic efficiency (kcat/KM) > 10³ M⁻¹s⁻¹ |
| Symmetric protein assemblies (ZEN-102) | 25 | 76% (19/25) | 82.1 ± 9.7 | Assembly yield > 90% by SEC |
| Binding protein design (GIT-001) | 100 | 65% (65/100) | 71.2 ± 10.5 | KD < 100 nM by BLI |
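Success rates measured on 25 to 100 designs carry appreciable sampling uncertainty, which matters when comparing studies. As one hedged way to express this, the sketch below computes a Wilson score interval for each success rate in Table 2. The choice of the Wilson interval and the function name `wilson_interval` are assumptions made here for illustration; the source studies do not state how, or whether, they reported uncertainty.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Success counts copied from Table 2: (successful designs, designs tested).
studies = {"ZEN-101": (21, 50), "ZEN-102": (19, 25), "GIT-001": (65, 100)}
for study_id, (hits, tested) in studies.items():
    low, high = wilson_interval(hits, tested)
    print(f"{study_id}: {hits / tested:.0%} (95% CI {low:.0%} to {high:.0%})")
```

Overlapping intervals suggest that an apparent difference in success rate between two design campaigns may not be meaningful at these sample sizes.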

Detailed Experimental Protocols

Protocol 1: High-Throughput Expression and Thermal Stability Screening for ProteinMPNN Designs

This protocol is standard for initial biophysical validation, as referenced in datasets ZEN-101 and GIT-001.

Materials & Reagents:

  • Cloning: pET-29b(+) vector, NdeI/XhoI restriction enzymes, T4 DNA ligase.
  • Expression: BL21(DE3) E. coli cells, LB broth, Kanamycin (50 µg/mL), Isopropyl β-D-1-thiogalactopyranoside (IPTG).
  • Purification: Ni-NTA Agarose, Lysis Buffer (50 mM Tris, 300 mM NaCl, 10 mM Imidazole, pH 8.0), Elution Buffer (50 mM Tris, 300 mM NaCl, 250 mM Imidazole, pH 8.0).
  • Analysis: SYPRO Orange protein dye, 96-well PCR plates, Real-Time PCR instrument.

Procedure:

  • Gene Synthesis & Cloning: Codon-optimize designed sequences and clone into pET-29b(+) via NdeI/XhoI sites. Transform into DH5α for plasmid propagation.
  • Small-Scale Expression: Transform purified plasmid into BL21(DE3). Grow 2 mL cultures at 37°C to OD600 ~0.6. Induce with 0.5 mM IPTG. Express for 18 hours at 18°C.
  • Purification: Pellet cells, resuspend in Lysis Buffer, and lyse by sonication. Clarify lysate by centrifugation. Pass supernatant over 200 µL Ni-NTA resin, wash with 10 mL Lysis Buffer, and elute with 500 µL Elution Buffer.
  • Differential Scanning Fluorimetry (DSF): Dilute purified protein to 0.2 mg/mL in final buffer (e.g., PBS). Mix 10 µL protein with 10 µL of 5X SYPRO Orange dye in a PCR plate. Run a melt curve from 25°C to 95°C at a ramp rate of 1°C/min in a real-time PCR machine, recording fluorescence throughout.
  • Analysis: Calculate Tm as the inflection point of the melt curve using the first derivative. Designs with a single cooperative unfolding transition and Tm > 55°C proceed to further characterization.
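The triage rule in the analysis step (a single cooperative transition and Tm > 55°C) can be sketched as follows. Transitions are counted as local maxima of dF/dT that reach a set fraction of the largest slope. That fraction (`peak_fraction`), the function names, and the synthetic melt curves are illustrative assumptions rather than part of the protocol, and noisy experimental traces would need smoothing before peak counting.

```python
import math


def first_derivative(temps, signal):
    """Central-difference dF/dT at interior points; returns (temps, slopes)."""
    slopes = [
        (signal[i + 1] - signal[i - 1]) / (temps[i + 1] - temps[i - 1])
        for i in range(1, len(temps) - 1)
    ]
    return temps[1:-1], slopes


def triage(temps, signal, tm_cutoff=55.0, peak_fraction=0.3):
    """Return (tm, n_transitions, passes) for one melt curve.

    A transition is a local maximum of dF/dT reaching at least
    peak_fraction of the largest slope (an illustrative threshold).
    """
    mids, slopes = first_derivative(temps, signal)
    top = max(slopes)
    peaks = [
        i
        for i in range(1, len(slopes) - 1)
        if slopes[i - 1] < slopes[i] >= slopes[i + 1] and slopes[i] >= peak_fraction * top
    ]
    tm = mids[slopes.index(top)]  # temperature of the steepest rise
    return tm, len(peaks), len(peaks) == 1 and tm > tm_cutoff


def sigmoid(t, midpoint, width=2.0):
    """Idealized two-state unfolding curve used to build synthetic data."""
    return 1.0 / (1.0 + math.exp((midpoint - t) / width))


# Synthetic curves (not measured data), sampled every 0.5 C from 25 to 95 C.
temps = [25 + 0.5 * i for i in range(141)]
curves = {
    "stable": [sigmoid(t, 62.0) for t in temps],
    "low_tm": [sigmoid(t, 48.0) for t in temps],
    "two_step": [0.4 * sigmoid(t, 45.0) + 0.6 * sigmoid(t, 70.0) for t in temps],
}
for name, signal in curves.items():
    tm, n_transitions, passes = triage(temps, signal)
    print(f"{name}: Tm={tm:.1f} C, transitions={n_transitions}, proceed={passes}")
```

In this toy example the two-step curve is rejected even though its dominant transition lies above the cutoff, which is the reason for checking cooperativity as well as Tm.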

Protocol 2: Structural Validation by X-ray Crystallography

This protocol follows the workflow used to deposit structures in the PDB from validated designs.

Procedure:

  • Large-Scale Expression & Purification: Scale up expression from Protocol 1 to 1 L culture. Purify via Ni-NTA followed by Size Exclusion Chromatography (Superdex 75) in crystallization buffer (e.g., 20 mM HEPES, 150 mM NaCl, pH 7.5).
  • Crystallization: Screen concentrated protein (>10 mg/mL) using commercial sparse-matrix screens (e.g., Hampton Research) via sitting-drop vapor diffusion at 20°C.
  • Data Collection & Processing: Flash-cool crystals in liquid N2 with appropriate cryoprotectant. Collect diffraction data at a synchrotron beamline. Index, integrate, and scale data using XDS or HKL-2000.
  • Structure Determination: Solve the structure by molecular replacement in Phaser, using the original design model (the backbone model for which ProteinMPNN designed the sequence) as the search model. Perform iterative model building (Coot) and refinement (PHENIX.refine).
  • Deposition: Calculate Root-Mean-Square Deviation (RMSD) between the designed model and the experimental structure. Annotate validation reports from MolProbity. Deposit final coordinates and structure factors to the PDB.
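The RMSD reported at deposition is the root-mean-square distance between equivalent atoms of the design model and the experimental structure. The sketch below computes it for paired Cα coordinates. It assumes the two structures have already been superimposed and that residues are listed in the same order, so no alignment step is included. The coordinates are hypothetical and the function name `ca_rmsd` is an assumption.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD (in the input units, typically angstroms) over paired atoms.

    Assumes both coordinate lists are in the same residue order and have
    already been superimposed; no alignment is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical C-alpha coordinates for a three-residue fragment; each atom
# in the "crystal" set is displaced by exactly 1 angstrom from the design.
design = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
crystal = [(0.0, 0.0, 1.0), (3.8, 1.0, 0.0), (7.6, 0.0, -1.0)]
print(f"C-alpha RMSD: {ca_rmsd(design, crystal):.2f} A")
```

In practice the superposition and atom pairing would come from a structural biology toolkit, and the same atom selection (for example, Cα only versus all backbone atoms) should be stated alongside the reported value.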

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions

| Item | Function in Validation Pipeline | Example Product/Kit |
| --- | --- | --- |
| Codon-Optimized Gene Fragments | Ensures high-yield expression in heterologous systems. | Twist Bioscience Gene Fragments, IDT gBlocks. |
| High-Efficiency Cloning Kit | Rapid and reliable assembly of expression constructs. | NEB HiFi DNA Assembly Master Mix. |
| Nickel Affinity Resin | Standardized capture of polyhistidine-tagged designs. | Cytiva HisTrap HP columns. |
| DSF-Compatible Dye | Label-free protein unfolding measurement for stability. | Thermo Fisher SYPRO Orange Protein Gel Stain. |
| Crystallization Screen Kits | Initial identification of crystallization conditions. | Hampton Research Index Screen. |
| Surface Plasmon Resonance (SPR) Chip | Quantifying binding kinetics of designed binders. | Cytiva Series S Sensor Chip CM5. |

Visualization of Workflows

Diagram 1: Validation and Deposition Workflow for Designed Proteins

ProteinMPNN Sequence Design → Gene Synthesis & Cloning → Small-Scale Expression & Purification → Primary Validation (DSF, SEC) → [Stable/Well-Behaved] → Advanced Validation (Crystallography, Activity) → [Experimentally Confirmed] → Public Repository (PDB, Zenodo, GitHub)

Diagram 2: Data and Resource Ecosystem for ProteinMPNN Research

Input Backbone (PDB ID or Model) → ProteinMPNN Tool → Designed Sequences → [Test] → Researcher → Validation Data (Stability, Activity, Structure) → Community Repositories. Feedback paths: Community Repositories → [Provide Validated Scaffolds] → Input Backbone; Community Repositories → [Inform & Seed New Designs] → Researcher.

Conclusion

ProteinMPNN marks a major shift in de novo enzyme design, offering fast generation of diverse candidate sequences for a given backbone scaffold. By mastering its foundational principles, methodological workflows, optimization strategies, and validation frameworks, researchers can significantly accelerate the discovery of novel enzymes for therapeutics, biocatalysis, and synthetic biology. The future of the field lies in tighter integration with structure generation models (e.g., RFdiffusion), the development of models trained on explicit functional data, and the application of these pipelines to design complex multi-enzyme systems and allosteric regulators. For drug development professionals, this technology paves the way for the rapid creation of engineered enzymes as targeted therapies, diagnostics, and sustainable manufacturing tools.