CATNIP Tool: Revolutionizing AKG-Dependent Enzyme Discovery for Drug Development

Samuel Rivera, Jan 09, 2026

This article provides a comprehensive guide to the CATNIP (Conserved Active-site Typing for Natural Product discovery) computational tool, designed to predict and analyze alpha-ketoglutarate (AKG) and Fe(II)-dependent oxygenases and oxidases.

Abstract

This article provides a comprehensive guide to the CATNIP (Conserved Active-site Typing for Natural Product discovery) computational tool, designed to predict and analyze alpha-ketoglutarate (AKG) and Fe(II)-dependent oxygenases and oxidases. Targeted at researchers and drug development professionals, we explore the foundational biology of these clinically significant enzymes, detail the methodological application of CATNIP for gene cluster analysis and enzyme function prediction, address common troubleshooting and optimization strategies for accurate results, and validate CATNIP's performance against other bioinformatics methods. This resource equips scientists with the knowledge to leverage CATNIP for accelerating natural product discovery and therapeutic target identification.

Understanding AKG/FeII Enzymes: Why They Are Critical Therapeutic Targets

The Biological Role of AKG/FeII-Dependent Enzymes in Human Disease and Natural Products

Alpha-ketoglutarate (AKG)/Fe(II)-dependent dioxygenases are a vast superfamily of enzymes critical for numerous biological processes, including hypoxia sensing, epigenetic regulation, collagen biosynthesis, and natural product biosynthesis. Dysregulation of these enzymes is implicated in cancer, anemia, fibrosis, and neurodegenerative diseases. The CATNIP framework is a thesis research project aimed at developing a machine learning-based tool for the de novo prediction and functional annotation of AKG/FeII-dependent enzymes from genomic and metagenomic data. The tool uses structural features and conserved sequence motifs to identify novel enzymes, accelerating the discovery of therapeutic targets and biosynthetic pathways for natural products. The following application notes and protocols are framed within the development and validation pipeline of the CATNIP tool.

Key Enzymes, Associated Diseases, and Natural Products

Table 1: Major Human AKG/FeII-Dependent Enzymes: Roles and Disease Links

Enzyme | Primary Function | Associated Human Disease | Therapeutic Relevance
Prolyl Hydroxylase (PHD/EGLN) | HIF-α hydroxylation, targeting for degradation | Polycythemia, ischemic diseases, cancer | PHD inhibitors (Roxadustat) for anemia
Factor Inhibiting HIF (FIH) | HIF-α asparaginyl hydroxylation | Altered metabolism in cancers | Potential cancer therapeutics
TET Methylcytosine Dioxygenase | DNA demethylation (5mC to 5hmC) | Acute myeloid leukemia, neuro disorders | Epigenetic therapy targets
JumonjiC (JMJC) Histone Demethylases | Histone lysine demethylation | Various cancers, developmental defects | Targeted epigenetic inhibitors
Collagen Prolyl-4-Hydroxylase | Collagen maturation | Fibrosis, scleroderma, wound healing | Inhibitors for anti-fibrotic therapy

Table 2: AKG/FeII Enzymes in Natural Product Biosynthesis

Enzyme Class | Example Reaction | Natural Product | Bioactivity
Beta-Lactam Synthetase | Ring formation in carbapenem | Thienamycin (antibiotic) | Broad-spectrum antibacterial
Clavaminic Acid Synthase | Multiple oxidations & ring expansion | Clavulanic acid | β-lactamase inhibitor
Hydroxylases (e.g., AsmH) | Aliphatic hydroxylation | Antascomicin B | Immunosuppressant, FKBP ligand
Dioxygenases in Siderophore Pathways | Hydroxylation of amino acids | Enterobactin, Desferrioxamine | Iron chelation, antimicrobial

Research Reagent Solutions Toolkit

Table 3: Essential Reagents for AKG/FeII Enzyme Research

Reagent / Material | Function / Application | Example Vendor / Cat. No.
Recombinant AKG/FeII Enzyme (e.g., human PHD2) | In vitro activity assays, inhibitor screening. | Sigma-Aldrich (recombinant)
AKG (α-Ketoglutarate), Sodium Salt | Essential co-substrate for enzymatic reactions. | Thermo Fisher Scientific J60789
Ascorbic Acid (Vitamin C) | Reductant to maintain Fe(II) in active state. | MilliporeSigma A5960
Ferrous Ammonium Sulfate (Fe(II)) | Source of catalytic iron cofactor. | Alfa Aesar 33332
Succinate Detection Kit | Quantifies reaction product (competitive with AKG). | Abcam ab204718
Anti-5-Hydroxymethylcytosine (5hmC) Antibody | Detects TET enzyme activity in cells/tissues. | Active Motif 39769
Dimethyloxaloylglycine (DMOG) | Pan-inhibitor of AKG/FeII dioxygenases (cell studies). | Cayman Chemical 71210
HIF-PHD Inhibitor (e.g., Roxadustat) | Specific inhibitor for hypoxic signaling studies. | MedChemExpress HY-13426
Custom Peptide Substrates (e.g., HIF-1α CODD) | Substrates for hydroxylase activity assays. | GenScript (custom synthesis)
LC-MS/MS System | Gold-standard for detecting hydroxylation products. | Waters, Thermo Scientific

Experimental Protocols

Protocol 4.1: In Vitro Hydroxylase Activity Assay for Recombinant PHD2

Objective: To measure the enzymatic activity of a recombinant AKG/FeII enzyme by quantifying succinate production.

Principle: The reaction converts AKG and O₂ to succinate and CO₂ proportionally to substrate hydroxylation.

  • Reaction Setup: In a 50 µL final volume, combine:
    • 50 mM HEPES buffer (pH 7.4)
    • 100 µM Fe(II)(NH₄)₂(SO₄)₂
    • 1 mM Ascorbate
    • 2 mM AKG
    • 5 µM Catalase (to degrade H₂O₂)
    • 10 µg recombinant PHD2 enzyme
    • 50 µM HIF-1α CODD peptide substrate.
  • Initiation & Incubation: Add the AKG last to start the reaction, then incubate at 37°C for 30 min.
  • Termination: Add 10 µL of 10% (v/v) H₂SO₄ to stop the reaction.
  • Detection: Use a commercial succinate colorimetric/fluorometric assay kit. Add 90 µL of assay mix to each well, incubate 30 min, and read absorbance/fluorescence. Compare to a succinate standard curve.
  • Controls: Include no-enzyme and no-substrate controls.
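The standard-curve step above can be sketched in a few lines of Python. This is a minimal illustration, assuming a linear kit response over the working range; the standard concentrations, absorbance readings, and function names are hypothetical and not taken from any kit manual.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def succinate_from_signal(signal, slope, intercept):
    """Invert the standard curve to get succinate concentration (µM)."""
    return (signal - intercept) / slope


# Hypothetical succinate standards (µM) and their absorbance readings.
standards_um = [0, 10, 20, 40, 80]
absorbance = [0.05, 0.15, 0.25, 0.45, 0.85]
slope, intercept = fit_line(standards_um, absorbance)

# Subtract the no-enzyme control so only enzyme-dependent succinate remains.
sample_um = succinate_from_signal(0.55, slope, intercept)
no_enzyme_um = succinate_from_signal(0.10, slope, intercept)
net_succinate_um = sample_um - no_enzyme_um
print(f"Net succinate formed: {net_succinate_um:.1f} µM")
```

Subtracting the no-enzyme control before interpreting the result removes background succinate and any non-enzymatic AKG decarboxylation.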

Protocol 4.2: Cellular 5hmC Detection via Dot Blot (TET Activity Readout)

Objective: To assess global TET enzyme activity in cultured cells treated with inhibitors or under specific conditions.

  • Genomic DNA (gDNA) Isolation: Extract gDNA from ~1x10⁶ cells using a phenol-chloroform or column-based method. Measure concentration.
  • DNA Denaturation and Spotting: Denature 200-500 ng of gDNA in 0.4 M NaOH/10 mM EDTA at 95°C for 10 min, then chill on ice. Spot denatured DNA onto a nitrocellulose membrane using a vacuum manifold or manual pipetting. Air-dry.
  • Membrane Processing: Cross-link DNA via UV (120 mJ/cm²). Block membrane with 5% non-fat milk in TBST for 1 hr.
  • Immunodetection: Incubate with primary Anti-5hmC antibody (1:5000 in blocking buffer) overnight at 4°C. Wash 3x with TBST. Incubate with HRP-conjugated secondary antibody (1:5000) for 1 hr at RT. Wash and develop with ECL reagent. Image.
  • Normalization: Strip and re-probe membrane with Anti-ssDNA antibody (1:1000) to confirm equal DNA loading.
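The normalization step can be expressed as a simple ratio calculation. The sketch below assumes densitometry values have already been extracted from the blot images; the intensities, sample names, and function names are hypothetical.

```python
def normalized_signal(hmc_intensity, ssdna_intensity):
    """Divide the 5hmC signal by the ssDNA loading signal for one spot."""
    if ssdna_intensity <= 0:
        raise ValueError("ssDNA loading signal must be positive")
    return hmc_intensity / ssdna_intensity


def relative_5hmc(samples, control_name):
    """Express each loading-normalized 5hmC signal relative to the control."""
    normalized = {
        name: normalized_signal(hmc, ssdna)
        for name, (hmc, ssdna) in samples.items()
    }
    reference = normalized[control_name]
    return {name: value / reference for name, value in normalized.items()}


# Hypothetical densitometry values: (5hmC intensity, ssDNA intensity).
spots = {"vehicle": (1000.0, 500.0), "DMOG": (600.0, 600.0)}
relative = relative_5hmc(spots, "vehicle")
for name, value in relative.items():
    print(f"{name}: {value:.2f} of control")
```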

Protocol 4.3: CATNIP Tool Validation – In Silico Screening & In Vitro Confirmation

Objective: To validate a novel AKG/FeII enzyme candidate predicted by the CATNIP tool.

  • In Silico Prediction: Input query protein sequence into CATNIP. Tool outputs predicted active site residues (2-His-1-Asp/Glu facial triad), Fe(II) and AKG binding motifs, and a probability score.
  • Cloning & Expression: Clone the candidate gene into an appropriate expression vector (e.g., pET series). Express in E. coli BL21(DE3) with 0.5 mM IPTG induction at 16°C for 18h.
  • Protein Purification: Purify recombinant protein via His-tag affinity chromatography. Confirm purity via SDS-PAGE.
  • Activity Screening Assay: Set up a generic hydroxylation assay (as in Protocol 4.1) with potential substrates (e.g., generic peptides, small molecules). Use LC-MS/MS to detect hydroxylation or succinate production.
  • Data Integration: Correlate in vitro activity with CATNIP's structural prediction to validate tool accuracy.

Visualizations

[Diagram: HIF-α is hydroxylated by PHD/EGLN (an AKG/FeII enzyme; requires O₂, AKG, Fe²⁺). Hydroxylated HIF-α is passed to the VHL E3 ubiquitin ligase complex and then to proteasomal degradation. Hypoxia or a PHD inhibitor inhibits PHD, so HIF-α is stabilized and transactivates its targets (angiogenesis, glycolysis, erythropoiesis).]

Title: HIF-alpha Regulation by PHD AKG/FeII Enzyme

[Diagram: Genomic/metagenomic sequence data → CATNIP prediction tool → Fe(II)-binding motif detector, AKG-binding motif detector, and ML classifier (RF/SVM) → predicted AKG/FeII enzyme (probability score, features) → candidate selection → experimental validation (cloning, assay).]

Title: CATNIP Tool Workflow for Enzyme Prediction

[Diagram: Methylated cytosine (5mC) → TET dioxygenase (AKG/FeII enzyme; oxidation requires O₂, AKG, Fe²⁺) → 5-hydroxymethylcytosine (5hmC) → further TET oxidation (5fC, 5caC) → base excision repair (BER) → unmethylated cytosine.]

Title: TET Enzyme Pathway in Active DNA Demethylation

Application Notes

The CATNIP framework provides a unified approach for identifying and characterizing α-ketoglutarate (αKG) and Fe(II)-dependent dioxygenases. These enzyme families, which include JmjC histone demethylases (KDMs), TET enzymes, and prolyl hydroxylases (PHDs), play pivotal roles in epigenetics, hypoxia sensing, and cellular metabolism, making them prime targets for therapeutic intervention in cancer, anemia, and inflammatory diseases. CATNIP integrates sequence homology, 3D structural motif analysis, and cofactor binding site prediction to classify novel enzymes and predict their substrate specificity. This supports drug discovery pipelines by flagging potential off-target effects and informing the design of selective inhibitors.

Table 1: Key Biochemical and Functional Parameters of αKG/Fe(II)-Dependent Enzyme Families

Enzyme Family | Representative Members | Primary Substrate | Catalytic Product | Apparent Km for αKG (μM) | Required Cofactors | Associated Diseases
JmjC KDMs | KDM4A, KDM6A | Methylated Lysine on Histones (H3K9me3, H3K27me3) | Demethylated Lysine + Formaldehyde | 5 - 50 | αKG, Fe(II), O₂, Ascorbate | Various cancers, Intellectual disability disorders
TET Enzymes | TET1, TET2, TET3 | 5-Methylcytosine (5mC) in DNA | 5-Hydroxymethylcytosine (5hmC) & further oxidized products | 50 - 150 | αKG, Fe(II), O₂, Ascorbate | Leukemias, Myelodysplastic syndromes
Prolyl Hydroxylases | PHD2 (EGLN1), HIF-PH | Hypoxia-Inducible Factor-α (HIF-α) | Hydroxylated Proline on HIF-α | 20 - 100 | αKG, Fe(II), O₂ | Anemia, Chronic Kidney Disease, Ischemia

Table 2: CATNIP Prediction Output Metrics for Validated Targets

Predicted Enzyme (UniProt ID) | CATNIP Score (0-1) | Predicted Family | Experimental Validation | Validated Substrate | Reference Inhibitor (IC₅₀)
Q9H6I2 | 0.98 | JmjC KDM | Yes | H3K36me2 | JIB-04 (~0.5 μM)
Q6N021 | 0.94 | TET Enzyme | Yes | 5mC in CpG context | Bobcat339 (~2.1 μM)
Q9GZT9 | 0.87 | Prolyl Hydroxylase | Yes | HIF-1α Proline 564 | Roxadustat (FG-4592, ~0.5 μM)

Experimental Protocols

Protocol 1: CATNIP-Based In Silico Identification of αKG/Fe(II)-Dependent Enzymes

Objective: To identify and classify potential αKG/Fe(II)-dependent enzymes from a novel genomic or metagenomic dataset using the CATNIP tool.

Materials:

  • FASTA file of protein sequences of interest.
  • Installed CATNIP software suite (available from [repository link]).
  • Reference HMM profiles for JmjC, TET, and PHD catalytic domains.
  • High-performance computing cluster (recommended).

Methodology:

  • Preprocessing: Clean the input FASTA file. Remove redundant sequences using CD-HIT (95% identity cutoff).
  • Domain Scan: Run catnip_scan with the --hmmlib flag pointing to the curated library of αKG/Fe(II) enzyme hidden Markov models (HMMs).
  • Active Site Prediction: For sequences scoring above threshold (e.g., E-value < 1e-10), execute catnip_pocket to predict the 3D structure of the catalytic core and identify the conserved Fe(II)-coordinating residues (e.g., the HXD/E...H motif) and the residues that bind αKG.
  • Classification: The tool assigns a family based on specific sequence motifs:
    • JmjC: Presence of JmjN domain upstream of the JmjC domain.
    • TET: Large C-terminal catalytic domain with a cysteine-rich region.
    • PHD: Double-stranded β-helix (DSBH) fold with characteristic β2β3 insert.
  • Output Analysis: Review the results.csv file containing CATNIP scores, predicted family, and active site residue coordinates. Sequences with scores >0.9 are high-confidence predictions.
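The output-analysis step can be scripted once the layout of results.csv is known. The sketch below is a minimal example that keeps predictions scoring above 0.9; the column names (sequence_id, catnip_score, predicted_family) and the in-memory example rows are assumptions for illustration, since the actual file layout is not specified here.

```python
import csv
import io

HIGH_CONFIDENCE = 0.9  # score cutoff stated in the protocol


def high_confidence_hits(handle, threshold=HIGH_CONFIDENCE):
    """Return rows scoring above the threshold, best score first."""
    reader = csv.DictReader(handle)
    hits = [row for row in reader if float(row["catnip_score"]) > threshold]
    hits.sort(key=lambda row: float(row["catnip_score"]), reverse=True)
    return hits


# Hypothetical file contents; replace with open("results.csv", newline="").
example = io.StringIO(
    "sequence_id,catnip_score,predicted_family\n"
    "seq_001,0.98,JmjC\n"
    "seq_002,0.72,PHD\n"
    "seq_003,0.94,TET\n"
)
hits = high_confidence_hits(example)
for row in hits:
    print(row["sequence_id"], row["catnip_score"], row["predicted_family"])
```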

Protocol 2: In Vitro Demethylase/Hydroxylase Activity Assay for CATNIP-Validated Targets

Objective: To experimentally validate the catalytic activity of a protein predicted by CATNIP as a JmjC, TET, or PHD enzyme.

Materials:

  • Purified recombinant protein (from Protocol 3).
  • Substrate: Recombinant nucleosome (for JmjC), 5mC-containing DNA oligo (for TET), or HIF-α peptide (for PHD).
  • Reaction Buffer: 50 mM HEPES (pH 7.5), 50 μM (NH₄)₂Fe(SO₄)₂, 1 mM α-ketoglutarate, 2 mM Ascorbate.
  • Negative Control Buffer: Omit αKG and add 1 mM succinate (competitive inhibitor).
  • Detection Reagents: Anti-5hmC antibody (for TET), Anti-hydroxy-HIF-1α antibody (for PHD), or formaldehyde detection kit (for JmjC).

Methodology:

  • Reaction Setup: In a 50 μL volume, mix 50 nM purified enzyme with 1 μg substrate in reaction buffer. Prepare a negative control with succinate buffer. Incubate at 37°C for 1 hour.
  • Reaction Termination: Add 5 μL of 0.5 M EDTA to chelate Fe(II) and stop the reaction.
  • Product Detection:
    • For TET Enzymes: Transfer reaction to a nitrocellulose membrane via dot blot. Probe with anti-5hmC antibody (1:2000) and quantify using chemiluminescence.
    • For PHD Enzymes: Analyze by Western blot using an antibody specific for hydroxylated HIF-α (Pro564).
    • For JmjC Enzymes: Use a commercial formaldehyde dehydrogenase-coupled assay to measure released formaldehyde spectrophotometrically at 340 nm.
  • Kinetic Analysis: Vary αKG concentration (0-200 μM) while keeping other components constant. Plot initial velocity vs. concentration and fit data to the Michaelis-Menten equation using GraphPad Prism to derive Km and Vmax.
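For readers without GraphPad Prism, the same Michaelis-Menten fit can be approximated with the Python standard library. The sketch below is one possible approach, not the protocol's prescribed method: for each trial Km it solves for the least-squares Vmax in closed form, then narrows a grid search over Km. The data are synthetic (generated from Vmax = 2.0 and Km = 25 µM), and all function names are illustrative.

```python
def mm_rate(s, vmax, km):
    """Michaelis-Menten initial velocity at substrate concentration s."""
    return vmax * s / (km + s)


def best_vmax(km, conc, rates):
    """Closed-form least-squares Vmax for a fixed Km."""
    frac = [s / (km + s) for s in conc]
    return sum(v * f for v, f in zip(rates, frac)) / sum(f * f for f in frac)


def sse(km, conc, rates):
    """Sum of squared residuals at a given Km with its optimal Vmax."""
    vmax = best_vmax(km, conc, rates)
    return sum((v - mm_rate(s, vmax, km)) ** 2 for s, v in zip(conc, rates))


def fit_michaelis_menten(conc, rates, km_lo=0.1, km_hi=1000.0,
                         rounds=6, points=200):
    """Repeatedly scan a Km grid, zooming in around the best value."""
    km = km_lo
    for _ in range(rounds):
        step = (km_hi - km_lo) / points
        grid = [km_lo + i * step for i in range(points + 1)]
        km = min(grid, key=lambda k: sse(k, conc, rates))
        km_lo, km_hi = max(km - step, 1e-9), km + step
    return best_vmax(km, conc, rates), km


# Synthetic, noise-free data over the 0-200 µM αKG range used above.
akg_um = [5, 10, 20, 50, 100, 200]
velocities = [mm_rate(s, 2.0, 25.0) for s in akg_um]
vmax_fit, km_fit = fit_michaelis_menten(akg_um, velocities)
print(f"Vmax = {vmax_fit:.3f}, Km = {km_fit:.2f} µM")
```

With real, noisy data, a dedicated nonlinear regression routine that also reports confidence intervals is preferable to this grid search.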

Protocol 3: Recombinant Protein Expression and Purification for Functional Assays

Objective: To produce and purify active, tag-free αKG/Fe(II)-dependent enzymes for biochemical characterization.

Materials:

  • Expression plasmid (pET-based) with gene of interest fused to a His₆-SUMO tag.
  • E. coli BL21(DE3) competent cells.
  • LB Broth, Kanamycin (50 μg/mL).
  • Induction Agents: 0.5 mM IPTG.
  • Lysis Buffer: 50 mM Tris-HCl (pH 8.0), 500 mM NaCl, 10% glycerol, 5 mM Imidazole, 1 mM TCEP.
  • Purification: Ni-NTA Agarose, ULP1 protease (for SUMO tag removal), Size-exclusion chromatography (SEC) column (Superdex 200).

Methodology:

  • Transformation & Expression: Transform plasmid into BL21(DE3). Grow 1L culture at 37°C to OD₆₀₀ ~0.6. Induce with 0.5 mM IPTG and incubate overnight at 18°C.
  • Cell Lysis: Harvest cells by centrifugation. Resuspend pellet in Lysis Buffer supplemented with protease inhibitors. Lyse by sonication on ice. Clarify lysate by centrifugation at 40,000 x g for 45 min.
  • Immobilized Metal Affinity Chromatography (IMAC): Load supernatant onto a Ni-NTA column pre-equilibrated with Lysis Buffer. Wash with 20 column volumes (CV) of Wash Buffer (Lysis Buffer with 30 mM imidazole). Elute with Elution Buffer (Lysis Buffer with 300 mM imidazole).
  • Tag Cleavage & Removal: Add ULP1 protease (1:100 w/w) to the eluate and dialyze overnight at 4°C against SEC Buffer (20 mM HEPES pH 7.5, 150 mM NaCl, 5% glycerol, 0.5 mM TCEP).
  • Final Purification: Pass the dialyzed sample over the Ni-NTA column again to capture the cleaved His₆-SUMO tag and uncleaved protein. The flow-through contains the tag-free protein. Concentrate and further purify using SEC. Assess purity by SDS-PAGE (>95%).

Visualization

[Diagram: Protein sequence input → HMM domain scan → active site motif (HXD/E...H) prediction → family classification → JmjC KDM (JmjN+JmjC; epigenetic regulation), TET enzyme (Cys-rich C-terminus; DNA demethylation), or prolyl hydroxylase (DSBH fold; hypoxia response) → biochemical validation.]

Title: CATNIP Prediction and Classification Workflow

[Diagram: Substrate (histone/DNA/protein) binds the αKG/Fe(II) enzyme (JmjC/TET/PHD). With the cofactors αKG, O₂, Fe(II), and ascorbate, the dioxygenase reaction yields the hydroxylated/demethylated substrate, succinate, and CO₂. Competitive inhibitors (e.g., succinate, N-oxalylglycine) block the reaction.]

Title: General Catalytic Mechanism and Inhibition

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents for αKG/Fe(II)-Dependent Enzyme Research

Reagent | Function/Application | Example Product/Catalog #
Recombinant Human Enzymes | Positive controls for activity assays and inhibitor screening. | Active KDM4A (BPS Bioscience #50101), TET1 (RayBiotech #230-00163-100).
α-Ketoglutarate (αKG) | Essential co-substrate for enzymatic reactions. Prepare fresh in buffer. | Sigma-Aldrich K2010 (sodium salt).
(NH₄)₂Fe(SO₄)₂·6H₂O | Source of Fe(II) cofactor. Must be prepared anaerobically to prevent oxidation. | Sigma-Aldrich 203505.
Sodium Ascorbate | Reducing agent to maintain Fe(II) in its active state. | Sigma-Aldrich A7631.
N-Oxalylglycine (NOG) | Cell-permeable, broad-spectrum competitive antagonist of αKG. Used as a pan-inhibitor control. | Cayman Chemical 16856.
Anti-5hmC Antibody | Specific detection of TET enzyme product (5-hydroxymethylcytosine) in DNA. | Active Motif #39769.
HIF-1α (Pro564) Hydroxy-Specific Antibody | Detection of PHD enzyme activity on HIF-α substrate. | Cell Signaling Technology #3434.
Formaldehyde Dehydrogenase Assay Kit | Quantifies formaldehyde released by JmjC KDM demethylation reactions. | Sigma-Aldrich MAK228.
His₆-SUMO Tag Vector | Enables high-yield expression and facile purification of tag-free, active enzyme. | pET-His6-SUMO (Addgene #29659).
Size-Exclusion Chromatography Standards | For calibrating SEC columns during protein purification. | Bio-Rad #1511901.

Challenges in Traditional Enzyme Discovery and Identification

The discovery of novel enzymes, particularly within the Fe(II)/α-ketoglutarate (αKG)-dependent dioxygenase superfamily, is pivotal for advancing research in metabolism, drug discovery, and biocatalysis. Traditional methods for enzyme discovery are often slow, biased, and inefficient, which creates bottlenecks. This application note frames these challenges within the context of a broader thesis on the development of the CATNIP tool, which aims to accelerate the in silico prediction and prioritization of αKG/Fe(II)-dependent enzymes.

Key Challenges in Traditional Approaches

Challenge Category | Description | Quantitative Impact
Sequence-Based Screening Bias | Reliance on sequence homology (e.g., BLAST) fails to identify functionally novel enzymes with low sequence similarity to known families. | <30% sequence identity often yields no significant hits, missing potential novel clades.
Functional Assay Throughput | Low-throughput activity screens (spectrophotometric, HPLC) limit the number of candidate genes or environmental samples that can be tested. | Typical microplate-based assays screen 10^2-10^3 variants/week vs. metagenomic libraries containing 10^6-10^9 genes.
Expression & Solubility Issues | Heterologous expression of novel enzymes, especially from extremophiles or with complex cofactor requirements, often leads to insoluble protein or inclusion bodies. | ~40-60% of recombinant prokaryotic proteins express insolubly in E. coli systems.
Cofactor Dependency Screening | In vitro assays must be reconstituted with specific cofactors (FeII, αKG, ascorbate). Incomplete optimization leads to false negatives. | Activity can be reduced by >90% if ascorbate (a reducing agent) is omitted from the reaction buffer.
Metagenomic Analysis Complexity | Functional screening of complex environmental DNA (eDNA) libraries is hampered by host biases, small insert sizes, and low probability of functional expression. | <0.1% of clones in a soil metagenomic library typically show activity on a given substrate.
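The throughput gap in the table is easier to appreciate as arithmetic. The short sketch below uses only the ranges quoted above (10^2-10^3 variants/week, 10^6-10^9 genes, <0.1% active clones); the function names and the 100,000-clone example are illustrative assumptions.

```python
WEEKS_PER_YEAR = 52


def weeks_to_screen(library_size, variants_per_week):
    """Weeks needed to assay every gene once at a fixed weekly throughput."""
    return library_size / variants_per_week


def max_expected_hits(clones_screened, hit_fraction=0.001):
    """Upper bound on active clones if fewer than 0.1% show activity."""
    return clones_screened * hit_fraction


for library_size in (10**6, 10**9):
    for throughput in (10**2, 10**3):
        weeks = weeks_to_screen(library_size, throughput)
        print(f"{library_size:.0e} genes at {throughput} variants/week: "
              f"{weeks:,.0f} weeks (~{weeks / WEEKS_PER_YEAR:,.0f} years)")

print(f"At most ~{max_expected_hits(10**5):.0f} hits per 100,000 clones")
```

Even in the most favorable case (10^6 genes at 10^3 variants/week), exhaustive screening takes 1,000 weeks, which is the motivation for in silico prioritization.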

Protocol 1: Traditional Activity-Guided Purification from Microbial Culture

Objective: To isolate and identify a novel αKG/Fe(II)-dependent enzyme from a native microbial source.

Materials: See Research Reagent Solutions table.

Procedure:

  • Culture & Induction: Grow target organism (e.g., Streptomyces sp.) in appropriate liquid medium at 30°C, 200 rpm. Induce secondary metabolism if required.
  • Cell Lysis: Harvest cells by centrifugation (4,000 x g, 20 min). Resuspend pellet in Lysis Buffer. Lyse via sonication (10 cycles of 30 sec pulse, 59 sec rest on ice) or French press.
  • Crude Extract Preparation: Clarify lysate by centrifugation (16,000 x g, 45 min, 4°C). Retain supernatant (soluble protein fraction).
  • Ammonium Sulfate Precipitation: Gradually add solid (NH4)2SO4 to 70% saturation on ice. Stir for 1 hr. Pellet precipitate by centrifugation (12,000 x g, 30 min).
  • Column Chromatography Series:
    • Size Exclusion: Resuspend pellet in SEC Buffer. Load onto HiPrep Sephacryl S-200 HR column. Elute with isocratic flow. Collect fractions.
    • Anion Exchange: Pool active fractions, dialyze into AIEX Buffer A. Load onto HiTrap Q HP column. Elute with a 0-100% gradient of AIEX Buffer B over 20 column volumes.
    • Hydrophobic Interaction: Adjust pooled active fractions to 1M (NH4)2SO4. Load onto HiTrap Phenyl HP column. Elute with a descending salt gradient.
  • Activity Assay: After each step, assay 50 µL of fraction with 200 µM substrate, 100 µM αKG, 50 µM Fe(NH4)2(SO4)2, 1 mM ascorbate in Assay Buffer. Incubate at 30°C for 30 min. Stop with 10 µL 2M HCl. Analyze product formation by HPLC-MS.
  • Identification: Pool pure active fraction, run on SDS-PAGE. Excise dominant band for tryptic digest and LC-MS/MS protein identification.

Protocol 2: Functional Screening of a Metagenomic Library

Objective: To identify novel enzyme-encoding genes directly from environmental DNA via phenotypic screening.

Procedure:

  • Library Construction: Extract high-molecular-weight eDNA from soil sample. Partially digest, size-fractionate (30-50 kb fragments), and clone into a fosmid or BAC vector. Transform into E. coli EPI300 cells.
  • High-Throughput Replica Screening: Plate transformations on LB agar with appropriate antibiotic. Using a 384-pin replicator, transfer colonies to:
    • Master plate (for archive).
    • Screening plates containing minimal media agar supplemented with target substrate (e.g., a specific alkaloid) as potential sole carbon/nitrogen source or indicator agar.
  • Lysate Preparation: For positive clones, inoculate 96-deep-well plates with 1 mL TB medium per well. Grow to saturation, pellet cells, and lyse with BugBuster Master Mix.
  • Microplate Activity Assay: In a 96-well plate, combine 30 µL lysate supernatant, 70 µL of Assay Mix (as in Protocol 1, step 6). Incubate 1 hr at 30°C. Measure absorbance/fluorescence specific to product formation.
  • Hit Validation & Sequencing: Re-test positive clones from primary screen. Isolate fosmid/BAC DNA from validated hits and perform end-sequencing or full insert sequencing.

Visualizations

[Diagram: Traditional discovery workflow: sample collection (environmental/organism) → activity-based screening (low-throughput assays) → protein purification (multi-step chromatography) → protein identification (Edman/MS) → gene cloning & sequencing. Bottlenecks: high cost and time (at screening), expression bias (at purification), sequence bias (at identification).]

Title: Traditional Enzyme Discovery Workflow and Bottlenecks

[Diagram: CATNIP tool input → HMMER3 search (vs. custom αKG FeII HMMs) → structural feature prediction (Jpred, AlphaFold2) → cofactor binding site analysis (metal & αKG motif check) → phylogenetic dispersion → priority-ranked list of candidate enzymes.]

Title: CATNIP Tool Prediction and Prioritization Pipeline

Research Reagent Solutions

Item | Function in Protocol
Fe(NH4)2(SO4)2·6H2O | Source of ferrous iron (FeII), the essential redox-active cofactor.
Sodium Ascorbate | Reducing agent to maintain FeII in its active state and prevent oxidation.
α-Ketoglutaric Acid | Essential co-substrate; undergoes oxidative decarboxylation during reaction.
HEPES Buffer (pH 7.0) | Non-coordinating buffer preferred for metalloenzyme assays.
BugBuster Master Mix | Reagent for rapid, mild lysis of E. coli in high-throughput screens.
HiTrap Column Series | Pre-packed chromatography columns for fast protein purification (IEX, HIC).
pCC1FOS / CopyControl Vector | Fosmid vector for stable, single-copy maintenance of large eDNA inserts.
E. coli EPI300 | Strain optimized for large fosmid/BAC replication and stability.

Within the broader thesis on computational enzymology, CATNIP is introduced as a dedicated in silico tool to address the substrate prediction challenge for α-ketoglutarate (αKG/2OG)-dependent non-heme iron (Fe(II)) enzymes. This diverse superfamily catalyzes hydroxylation, halogenation, and ring formation reactions critical in natural product biosynthesis and human biology. Accurately predicting their native substrates from sequence or structure alone remains a significant bottleneck. CATNIP integrates machine learning with biophysical simulations to bridge this "prediction gap," enabling researchers to annotate novel enzymes and engineer biocatalysts for drug development.

Application Notes & Key Data

Table 1: CATNIP Performance Benchmark Against Prior Tools

Metric | CATNIP (v1.2) | BLAST-Based Annotation | Structure-Based Docking (AutoDock Vina)
Prediction Accuracy (Top-1) | 89.3% | 47.1% | 62.5%
False Positive Rate | 5.2% | 28.6% | 22.4%
Avg. Runtime per Prediction | 12 min | 2 sec | 45 min
Key Input Requirement | Sequence + (optional) SAXS data | Sequence only | High-resolution structure
Primary Strength | Integrated functional motif & binding pocket dynamics | Speed | Visual interaction mapping

Table 2: Experimental Validation of CATNIP Predictions for Putative Oxygenases

Enzyme (UniProt ID) | CATNIP-Predicted Primary Substrate | Experimental Assay Result (Product Identified) | Km (μM) | kcat (s⁻¹)
HypX (Q8ABC4) | L-pros-methyl ester | L-pros-methyl ester hydroxylase | 12.4 ± 2.1 | 1.05 ± 0.11
NovO (P0C1B9) | 4-hydroxyphenylpyruvate | Halogenase (Chlorination) | 8.7 ± 1.5 | 0.78 ± 0.09
Putative OG-FeII_1 (A0A1B2C3D4) | Flavone | Flavone 3-hydroxylase | 21.3 ± 3.8 | 0.31 ± 0.05
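The Km and kcat values in the table can be combined into the catalytic efficiency kcat/Km, a common single-number comparison between enzymes. The sketch below uses the central values from the table and ignores the reported uncertainties; the function and variable names are illustrative.

```python
# Central values from the validation table: (Km in µM, kcat in s^-1).
kinetics = {
    "HypX": (12.4, 1.05),
    "NovO": (8.7, 0.78),
    "Putative OG-FeII_1": (21.3, 0.31),
}


def catalytic_efficiency(km_um, kcat_per_s):
    """Return kcat/Km in M^-1 s^-1, converting Km from µM to M."""
    return kcat_per_s / (km_um * 1e-6)


efficiency = {
    name: catalytic_efficiency(km, kcat) for name, (km, kcat) in kinetics.items()
}
ranked = sorted(efficiency, key=efficiency.get, reverse=True)
for name in ranked:
    print(f"{name}: kcat/Km = {efficiency[name]:.2e} M^-1 s^-1")
```

By this measure NovO and HypX are similar (both near 10^5 M^-1 s^-1 in order of magnitude), while the putative flavone hydroxylase is several-fold less efficient.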

Detailed Experimental Protocols

Protocol 1: In Vitro Validation of CATNIP-Predicted Substrate

Objective: To biochemically validate the top substrate prediction for a putative αKG-Fe(II) oxygenase.

Materials: Purified enzyme, predicted substrate, α-ketoglutarate, Fe(II) ammonium sulfate, L-ascorbic acid, catalase, reaction buffer (50 mM HEPES, pH 7.5), quenching solution (1% formic acid in MeOH), UPLC-MS system.

Procedure:

  • Reaction Setup: In a 100 μL reaction volume, combine:
    • 50 mM HEPES buffer (pH 7.5)
    • 100 μM enzyme
    • 200 μM predicted substrate (from 10 mM DMSO stock)
    • 1 mM α-ketoglutarate
    • 100 μM Fe(II) ammonium sulfate (freshly prepared)
    • 2 mM L-ascorbic acid
    • 100 U/mL catalase
  • Incubation: Add the Fe(II) last to initiate the reaction. Incubate at a relevant temperature (e.g., 30°C) for 30 minutes.
  • Quenching: Add 100 μL of ice-cold quenching solution. Vortex and incubate on ice for 10 min.
  • Precipitation: Centrifuge at 16,000 x g for 15 min at 4°C.
  • Analysis: Transfer supernatant for UPLC-MS analysis. Use a C18 column and a gradient of water/acetonitrile (+0.1% formic acid). Monitor for substrate consumption and product formation via MS/MS, comparing to negative controls (no enzyme, no Fe(II), or no αKG).
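Setting up the reaction requires converting each final concentration into a pipetting volume (C1V1 = C2V2). The sketch below does this for the 100 μL reaction above. Only the 10 mM substrate stock comes from the protocol; the other stock concentrations are assumptions chosen for illustration, and enzyme and catalase are left out because their stock concentrations are not given.

```python
def stock_volume_ul(final_conc_um, stock_conc_um, final_volume_ul):
    """Volume of stock needed so the final mix reaches final_conc_um."""
    return final_conc_um * final_volume_ul / stock_conc_um


FINAL_VOLUME_UL = 100.0

# name: (final concentration in µM, stock concentration in µM)
components = {
    "predicted substrate": (200.0, 10_000.0),       # 10 mM DMSO stock (from protocol)
    "alpha-ketoglutarate": (1_000.0, 20_000.0),     # assumed 20 mM stock
    "Fe(II) ammonium sulfate": (100.0, 2_000.0),    # assumed 2 mM stock
    "L-ascorbic acid": (2_000.0, 40_000.0),         # assumed 40 mM stock
}

volumes = {
    name: stock_volume_ul(final, stock, FINAL_VOLUME_UL)
    for name, (final, stock) in components.items()
}
# Volume left for buffer, enzyme, and catalase.
remaining_ul = FINAL_VOLUME_UL - sum(volumes.values())
# Final DMSO content contributed by the substrate stock.
dmso_percent = 100.0 * volumes["predicted substrate"] / FINAL_VOLUME_UL

for name, volume in volumes.items():
    print(f"{name}: {volume:.1f} µL")
print(f"Remaining for buffer, enzyme, catalase: {remaining_ul:.1f} µL")
print(f"Final DMSO: {dmso_percent:.1f}% (v/v)")
```

Checking the final DMSO percentage is worthwhile because organic co-solvent can reduce the activity of some enzymes.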

Protocol 2: CATNIP-Assisted Site-Directed Mutagenesis for Altered Selectivity

Objective: To rationally alter enzyme regioselectivity based on CATNIP's binding pocket analysis.

Procedure:

  • Hotspot Identification: Run CATNIP's "Pocket Dynamics" module on your enzyme of interest. Identify residues within 5Å of the predicted substrate's proposed modification site.
  • Residue Scoring: CATNIP outputs a "Selectivity Influence Score" (SIS) for each pocket residue. Select 2-3 residues with the highest SIS for mutagenesis.
  • Mutant Design: Design primers to mutate selected residues to alanine (loss-of-function) or to residues with contrasting chemical properties (e.g., Asp to Lys).
  • Expression & Purification: Express and purify mutant proteins as per standard protocols for the wild-type enzyme.
  • Kinetic Profiling: Perform kinetic assays (as in Protocol 1) with the wild-type substrate and potential alternative substrates. Calculate new kinetic parameters (Km, kcat) to quantify altered selectivity.

Diagrams (Generated via Graphviz DOT Language)

[Diagram: Input (enzyme sequence/SAXS data) → CATNIP computational pipeline: (1) motif & fold recognition, (2) binding pocket geometry prediction, (3) molecular dynamics & docking simulation, (4) machine learning scoring & ranking → output (ranked list of probable substrates) → experimental validation (in vitro assay) → validated function (annotation/engineering).]

Title: CATNIP Workflow from Sequence to Validated Function

[Diagram: Substrate (RH) binds the enzyme-Fe(II)-αKG complex → O₂ activation (decarboxylation) generates the Fe(IV)=O (oxyl radical) intermediate along with succinate + CO₂ → H-atom abstraction and radical rebound give the product (ROH).]

Title: Core Catalytic Cycle of αKG-Fe(II) Oxygenases

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for CATNIP-Guided Research

Reagent/Material | Function in Research | Key Consideration
Fe(II) Ammonium Sulfate [(NH₄)₂Fe(SO₄)₂·6H₂O] | Source of catalytically essential ferrous iron. | Prepare fresh in degassed, acidic water to prevent oxidation to Fe(III).
Sodium Ascorbate / L-Ascorbic Acid | Reducing agent to maintain iron in Fe(II) state. | Include in all assay buffers; concentration typically 1-5 mM.
Catalase (from bovine liver) | Scavenges deleterious H₂O₂ generated by uncoupled reaction cycles. | Critical for improving coupling efficiency and yield.
α-Ketoglutarate (Sodium Salt) | Essential co-substrate; provides the oxidizing equivalent for O₂ activation. | Use in excess (typically 1-5 mM) relative to primary substrate.
Deuterated Solvents (e.g., D₂O, CD₃OD) | For NMR-based assays to monitor reaction progress and regioselectivity. | Enables direct observation of hydroxylation sites.
HisTrap HP Column (Ni Sepharose) | Standardized purification of His-tagged recombinant αKG-Fe(II) enzymes. | Ensures high-purity, active enzyme for kinetic studies.
Quenching Solution (1% Formic Acid in MeOH) | Rapidly stops enzymatic reaction, denatures protein, and prepares samples for LC-MS. | Acidification stabilizes labile products and prevents non-enzymatic oxidation.

Application Notes

The CATNIP tool is a novel computational framework designed to identify and classify members of the alpha-ketoglutarate (α-KG) and Fe(II)-dependent dioxygenase superfamily. This superfamily is pivotal in diverse biological processes, including hypoxic sensing, epigenetic regulation, and collagen biosynthesis, making it a high-value target for therapeutic intervention in cancer, anemia, and fibrosis. The core algorithm of CATNIP uses highly conserved patterns of active-site residues to predict novel family members and infer potential function, even in the absence of high overall sequence homology.

The algorithm operates on the principle that while primary sequences within this superfamily may diverge, the three-dimensional spatial arrangement of catalytic residues—the "active-site signature"—is preserved. This signature includes the canonical His-X-Asp...His motif that coordinates the Fe(II) ion, along with residues responsible for α-KG and substrate binding. CATNIP employs a structural bioinformatics pipeline to extract these conserved patterns from known crystal structures, creates a probabilistic model of their spatial relationships, and scans proteomic data to identify proteins containing matching topologies.

Recent validation studies, integrating data from AlphaFold2 structural predictions and metagenomic sequencing, support CATNIP's accuracy. The tool recovers previously annotated enzymes with >98% sensitivity (Table 1) and has flagged numerous hypothetical proteins as putative novel dioxygenases, expanding the known landscape of this functionally diverse family.

Table 1: CATNIP Algorithm Performance Metrics (Validation Study)

Metric Value Description
Sensitivity 98.2% Proportion of known α-KG/Fe(II) dioxygenases correctly identified.
Specificity 99.7% Proportion of proteins from unrelated families correctly rejected.
Novel Predictions 347 Number of uncharacterized proteins flagged as high-confidence family members in the human proteome.
Avg. Runtime 4.7 min/proteome Time to scan a standard eukaryotic proteome (∼20k proteins).
Global Sequence Identity Requirement < 25% Accurate predictions are retained even when overall sequence identity to known members is below 25%.

Protocols

Protocol 1: Active-Site Signature Compilation and Model Building

Objective: To construct the conserved residue pattern model that serves as the primary search query for CATNIP.

Materials:

  • Research Reagent Solutions & Essential Materials:
    • PDB Database: Source of high-resolution crystal structures (≤2.2 Å) of confirmed α-KG/Fe(II) dependent dioxygenases (e.g., PHD2, ALKBH5, collagen prolyl hydroxylase).
    • Multiple Sequence Alignment (MSA) Tool: ClustalOmega or MAFFT for aligning homologous sequences.
    • Molecular Visualization Software: PyMOL or UCSF Chimera for active-site residue identification and distance measurements.
    • Scripting Environment: Python 3.8+ with Biopython and NumPy libraries for data processing.

Procedure:

  • Curate a Non-Redundant Structure Set: Compile a list of known enzymes from the target superfamily. Download all available structures from the Protein Data Bank (PDB). Filter to remove duplicates and retain only the highest-resolution structure for each unique enzyme.
  • Identify Canonical Residues: For each structure, manually or via script, identify the atoms of the catalytic Fe(II), the coordinating residues (typically two histidines and one aspartate), and key α-KG binding residues (often an arginine and a serine/threonine).
  • Measure Spatial Relationships: Calculate all pairwise distances between the Cα (or relevant functional) atoms of the identified conserved residues. Record distances in Angstroms.
  • Build Probabilistic Model: Compile all distance measurements. For each residue pair, calculate the mean distance and standard deviation. The final model is a set of residues (nodes) with a matrix of expected distances and tolerances (edges) between them.
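
The distance-measurement and model-building steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CATNIP's actual implementation: the function names, the residue labels, and the coordinates are hypothetical, and the standard library is used in place of NumPy to keep the example dependency-free.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def pairwise_distances(coords):
    """Return {(res_a, res_b): distance in Å} for every residue pair."""
    return {
        (a, b): math.dist(coords[a], coords[b])
        for a, b in combinations(sorted(coords), 2)
    }


def build_distance_model(structures):
    """Aggregate per-structure distances into {pair: (mean, std)}."""
    collected = {}
    for coords in structures:
        for pair, distance in pairwise_distances(coords).items():
            collected.setdefault(pair, []).append(distance)
    return {
        pair: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for pair, values in collected.items()
    }


# Hypothetical Cα coordinates (Å) for two structures; not real PDB data.
structures = [
    {"His1": (0.0, 0.0, 0.0), "Asp": (5.0, 0.0, 0.0), "His2": (5.0, 8.0, 0.0)},
    {"His1": (0.0, 0.0, 0.0), "Asp": (5.4, 0.0, 0.0), "His2": (5.4, 8.2, 0.0)},
]
model = build_distance_model(structures)
for pair, (mu, sigma) in sorted(model.items()):
    print(pair, round(mu, 2), round(sigma, 2))
```

In practice the coordinates would be read from the curated PDB files (for example with Biopython), and each residue pair would become an edge carrying its mean distance and tolerance.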

Protocol 2: Proteome-Wide Scanning with CATNIP

Objective: To apply the CATNIP model to a full proteome for the identification of novel α-KG/Fe(II) dependent dioxygenases.

Materials:

  • Research Reagent Solutions & Essential Materials:
    • Target Proteome: FASTA file of protein sequences for the organism of interest (e.g., from UniProt).
    • CATNIP Software Suite: Locally installed command-line tool or web server access.
    • Pre-computed AlphaFold2 Models: (Optional but recommended) Database of predicted structures (e.g., from AlphaFold Protein Structure Database) corresponding to the target proteome.
    • High-Performance Computing (HPC) Cluster: Recommended for large proteomes or metagenomic datasets.

Procedure:

  • Input Preparation: Format the target proteome FASTA file. Ensure the CATNIP active-site model file is in the correct directory.
  • Execute Scan: Run the CATNIP command: catnip_scan --proteome target.fasta --model active_site_model.catnip --output predictions.tsv.
  • Analysis of Results: The output file (predictions.tsv) will list proteins ranked by a CATNIP score (0-1). Proteins scoring above a defined threshold (typically >0.85) are considered high-confidence hits. For these hits, review the aligned residue positions against the canonical model.
  • Structural Validation: For high-confidence hits, fetch or generate a 3D structure (e.g., from AlphaFold DB). Visually inspect the predicted spatial arrangement of the CATNIP-identified residues to confirm the conserved active-site topology.
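
Selecting the high-confidence hits from the output table might look like the sketch below. The column names `protein_id` and `catnip_score` are assumptions made for illustration (the layout of predictions.tsv is not specified here); only the 0.85 threshold comes from the protocol.

```python
import csv
import io

THRESHOLD = 0.85  # high-confidence cutoff quoted in the protocol


def high_confidence_hits(handle, threshold=THRESHOLD):
    """Return rows scoring above the threshold, best score first."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [row for row in reader if float(row["catnip_score"]) > threshold]
    return sorted(hits, key=lambda row: float(row["catnip_score"]), reverse=True)


# Stand-in for open("predictions.tsv"); the column names are hypothetical.
example = io.StringIO(
    "protein_id\tcatnip_score\n"
    "P1\t0.91\n"
    "P2\t0.40\n"
    "P3\t0.97\n"
)
hits = high_confidence_hits(example)
print([hit["protein_id"] for hit in hits])
```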

Diagrams

[Diagram: CATNIP algorithm workflow. Model building: PDB structures (known enzymes) → 1. Active-site residue extraction → 2. Spatial distance measurement → 3. Probabilistic model creation → active-site residue pattern model. Scanning: the model and the target proteome (FASTA) feed 4. Sequence-structure mapping → 5. Topology scoring → 6. Hit ranking & validation → high-confidence novel enzyme predictions.]

Title: CATNIP Algorithm Workflow

[Diagram: conserved active-site topology model. Fe(II) is coordinated by His 1, Asp, and His 2 (Fe ligands). Modeled distances: His 1 to Asp, d = 5.2 ± 0.3 Å; Asp to His 2, d = 8.1 ± 0.5 Å; Arg to Ser/Thr (both α-KG binding), d = 7.5 ± 0.6 Å.]

Title: Conserved Active-Site Topology Model

Step-by-Step Guide: How to Use CATNIP for Gene Cluster Analysis and Prediction

Alpha-ketoglutarate (alpha-KG) and Fe(II)-dependent enzymes, including Jumonji-C domain-containing histone demethylases (KDMs) and prolyl hydroxylases (PHDs), are critical therapeutic targets in oncology and other diseases. The CATNIP (Computational Analysis Toolkit for Natural Product-Inspired Predictions) tool enables researchers to predict substrates, inhibitors, and binding modes for this enzyme class. This protocol outlines the two primary access methods for CATNIP, framed within a broader thesis on advancing ligand discovery for these targets.

Access Options: Comparative Analysis

Table 1: Comparison of CATNIP Access Methods

Feature Web Server Local Installation
Access URL https://catnip.cmdm.tw N/A (Localhost)
System Requirements Modern Web Browser Linux/Unix, 8GB+ RAM, 50GB+ Disk
Setup Complexity None (Instant) High (Requires dependencies)
Data Privacy Medium (Uploaded data transient) High (Complete local control)
Processing Speed Subject to queue/network High (Dedicated resources)
Cost Free for academic use Free, but requires hardware
Best For Single queries, quick checks High-throughput screening, proprietary data
Updates Automatic Manual (User must upgrade)
Key Dependency Internet connection Conda, Docker, Python 3.8+, RDKit, PyTorch

Detailed Access Protocols

Protocol A: Accessing the CATNIP Web Server

Objective: To perform a single prediction for a novel compound against a target alpha-KG/Fe(II) enzyme (e.g., KDM4A) using the public web server.

Materials (Research Reagent Solutions):

  • Query Compound: SMILES string or SDF/MOL2 file of the candidate molecule.
  • Target Enzyme PDB ID: (e.g., 2OQ6 for KDM4A).
  • Workstation: Computer with internet access and a web browser (Chrome/Firefox recommended).

Procedure:

  • Navigate to the CATNIP web server at https://catnip.cmdm.tw.
  • On the submission page, input the SMILES string of your compound or upload the molecular file.
  • In the Target Selection field, specify the PDB ID of the enzyme or upload a custom protein structure file.
  • Configure calculation parameters:
    • Set Docking Mode to Flexible.
    • Set Number of Poses to 20.
    • Enable Alpha-KG Cofactor Placement.
  • Submit the job by clicking Run CATNIP. You will receive a job ID.
  • Monitor job status on the Queue page. Typical runtime is 15-45 minutes.
  • Upon completion, download the results package containing:
    • Predicted binding pose (PDB format).
    • Interaction fingerprint report.
    • Predicted binding affinity (ΔG in kcal/mol).

Protocol B: Local Installation of CATNIP

Objective: To install CATNIP locally for batch processing of a compound library against multiple Fe(II)-dependent enzyme targets.

Materials (Research Reagent Solutions):

  • Hardware: Linux server (Ubuntu 20.04 LTS+) with minimum 8-core CPU, 16GB RAM, 100GB SSD.
  • Software Dependencies: Miniconda/Anaconda, Docker Engine (optional but recommended).
  • Data: Local database of compound structures (SDF), library of enzyme structures (PDB).

Procedure:

  • Prerequisite Installation:

  • Clone CATNIP Repository:

  • Install via Docker (Recommended):

  • Alternative Installation via Conda:

  • Validate Installation:

  • Run Batch Prediction: Prepare a CSV job file (batch_job.csv) with columns: compound_id, smiles, target_pdb. Execute:

Experimental Workflow Visualization

[Diagram: start with the research question (identify novel KDM inhibitor) → access method decision → web server path (single molecule, public data) or local installation path (large library, proprietary data) → input preparation (SMILES, target PDB) → submit job via browser (queue) or run batch script (CLI) → analysis of results (poses, affinity, interactions) → integrate into thesis: validate prediction with experimental assay.]

Title: CATNIP Access Decision & Research Workflow

Key Research Reagent Solutions for Validation Experiments

Table 2: Essential Materials for Validating CATNIP Predictions

Item Name Function in Alpha-KG/Fe(II) Enzyme Research Example/Supplier
Recombinant Enzyme Purified target protein for binding/activity assays. KDM4A (BPS Bioscience, #50107)
Alpha-KG Cofactor Essential co-substrate for enzymatic reaction. α-Ketoglutaric acid disodium salt (Sigma-Aldrich, #75890)
Ascorbic Acid Reductant to maintain Fe(II) in active state. L-Ascorbic acid (Sigma-Aldrich, #A4544)
(NH₄)₂Fe(SO₄)₂·6H₂O Source of Fe(II) ions for reconstituting active enzyme. Ferrous ammonium sulfate (Sigma-Aldrich, #F1543)
Fluorometric Assay Kit Measure demethylase/hydroxylase activity (e.g., via formaldehyde detection). JMJD Assay Kit (Cayman Chemical, #600170)
Crystallization Screen For obtaining protein-ligand complex structures to validate poses. Morpheus HT-96 Screen (Molecular Dimensions, MD1-46)
HDAC/Non-Jumonji Control Enzyme control to assess selectivity of predicted compounds. HDAC1 (BPS Bioscience, #50051)
Reference Inhibitor Positive control for inhibition assays (e.g., IOX1 for KDMs). IOX1 (MedChemExpress, #HY-13918)

Accurate prediction of Alpha-Ketoglutarate (AKG) Fe(II)-dependent enzymes, a diverse superfamily including prolyl hydroxylases, lysine demethylases, and nucleic acid demethylases, is crucial for understanding cellular metabolism, hypoxia signaling, and epigenetic regulation. The CATNIP (Computational Analysis Tool for Non-heme Iron Proteins) framework leverages machine learning to identify and characterize these enzymes from genomic data. The precision of CATNIP's predictions is fundamentally dependent on the quality and appropriateness of the input data—primarily genomic sequences and their associated protein identifiers. This protocol details the standardized acquisition, validation, and formatting of these inputs to ensure reproducible and high-confidence results for researchers and drug development professionals targeting these enzymes for therapeutic intervention.

Sourcing and Validating Genomic Data

2.1 Primary Data Sources

Current genomic and proteomic data should be retrieved from authoritative, regularly updated repositories. Key sources include:

Table 1: Primary Genomic & Proteomic Data Sources

Repository Name Data Type Primary Use Update Frequency
NCBI RefSeq Genomic DNA, Protein Gold-standard reference sequences for well-annotated organisms. Daily
Ensembl Genomic DNA, Protein Comprehensive annotation for eukaryotic genomes, including alternative transcripts. ~1-2 months
UniProtKB (Swiss-Prot) Protein Sequences & IDs Expertly curated, non-redundant protein sequences with high-quality functional annotation. Weekly
UniProtKB (TrEMBL) Protein Sequences & IDs Computationally annotated supplement to Swiss-Prot. Weekly

2.2 Protocol: Retrieving and Validating a Target Gene Set

  • Objective: Obtain a high-confidence set of protein-coding sequences for AKG-dependent enzyme prediction.
  • Steps:
    • Define Organism(s): Identify the target organism(s) (e.g., Homo sapiens, Mus musculus).
    • Acquire Reference Proteome: Download the "Reference Proteome" or "Complete Proteome" FASTA file from UniProtKB. Filter for the "Swiss-Prot" subset to ensure curated entries.
    • Cross-Reference Identifiers: Extract the list of protein identifiers (e.g., UniProt accession P13674 with entry name P4HA1_HUMAN, RefSeq NP_000848.1). Use the NCBI E-utilities API or the Biopython Entrez module to retrieve the corresponding genomic loci and nucleotide sequences.
    • Sequence Quality Check: Verify sequences for:
      • Absence of ambiguous or non-standard amino acid codes (e.g., 'X', 'J', 'O'). Replace these or flag the sequence for exclusion.
      • Correct length (no premature stop codons in the mature peptide unless disease-related).
      • Presence of characteristic residue patterns (e.g., the HXD...H Fe(II)-binding motif).
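
The three checks above can be sketched as a single function. This is an illustrative sketch only: the regular expression is one possible encoding of the HXD...H pattern, the 15-40 residue gap is an assumed parameter that should be tuned for the family of interest, and the `*` stop symbol assumes translated sequences that mark stops explicitly.

```python
import re

AMBIGUOUS_CODES = set("XJO")  # residue codes named in the quality check
# H, any residue, D, then an assumed gap of 15-40 residues before the distal H.
FE_MOTIF = re.compile(r"H.D.{15,40}H")


def check_sequence(seq):
    """Report ambiguous codes, internal stops, and Fe(II)-motif presence."""
    seq = seq.upper()
    return {
        "ambiguous": sorted(set(seq) & AMBIGUOUS_CODES),
        "internal_stop": "*" in seq.rstrip("*"),
        "has_motif": FE_MOTIF.search(seq) is not None,
    }


# Toy sequences for illustration; not real proteins.
good = "M" + "HAD" + "G" * 20 + "H" + "K"
bad = "MXHADK*LL"
print(check_sequence(good))
print(check_sequence(bad))
```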

Standardizing Protein Identifiers for CATNIP Input

CATNIP requires a consistent identifier schema to map predictions back to functional annotation databases.

Table 2: Essential Protein Identifier Types and Handling

Identifier Type Format Example Purpose in CATNIP Conversion Tool/Action
UniProtKB Accession P13674 (primary), P4HA1_HUMAN Primary input key; links to functional annotation. Retain as primary ID.
RefSeq Protein ID NP_000848.1 Confirms genomic context and alignment. Map via UniProtKB cross-reference table.
Gene Symbol P4HA1 User-friendly reporting and pathway analysis. Map via HGNC (human) or ortholog databases.
Ensembl Protein ID ENSP00000367469 Links to genomic variants and structures. Map via BioMart or cross-reference files.

3.1 Protocol: Creating a Unified Identifier Mapping Table

  • Start with your list of UniProtKB accessions.
  • Use the UniProtKB ID Mapping service (available via web or programmatically) to retrieve all corresponding identifiers (RefSeq, Ensembl, Gene Symbol).
  • Export results into a tab-delimited mapping table. This table is critical for interpreting CATNIP outputs in downstream analyses.
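
Once the identifiers have been retrieved, writing the tab-delimited mapping table is straightforward. The sketch below assumes the mappings are already held in memory as dictionaries; the column names are an assumption, and the single record reuses the example identifiers from Table 2.

```python
import csv
import io

# Assumed column layout for the mapping table.
COLUMNS = ["UniProt_AC", "Gene_Symbol", "RefSeq_NP", "Ensembl_Protein_ID"]


def write_mapping_table(records, handle):
    """Write one tab-delimited row per protein, leaving missing IDs blank."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow({column: record.get(column, "") for column in COLUMNS})


records = [
    {
        "UniProt_AC": "P13674",
        "Gene_Symbol": "P4HA1",
        "RefSeq_NP": "NP_000848.1",
        "Ensembl_Protein_ID": "ENSP00000367469",
    }
]
buffer = io.StringIO()  # stand-in for open("id_mapping.tsv", "w", newline="")
write_mapping_table(records, buffer)
print(buffer.getvalue())
```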

Preparing Sequence Files for CATNIP Analysis

4.1 Final Data Preparation Workflow

The following diagram illustrates the end-to-end workflow for preparing input data for the CATNIP tool.

[Diagram: define target organism(s) → retrieve reference proteome (UniProtKB) → validate & filter sequences → generate identifier mapping table → format FASTA headers for CATNIP → CATNIP input ready. Critical checks during validation: check length; remove ambiguous residues (X, J, O); scan for HXD...H motif (optional).]

Data Preparation Workflow for CATNIP

4.2 Protocol: Formatting the Input FASTA File

CATNIP accepts a standard FASTA file with specifically formatted headers to integrate identifier mapping.

  • Start with the validated, filtered protein sequences in FASTA format.
  • Modify the FASTA header line to include key identifiers, separated by pipes (|): >primary_id|gene_symbol|description Example: >P4HA1_HUMAN|P4HA1|Prolyl 4-hydroxylase subunit alpha-1
  • Save the final file in plain text format (e.g., candidate_proteome.fasta).
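
The header convention above can be applied programmatically. The following is a minimal sketch with hypothetical function names; the sequence shown is a placeholder rather than the real P4HA1 sequence, and each sequence is written on a single line.

```python
def format_header(primary_id, gene_symbol, description):
    """Build a header of the form >primary_id|gene_symbol|description."""
    fields = (primary_id, gene_symbol, description)
    if any("|" in field for field in fields):
        raise ValueError("fields must not contain the '|' delimiter")
    return ">" + "|".join(fields)


def to_fasta(records):
    """Render (id, gene, description, sequence) tuples as FASTA text."""
    lines = []
    for primary_id, gene_symbol, description, sequence in records:
        lines.append(format_header(primary_id, gene_symbol, description))
        lines.append(sequence)  # one line per sequence, no internal breaks
    return "\n".join(lines) + "\n"


# Placeholder sequence for illustration only.
records = [
    ("P4HA1_HUMAN", "P4HA1", "Prolyl 4-hydroxylase subunit alpha-1", "MKTAYIAK"),
]
fasta_text = to_fasta(records)
print(fasta_text)
```

Rejecting fields that already contain a pipe keeps the header unambiguous when it is later split back into its three parts.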

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Genomic Data Preparation

Item / Reagent Solution Function / Purpose Example / Specification
Biopython Package Programmatic access to NCBI, UniProt, and local sequence parsing/manipulation. Bio.Entrez, Bio.SeqIO, Bio.SeqUtils modules.
UniProtKB ID Mapping Tool Resolves cross-database protein identifier mappings in batch. UniProt REST API endpoint: https://rest.uniprot.org/idmapping
Sequence Analysis Suite Local validation, motif scanning, and sequence statistics. EMBOSS pepstats, or custom Python scripts using regular expressions for motif finding (e.g., H.D.{15,40}H).
Reference Proteome FASTA High-quality, non-redundant starting set of protein sequences. Downloaded from: UniProt > Proteomes > "[Your Organism] Reference Proteome".
Identifier Mapping Table Master lookup table linking all database IDs for the target proteins. A TSV file with columns: UniProt_AC, Gene_Symbol, RefSeq_NP, Ensembl_Protein_ID.
CATNIP-Formatted FASTA File The final, validated input file for enzyme prediction. Headers formatted as `>ID|Gene|Description`; no line breaks within sequences.

Robust input data preparation forms the foundation of reliable in silico predictions with the CATNIP tool. By adhering to these protocols for sourcing genomic sequences, standardizing protein identifiers, and rigorously validating input data, researchers can ensure that subsequent predictions of AKG Fe(II)-dependent enzymes are accurate, interpretable, and directly linkable to existing biological knowledge—accelerating hypothesis generation and target prioritization in drug discovery.

In the context of predicting and characterizing alpha-ketoglutarate (α-KG)/Fe(II)-dependent dioxygenases using the CATNIP (Computational Analysis Tool for Non-heme Iron Protein) platform, the configuration of search parameters and filters is a critical initial step. This protocol details the methodology for running a predictive search, focusing on the precise tuning of inputs to maximize the accuracy and relevance of results for research in epigenetics, hypoxia signaling, and collagen biosynthesis—key areas for therapeutic targeting.

Core Search Parameter Configuration

The primary search identifies potential α-KG/Fe(II)-dependent enzyme sequences from genomic or metagenomic databases. Parameters must be set based on conserved catalytic motifs and structural features.

Table 1: Essential CATNIP Search Parameters for α-KG/Fe(II)-Dependent Enzyme Prediction

Parameter Recommended Setting Rationale & Impact
Primary Motif HXD/E...H (JmjC-domain) or D/E...H (DSBH fold) Targets the conserved Fe(II)-binding residues. Narrow setting reduces false positives.
E-value Threshold 1e-10 Balances sensitivity and specificity for distant homology detection.
Sequence Length Filter 250-400 amino acids (for JmjC) Excludes truncated sequences and unrelated large protein families.
Secondary Structure Prediction β-strand-rich (DSBH fold) inclusion Uses tools like PSIPRED to filter for the conserved double-stranded beta-helix core.
Co-factor Binding Residues Filter for Arg/Lys near active site Ensures potential for α-KG binding via residue charge complementarity.

Advanced Filtering Protocol

Post-initial search, advanced filters are applied to prioritize enzymes with predicted functional relevance.

Experimental Protocol 2.1: Substrate Pocket Profiling Filter

  • Input: Candidate sequence list from primary CATNIP search.
  • Method: Run each candidate through the FPocket algorithm to detect potential binding pockets.
  • Filter Criteria:
    • Identify the largest pocket adjacent to the predicted HXD/E...H motif.
    • Calculate the electrostatic potential of this pocket using APBS.
    • Retain candidates where the pocket exhibits a strong positive charge patch (for binding negatively charged substrates like histone peptides or DNA/RNA).
  • Output: A refined list of candidates with high probability of functional substrate binding.

Table 2: Quantitative Filtering Metrics and Thresholds

Filtering Stage Tool/Algorithm Key Metric Acceptance Threshold
Primary Search HMMER (via CATNIP) Sequence E-value ≤ 1e-10
Structure Refinement HHpred Template Modeling (TM)-score ≥ 0.5
Pocket Analysis FPocket Druggability Score ≥ 0.5
Electrostatics APBS (via PDB2PQR) Pocket Surface Potential (kT/e) ≥ +5
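
The thresholds in Table 2 lend themselves to a simple declarative filter. The sketch below is illustrative: the dictionary keys and the candidate values are hypothetical, and the metrics are assumed to have been computed beforehand by the respective tools.

```python
import operator

# Acceptance thresholds from Table 2, keyed by assumed metric names.
THRESHOLDS = {
    "e_value": (operator.le, 1e-10),         # primary search
    "tm_score": (operator.ge, 0.5),          # structure refinement
    "druggability": (operator.ge, 0.5),      # pocket analysis
    "pocket_potential": (operator.ge, 5.0),  # electrostatics, kT/e
}


def failed_filters(candidate):
    """Return the names of the metrics that miss their threshold."""
    return [
        name
        for name, (compare, cutoff) in THRESHOLDS.items()
        if not compare(candidate[name], cutoff)
    ]


# Hypothetical candidate metrics for illustration.
candidate = {
    "e_value": 3e-15,
    "tm_score": 0.62,
    "druggability": 0.41,
    "pocket_potential": 6.3,
}
print(failed_filters(candidate))
```

A candidate is retained only when the returned list is empty; reporting the failed stages, rather than a bare pass/fail, makes it easier to see where borderline sequences are lost.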

Workflow for a Targeted Prediction Run

[Diagram: input query sequence / genome database → configure core search parameters → execute HMMER & motif scan → apply length & E-value filters → homology modeling & fold recognition → active-site pocket detection & profiling → electrostatic & substrate docking filter → final ranked list of high-confidence hits. Sequences rejected at any stage from the length and E-value filters onward are diverted to a filtered-out set.]

Diagram Title: CATNIP Prediction Configuration & Filtering Workflow

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents & Materials for Experimental Validation of CATNIP Predictions

Item Function in Validation Example/Notes
Recombinant Expression System Production of predicted enzyme for in vitro assays. E. coli BL21(DE3) with pET vector for His-tagged protein.
Fe(II) Source Essential co-factor for enzymatic activity. Ammonium iron(II) sulfate hexahydrate (freshly prepared in acidic solution to prevent oxidation).
α-Ketoglutarate Essential co-substrate for the reaction cycle. Sodium salt, dissolved in assay buffer immediately before use.
Ascorbate Reducing agent to maintain Fe(II) in its active state. L-Ascorbic acid, pH-stabilized.
Activity Assay Substrate Validates predicted enzyme function. For histone demethylases: methylated histone peptide (H3K9me3/2). For hydroxylases: synthetic collagen peptide.
Product Detection Kit Quantifies product formation (e.g., succinate, formaldehyde). Succinate Colorimetric Assay Kit; Formaldehyde Dehydrogenase-based Fluorometric Assay.
Structural Biology Reagents For crystallography of predicted active site. Hampton Research crystallization screens (Index, PEG/Ion).

Precise configuration of search parameters and sequential application of structural and biophysical filters within the CATNIP framework are paramount for generating high-quality predictions of novel α-KG/Fe(II)-dependent enzymes. This protocol standardizes the computational approach, providing a reliable pipeline for researchers aiming to identify new therapeutic targets within this mechanistically diverse and pharmacologically significant enzyme family.

Within a broader thesis on computational tools for alpha-ketoglutarate (AKG) FeII-dependent dioxygenase prediction, the CATNIP tool emerges as a critical resource. This application note provides detailed protocols for interpreting CATNIP outputs, enabling accurate functional annotation and supporting research in epigenetics, hypoxia signaling, and drug development targeting these enzymes.

Understanding Key Output Components

CATNIP (Computed Alpha-Ketoglutarate and Ten-Eleven Translocation [TET]/Jumonji Interaction Predictor) provides three primary outputs: prediction scores, sequence alignments, and hit annotations. These are generated via a consensus Hidden Markov Model (HMM) search integrating profiles for the double-stranded beta-helix (DSBH) fold and Fe(II)/AKG binding motifs.

Table 1: Interpretation of CATNIP Prediction Scores

Score Type Range Interpretation Biological Relevance
Overall Confidence 0.0 - 1.0 Probability of being an AKG FeII dioxygenase <0.4: Unlikely; 0.4-0.6: Potential; >0.6: High Confidence
DSBH Fold E-value >1e-10 Statistical significance of DSBH fold match Lower E-value indicates stronger fold conservation
Binding Site Score 0 - 100 Conservation of HxD...H FeII & R/K binding motifs Scores >70 indicate an intact 2-His-1-carboxylate Fe(II)-binding triad
Family Specific Z-score Variable Deviation from family-specific null model Z>3: Significant family membership

Table 2: CATNIP Hit Annotation Categories

Annotation Code Description Implication for Function
TET_FULL Full-length TET/JBP family match DNA demethylation activity likely
JMJD_FULL Jumonji C-domain family match Histone demethylation activity predicted
HYPOXIA_INDUCIBLE Contains LxxLAP motif Potential oxygen sensor (e.g., PHD, FIH)
STRUCTURAL_ONLY DSBH fold with low motif score Possibly inactive homolog or divergent function
PUTATIVE_NEW High scores but low sequence identity Candidate for novel subfamily characterization

Experimental Protocols

Protocol 1: Validating CATNIP Predictions via Sequence Alignment Analysis

Objective: To confirm the functional motifs identified by CATNIP.

Materials: CATNIP output file, multiple sequence alignment software (e.g., Clustal Omega, MAFFT), known reference sequences from Pfam (PF13532, PF13682).

Method:

  • Extract the top-scoring query sequence region identified by CATNIP.
  • Perform a multiple sequence alignment with at least five known family members (e.g., human TET1, JMJD2A, PHD2).
  • Visually inspect the alignment for conservation of the HXD/E...H Fe(II)-binding motif (positions ~30 residues apart) and the R/K residue for AKG binding.
  • Compute pairwise percent identity from the alignment (with BLOSUM62 as the alignment scoring matrix). A validated hit should show >25% identity in the core DSBH region.
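
Percent identity over an aligned region can be computed directly from two gapped sequences. The sketch below assumes one common convention, counting only columns in which both sequences carry a residue; other denominators (for example, the shorter sequence length) give different values, so the convention should be stated alongside any reported figure.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned fragments for illustration; not real enzyme sequences.
identity = percent_identity("HAD-GKH", "HVDAG-H")
print(identity)
```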

Protocol 2: Biochemical Prioritization of CATNIP Hits

Objective: To rank predicted enzymes for experimental characterization in drug discovery pipelines.

Method:

  • From the CATNIP hit list, filter annotations for "TET_FULL", "JMJD_FULL", or "HYPOXIA_INDUCIBLE".
  • Calculate a composite priority score: (0.5 * Overall Confidence) + (0.3 * Binding Site Score/100) + (0.2 * (1 - log10(E-value)/10)).
  • Prioritize hits with a composite score >0.7 for recombinant protein expression.
  • Cross-reference with tissue-specific RNA-seq data (e.g., from GTEx) to identify hits with relevant expression patterns for your disease context (e.g., cancer, fibrosis).
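
The filtering and scoring steps of this protocol can be sketched as follows. The composite formula is implemented exactly as written above, and the annotation codes follow Table 2; the hit records and their field names are hypothetical. Note that the E-value term, as written, is not bounded to the 0-1 range (it equals 1.5 for an E-value of 1e-5), so composite scores can exceed 1.

```python
import math

PRIORITY_CODES = {"TET_FULL", "JMJD_FULL", "HYPOXIA_INDUCIBLE"}


def composite_score(confidence, binding_site_score, e_value):
    """Composite priority score, implemented as stated in the protocol."""
    e_term = 1 - math.log10(e_value) / 10
    return 0.5 * confidence + 0.3 * binding_site_score / 100 + 0.2 * e_term


def prioritize(hits, cutoff=0.7):
    """Keep hits with a priority annotation and a composite score above cutoff."""
    selected = []
    for hit in hits:
        if hit["annotation"] not in PRIORITY_CODES:
            continue
        score = composite_score(hit["confidence"], hit["binding"], hit["e_value"])
        if score > cutoff:
            selected.append((hit["id"], round(score, 3)))
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Hypothetical hit records for illustration.
hits = [
    {"id": "A", "annotation": "TET_FULL", "confidence": 0.9, "binding": 80, "e_value": 1e-5},
    {"id": "B", "annotation": "STRUCTURAL_ONLY", "confidence": 0.9, "binding": 90, "e_value": 1e-20},
    {"id": "C", "annotation": "JMJD_FULL", "confidence": 0.3, "binding": 20, "e_value": 1.0},
]
ranked = prioritize(hits)
print(ranked)
```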

Protocol 3: Structural Model Verification Workflow

Objective: To generate and assess a 3D model of the predicted enzyme.

Materials: SWISS-MODEL or AlphaFold2 server, PyMOL, PDB database.

Method:

  • Use the top CATNIP alignment hit as a template for homology modeling.
  • Generate a 3D model, focusing on the DSBH core.
  • Verify the spatial orientation of the predicted Fe(II)-binding residues (His, Asp/Glu) and the AKG-binding arginine/lysine. The Fe(II) ligands should be within 2.2-2.5 Å of a modeled iron atom.
  • Check for obstruction of the active site; a clear binding pocket supports likelihood of correct function.
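
The distance criterion in this protocol can be checked numerically once the model coordinates are available. This is a minimal sketch: the atom labels and coordinates are hypothetical, and the 2.2-2.5 Å window is taken from the protocol text rather than derived independently.

```python
import math


def ligand_distances(fe_xyz, ligand_atoms, low=2.2, high=2.5):
    """Return {atom: (distance in Å, within_window)} for each Fe ligand atom."""
    report = {}
    for name, xyz in ligand_atoms.items():
        distance = math.dist(fe_xyz, xyz)
        report[name] = (round(distance, 2), low <= distance <= high)
    return report


# Hypothetical coordinates (Å) for a modeled iron and three ligand atoms.
fe = (0.0, 0.0, 0.0)
ligands = {
    "His1_NE2": (2.3, 0.0, 0.0),
    "Asp_OD1": (0.0, 2.4, 0.0),
    "His2_NE2": (0.0, 0.0, 3.1),
}
report = ligand_distances(fe, ligands)
print(report)
```

A ligand flagged as out of range, such as the second histidine in this toy example, suggests either a modeling error or a non-canonical coordination geometry that merits manual inspection.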

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents for Validating AKG FeII Dioxygenase Predictions

Reagent / Material Function in Validation Key Consideration
α-Ketoglutarate (AKG) Co-substrate for activity assays Use stable, cell-permeable forms (e.g., octyl ester) for cellular assays
Fe(II) Chelators (e.g., 2,2'-Dipyridyl) Negative control to prove Fe(II)-dependence Reversible chelators allow rescue experiments with FeSO4
Succinate Detection Kit (Colorimetric) Measures reaction product (succinate) from AKG turnover Higher sensitivity than CO2 detection for initial screens
Pan-JHDM Histone Demethylase Inhibitor (e.g., JIB-04) Positive control for inhibition of Jumonji family enzyme activity Confirms functional class in cell-based assays
Anti-5-Hydroxymethylcytosine (5hmC) Antibody Detects product of TET-family DNA demethylase activity Primary readout for TET enzyme validation
HIF-1α Reporter Cell Line Functional readout for hypoxia-inducible factor (HIF) prolyl hydroxylase activity Measures enzyme's role in oxygen sensing

CATNIP Analysis & Validation Workflow

[Diagram: query protein sequence → CATNIP HMM analysis → three parallel branches (score & E-value interpretation; motif alignment inspection; hit family annotation) → prioritized hit list for validation → experimental validation → thesis integration: AKG FeII enzyme prediction.]

Title: CATNIP Analysis & Validation Workflow

AKG FeII Dioxygenase Catalytic Core Motif

[Diagram: the double-stranded beta-helix (DSBH) fold contains the Fe(II)-binding motif (H-X-D/E ... ~30 aa ... H) and the AKG-binding residue (R or K). The Fe(II)-binding motif catalyzes the reaction, the AKG-binding residue binds the cofactor, and the specific substrate (DNA/histone/protein) is the input. Reaction: AKG + O2 + substrate → succinate + CO2 + modified substrate.]

Title: AKG FeII Dioxygenase Catalytic Core Motif

Accurate interpretation of CATNIP scores, alignments, and annotations is fundamental to advancing a thesis on AKG FeII-dependent enzyme prediction. The structured tables and protocols provided here offer a reproducible framework for researchers to transition from in silico predictions to validated biological function, supporting target identification in drug development for cancer, anemia, and other diseases linked to these oxygen-sensing enzymes.

This Application Note details protocols for discovering biosynthetic gene clusters (BGCs) encoding novel natural products, framed within the ongoing thesis research on the CATNIP tool for predicting alpha-ketoglutarate (αKG) Fe(II)-dependent enzymes. These enzymes are crucial in the biosynthesis of diverse pharmacologically active scaffolds. The integration of genomic mining with functional prediction accelerates the identification of novel pathways for drug development.

Key Quantitative Data on BGC Discovery

Table 1: Prevalence of αKG Fe(II)-Dependent Enzymes in Major BGC Types

BGC Type (Predicted Product) % of BGCs Containing ≥1 αKG Fe(II) Enzyme Common Catalytic Role(s)
Non-Ribosomal Peptide (NRP) 32% Amino Acid Hydroxylation, Epimerization
Polyketide (PK) 28% Tailoring Reactions (e.g., Hydroxylation, Halogenation)
Ribosomally synthesized and post-translationally modified peptide (RiPP) 41% C-H Activation, Cyclization
Terpene 15% Cyclization, Rearrangement
Hybrid (e.g., NRP-PK) 36% Diverse Tailoring Reactions

Table 2: Performance Metrics of Genome Mining Tools

Tool Name Primary Function Precision (BGC Detection) Recall (BGC Detection) αKG Fe(II) Enzyme Prediction Capability
antiSMASH 7.0 BGC Identification & Analysis 0.95 0.89 Basic PFAM-based annotation
DeepBGC BGC Identification (ML-based) 0.91 0.93 No
CATNIP (Thesis Tool) αKG Fe(II) Enzyme Prediction 0.96* 0.92* Core Function
PRISM 4 BGC Chemical Structure Prediction 0.88 0.85 Integrated from external annotations

*Preliminary validation data on a curated set of characterized enzymes.

Experimental Protocols

Protocol 1: Genomic DNA Extraction from Filamentous Actinobacteria

Purpose: Obtain high-molecular-weight, high-purity genomic DNA for sequencing and PCR.

Materials: See Scientist's Toolkit.

Procedure:

  • Grow strain in 50 mL liquid medium (e.g., TSB) for 3-5 days at 30°C, 220 rpm.
  • Harvest mycelia by centrifugation (4,000 x g, 10 min). Wash pellet twice with TE buffer (pH 8.0).
  • Resuspend pellet in 5 mL Lysozyme Solution (10 mg/mL in TE). Incubate at 37°C for 1 hour.
  • Add 0.5 mL 20% SDS and 50 µL Proteinase K (20 mg/mL). Mix gently. Incubate at 55°C for 2 hours.
  • Add 2 mL 5M NaCl and 1.5 mL CTAB/NaCl solution. Mix. Incubate at 65°C for 20 min.
  • Extract with equal volume chloroform:isoamyl alcohol (24:1). Centrifuge (7,500 x g, 15 min).
  • Transfer aqueous phase. Precipitate DNA with 0.6 vol isopropanol. Spool out DNA.
  • Wash DNA with 70% ethanol. Air-dry and dissolve in 200 µL nuclease-free water.
  • Assess purity (A260/A280 ~1.8) and integrity by agarose gel electrophoresis.

Protocol 2: In silico BGC Identification and αKG Fe(II) Enzyme Annotation

Purpose: Identify candidate BGCs and annotate potential αKG Fe(II)-dependent enzymes using a combined antiSMASH and CATNIP workflow.

Materials: Linux workstation, antiSMASH, CATNIP tool, genomic FASTA file.

Procedure:

  • Run antiSMASH: Execute on genomic FASTA with relaxed strictness.

  • Extract Protein Sequences: From the antiSMASH GBK output, extract all protein sequences within predicted BGC regions into a multi-FASTA file (bgc_proteins.faa).
  • Run CATNIP Prediction: Use the CATNIP tool to identify and classify αKG Fe(II) enzymes.

  • Data Integration: Merge CATNIP predictions with antiSMASH BGC location data. Prioritize BGCs containing CATNIP-predicted enzymes with high confidence scores (>0.85).
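
The data-integration step can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own implementation: it assumes CATNIP predictions are available as (protein_id, confidence) pairs and that BGC membership has already been parsed from the antiSMASH output into a protein-to-region mapping. Both data structures and every identifier below are hypothetical, since neither output schema is specified here.

```python
from collections import defaultdict

CONFIDENCE_CUTOFF = 0.85  # threshold taken from the protocol text


def prioritize_bgcs(predictions, protein_to_bgc, cutoff=CONFIDENCE_CUTOFF):
    """Rank BGC regions by their best CATNIP-predicted enzyme.

    predictions    -- iterable of (protein_id, confidence) pairs (assumed format)
    protein_to_bgc -- dict mapping protein_id -> BGC region label (assumed format)
    Returns a list of (bgc, [(protein_id, confidence), ...]) in which regions
    with the highest-scoring hit come first.
    """
    hits_by_bgc = defaultdict(list)
    for protein_id, confidence in predictions:
        bgc = protein_to_bgc.get(protein_id)
        # Keep only proteins that lie inside a predicted BGC and pass the cutoff.
        if bgc is not None and confidence > cutoff:
            hits_by_bgc[bgc].append((protein_id, confidence))
    for hits in hits_by_bgc.values():
        hits.sort(key=lambda hit: hit[1], reverse=True)
    return sorted(hits_by_bgc.items(), key=lambda item: item[1][0][1], reverse=True)


# Hypothetical example data
predictions = [("prot_001", 0.97), ("prot_002", 0.62),
               ("prot_003", 0.91), ("prot_004", 0.88)]
protein_to_bgc = {"prot_001": "region_1", "prot_002": "region_1",
                  "prot_003": "region_2", "prot_004": "region_1"}

ranked = prioritize_bgcs(predictions, protein_to_bgc)
for bgc, hits in ranked:
    print(bgc, hits)
```

Ranking by the single best hit is only one possible choice; counting the hits per region or summing their scores would be equally easy to substitute.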

Protocol 3: Heterologous Expression of a Candidate BGC

Purpose: Activate a silent BGC by cloning and expressing it in a heterologous host (Streptomyces coelicolor CH999).

Materials: Bacterial Artificial Chromosome (BAC) vector, E. coli ET12567/pUZ8002, S. coelicolor CH999 spore suspension.

Procedure:

  • Clone BGC: Using Gibson Assembly, clone the ~50 kb BGC (amplified via Long-Range PCR) into a BAC vector linearized with appropriate restriction enzymes.
  • Transform into E. coli ET12567/pUZ8002: Introduce the recombinant BAC into the methylation-deficient E. coli donor strain via electroporation.
  • Conjugal Transfer:
    • Grow donor E. coli with BAC and helper plasmid to OD600 ~0.5.
    • Wash donor cells and mix with CH999 spores (heat-shocked at 50°C for 10 min) at a 1:10 ratio.
    • Plate mixture on MS agar with 10 mM MgCl2. Incubate at 30°C for 16 hours.
    • Overlay plate with 1 mL water containing nalidixic acid (25 µg/mL) and apramycin (50 µg/mL) to select for exconjugants.
    • Incubate at 30°C for 5-7 days until exconjugants appear.
  • Fermentation & Metabolite Analysis: Grow exconjugants in R5 liquid medium for 7 days. Extract metabolites with ethyl acetate. Analyze by LC-MS.

Visualizations

[Diagram, Genome Mining & CATNIP Analysis Workflow: Microbial Genome (FASTA) → antiSMASH 7.0 (BGC Detection) → BGC Protein Sequences → CATNIP Tool (αKG FeII Enzyme Prediction) → Annotated BGCs with Predicted Tailoring Enzymes → Heterologous Expression (e.g., S. coelicolor) → Novel Natural Product Isolation & Characterization]

Workflow for Novel Natural Product Discovery

[Diagram: the Peptide/PK Backbone substrate, α-Ketoglutarate (αKG), and Molecular Oxygen (O₂) enter the αKG Fe(II)-Dependent Enzyme (CATNIP Target), which releases the Modified Scaffold (Hydroxylated, Cyclized, etc.), Succinate, and CO₂]

αKG FeII Enzyme Catalytic Cycle

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Genome Mining & Pathway Activation

Item | Function/Benefit | Example Product/Catalog
High-Purity gDNA Extraction Kit | Removes polysaccharides and RNA; critical for long-read sequencing. | Qiagen Genomic-tip 100/G
Long-Range PCR Enzyme Mix | Amplifies large BGC fragments (>20 kb) for cloning. | Takara LA Taq
Gibson Assembly Master Mix | Seamless, one-step assembly of multiple large DNA fragments. | NEB Gibson Assembly HiFi 2X
Methylation-Deficient E. coli Donor Strain | Essential for efficient conjugal transfer of DNA into actinomycetes. | E. coli ET12567/pUZ8002
Linearized BAC Vector | Stable maintenance of large BGC inserts in heterologous hosts. | pCAP01, pESAC13
Actinomycete Spores | Ready-to-use, standardized exconjugant generation. | S. coelicolor CH999 Spores
HPLC-MS Grade Solvents | High-purity solvents for metabolite extraction and analysis. | Fisher Chemical Optima LC/MS
Broad-Spectrum Protease Inhibitor Cocktail | Preserves enzyme activity in cell lysates for in vitro assays. | Roche cOmplete Mini EDTA-free

Solving Common CATNIP Analysis Problems and Enhancing Prediction Accuracy

When the CATNIP tool is applied to predict substrates of alpha-ketoglutarate (α-KG)/Fe(II)-dependent dioxygenases, researchers frequently encounter low-score or ambiguous computational hits. Such results make it difficult to distinguish true enzymatic substrates from background noise. This protocol details systematic parameter adjustment strategies within the CATNIP framework to refine predictions, enhance specificity, and validate potential substrates for subsequent experimental testing in drug discovery pipelines.

Core Parameter Optimization Table

The following table summarizes key adjustable parameters in the CATNIP pipeline, their default settings, recommended adjustments for ambiguous hits, and the primary impact of each adjustment.

Table 1: CATNIP Parameter Adjustment Strategies for Ambiguous Hits

Parameter Category | Default Value | Recommended Adjustment for Low-Score Hits | Rationale & Impact on Results
Sequence Identity Cutoff | 30% | Lower to 25-28% | Increases sensitivity by capturing more distant homologs; may increase false positives.
Consensus Score Threshold | 0.7 | Lower to 0.5-0.65 | Includes hits with weaker but convergent predictive signals from multiple algorithms.
Alignment Coverage (Query) | 70% | Increase to >80% | Demands more complete structural domain alignment, improving hit relevance.
E-value Threshold (BlastP) | 1e-5 | Relax to 1e-3 or 1e-2 | Broadens the search space; requires careful post-filtering.
Fe(II)-binding Motif Stringency | Strict HXD/E...H | Allow conserved substitutions (e.g., Q for H) | Accommodates known variant motifs in subfamilies while preserving catalytic core.
α-KG Binding Pocket Residue Match | 100% Match | Allow ≥80% Match | Permits analysis of enzymes with non-canonical co-substrate interactions.
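
To make the interplay of these thresholds concrete, the sketch below encodes the four numeric parameters from the table as a default set and a relaxed set, then checks one ambiguous hit against each. The `CatnipParams` class, the field names, and the hit dictionary are illustrative assumptions; the source does not describe how CATNIP actually exposes its settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CatnipParams:
    """Hypothetical container for the numeric thresholds listed in the table."""
    min_identity: float = 30.0   # percent sequence identity
    min_consensus: float = 0.7   # consensus score
    min_coverage: float = 70.0   # percent query coverage
    max_evalue: float = 1e-5     # BlastP E-value


DEFAULTS = CatnipParams()
# Relaxed sensitivity thresholds paired with a stricter coverage requirement.
RELAXED = replace(DEFAULTS, min_identity=25.0, min_consensus=0.5,
                  min_coverage=80.0, max_evalue=1e-3)


def passes(hit, params):
    """Return True if a hit (dict of metrics) satisfies every threshold."""
    return (hit["identity"] >= params.min_identity
            and hit["consensus"] >= params.min_consensus
            and hit["coverage"] >= params.min_coverage
            and hit["evalue"] <= params.max_evalue)


# Hypothetical ambiguous hit: rejected by the defaults, kept by the relaxed set.
hit = {"identity": 27.0, "consensus": 0.62, "coverage": 85.0, "evalue": 5e-4}
print(passes(hit, DEFAULTS), passes(hit, RELAXED))
```

Raising the coverage requirement while lowering the other cutoffs is what keeps the relaxed set from simply admitting every weak, partial alignment.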

Experimental Validation Protocol for Refined Hits

Following computational refinement, putative substrates require biochemical validation.

Protocol 1: In Vitro Dioxygenase Activity Assay for Validated CATNIP Hits

Objective: To experimentally confirm the predicted enzymatic activity of a refined α-KG/Fe(II)-dependent dioxygenase on its proposed substrate.

Research Reagent Solutions & Essential Materials:

  • Recombinant Enzyme: Purified, catalytically active α-KG-dependent dioxygenase expressed in E. coli.
  • Predicted Substrate: Chemically synthesized or purified putative substrate compound.
  • Reaction Buffer (50 mM HEPES, pH 7.0): Maintains physiological pH for optimal enzyme activity.
  • Cofactor Solution (100 µM Fe(II) as (NH₄)₂Fe(SO₄)₂·6H₂O, 1 mM α-KG): Provides the essential metal cofactor and primary co-substrate.
  • Ascorbate (2-5 mM): Often included as a reducing agent to maintain Fe(II) in its active state.
  • Stop Solution (1% Formic Acid): Rapidly quenches the enzymatic reaction.
  • LC-MS/MS System (e.g., Q-Exactive Orbitrap): For high-resolution mass spectrometry analysis of substrate depletion and product formation.
  • Control Substrate (e.g., Known Histone Demethylase Peptide for JmjC enzymes): Provides a positive control for enzyme batch activity.

Methodology:

  • Reaction Setup: In a 50 µL final volume, combine:
    • 50 mM HEPES buffer (pH 7.0).
    • 100 µM Fe(II).
    • 1 mM α-KG.
    • 2 mM Ascorbate.
    • 10-50 µM Predicted Substrate.
    • 0.5-2 µg Recombinant Enzyme (initiate reaction by addition).
  • Incubation: Incubate at relevant physiological temperature (e.g., 37°C) for 15-60 minutes.
  • Quenching: Add 50 µL of ice-cold 1% formic acid to stop the reaction.
  • Sample Preparation: Centrifuge at 15,000 x g for 10 min to pellet precipitated protein. Transfer supernatant for LC-MS/MS analysis.
  • LC-MS/MS Analysis:
    • Chromatography: Use a reverse-phase C18 column. Employ a gradient from 5% to 95% acetonitrile in water (both with 0.1% formic acid) over 15 minutes.
    • Mass Spectrometry: Operate in positive ion mode. Perform full MS scans (m/z 150-2000) followed by data-dependent MS/MS scans on the most intense ions.
  • Data Analysis: Extract ion chromatograms (XIC) for the predicted substrate ([M+H]⁺) and the hypothesized product (expected mass shift: +16 Da for hydroxylation, -14 Da for demethylation, etc.). Quantify peak areas. A successful reaction shows time-dependent substrate depletion and concomitant product formation in the enzyme-containing sample, with neither observed in no-enzyme or heat-denatured enzyme controls.
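
The m/z values to extract can be computed ahead of the run. The sketch below is a minimal helper under stated assumptions: ions are singly protonated, demethylation is entered as the net loss of CH₂ (-14.016 Da), and the substrate mass is a made-up placeholder rather than a value from this protocol.

```python
PROTON_MASS = 1.007276  # Da, mass added on forming [M+H]+

# Monoisotopic mass changes for common transformations (Da).
MASS_SHIFTS = {
    "hydroxylation": 15.994915,   # +O
    "demethylation": -14.015650,  # -CH2
    "desaturation": -2.015650,    # -2H
}


def expected_mz(neutral_mass, transformation=None):
    """Return the [M+H]+ m/z, optionally after applying a transformation."""
    shift = MASS_SHIFTS[transformation] if transformation else 0.0
    return neutral_mass + shift + PROTON_MASS


substrate_mass = 500.2000  # hypothetical neutral monoisotopic mass (Da)
for name in (None, "hydroxylation", "demethylation"):
    print(name or "substrate", round(expected_mz(substrate_mass, name), 4))
```

Using exact monoisotopic shifts rather than nominal integers matters on a high-resolution instrument, where extraction windows are typically a few ppm wide.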

Visualization of Workflow & Logic

[Diagram: Initial CATNIP Run (Low-Score/Ambiguous Hits) → Parameter Adjustment Strategy → three sequential decisions. "Lower Consensus Threshold?" and "Relax Motif Stringency?" each lead, on Yes, to "Broaden Search, Increase Sensitivity"; "Adjust Alignment Coverage?" leads, on Yes, to "Refine for Specificity". All branches converge on Integrate Refined Parameter Set → Execute CATNIP with Adjusted Parameters → Refined, Ranked Hit List → Experimental Validation]

Diagram Title: Parameter Tuning Logic for CATNIP Ambiguity Resolution

[Diagram: Predicted Substrate (From CATNIP), α-KG/Fe(II)-Dependent Enzyme, and Cofactors (Fe(II), α-KG, O₂) → In Vitro Reaction (37°C, 15-60 min) → Acid Quench & Protein Precipitation → LC-MS/MS Analysis → Data Analysis: Substrate Depletion, Product Formation (+16 Da, etc.)]

Diagram Title: Experimental Validation Workflow for CATNIP Predictions

The CATNIP (Computational Analysis Tool for Non-Heme Iron Protein) prediction tool is designed to identify novel alpha-ketoglutarate (αKG) Fe(II)-dependent dioxygenases from expansive genomic and metagenomic datasets. These enzymes are pivotal in natural product biosynthesis, drug metabolism, and cellular signaling, representing high-value targets for drug development. Efficient handling of terabyte-scale genomic data is therefore not an ancillary concern but a core requirement for the tool's utility in this thesis research. This document outlines protocols and performance optimizations for managing such data throughout the CATNIP workflow.

Core Performance Challenges & Quantitative Benchmarks

Processing genomic data with CATNIP involves a sequence of computationally heavy steps: data retrieval, quality control, gene calling, multiple sequence alignment, and finally the machine learning-based prediction. Performance bottlenecks are consistently observed at the I/O and alignment stages.

Table 1: Performance Benchmarks for Key CATNIP Workflow Steps on a 1 TB Metagenomic Dataset

Workflow Step | Software (Example) | Resource Peak | Execution Time (Baseline) | Execution Time (Optimized) | Key Bottleneck
Data QC & Trimming | FastP | 8 CPU, 16 GB RAM | 4.5 hours | 1.2 hours | I/O Read/Write
Gene Calling | Prodigal | 32 CPU, 32 GB RAM | 18 hours | 5 hours | CPU
Sequence Alignment | HMMER3 (HMMSCAN) | 48 CPU, 64 GB RAM | 120+ hours | 28 hours | CPU & Memory
Feature Generation | Custom Python Scripts | 16 CPU, 128 GB RAM | 6 hours | 1.5 hours | Memory & I/O
CATNIP Prediction | TensorFlow Model | 8 CPU, 1 GPU, 32 GB RAM | 0.5 hours | 0.1 hours | GPU Memory
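
The per-step speedups implied by this table are easy to derive, and doing so shows where optimization effort pays off most. The short calculation below copies the baseline and optimized times from the table; because the alignment baseline is reported as "120+" hours, the figures that involve it should be read as lower bounds.

```python
# (baseline_hours, optimized_hours) copied from the benchmark table; the
# alignment baseline is given as "120+", so its speedup is a lower bound.
BENCHMARKS = {
    "Data QC & Trimming": (4.5, 1.2),
    "Gene Calling": (18.0, 5.0),
    "Sequence Alignment": (120.0, 28.0),
    "Feature Generation": (6.0, 1.5),
    "CATNIP Prediction": (0.5, 0.1),
}

speedups = {step: base / opt for step, (base, opt) in BENCHMARKS.items()}
total_baseline = sum(base for base, _ in BENCHMARKS.values())
total_optimized = sum(opt for _, opt in BENCHMARKS.values())

for step, factor in speedups.items():
    print(f"{step}: {factor:.2f}x")
print(f"End-to-end: {total_baseline:.1f} h -> {total_optimized:.1f} h")
```

Even after optimization, sequence alignment accounts for most of the remaining runtime, which is consistent with the bottleneck noted above.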

Table 2: Impact of File Format on I/O Performance

Format | Compression | Size (for 100 GB raw FASTA) | Read Speed | Recommended Use Case
FASTA (.fasta) | None | 100 GB | Fast | Intermediate processing
FASTQ (.fq) | None | ~300 GB | Medium | Raw sequence input
gzip (.gz) | Gzip | ~35 GB | Slow | Long-term storage, transfer
CRAM (.cram) | Reference-based | ~22 GB | Fast | Aligned read storage
HDF5 (.h5) | Internal | ~40 GB | Very Fast | Feature matrix storage

Detailed Experimental Protocols

Protocol 3.1: Efficient Data Acquisition and Pre-processing for CATNIP

Objective: To rapidly download, validate, and pre-filter large genomic datasets to reduce downstream computational load.

  • Parallelized Download:
    • Use aria2c or parallel with curl to download SRA datasets (e.g., from NCBI) using multiple connections.
    • Command: aria2c -x 16 -s 16 <ftp_url_of_sra_file>
  • Batch Conversion & Trimming:
    • Convert SRA to FASTQ using a tool like fasterq-dump (faster than fastq-dump) with --threads flag.
    • Implement batch quality trimming and adapter removal using fastp in multi-threaded mode, processing multiple samples simultaneously via GNU parallel.
    • Command: ls *.fastq | parallel -j 4 "fastp -i {} -o {}.clean.fq -j {}.json -h {}.html -w 16"
  • Initial Sequence Filtering:
    • Use seqtk to subset data or filter by length rapidly. For CATNIP, retaining sequences with homology to known dioxygenase domains (e.g., Pfam PF03171) at this stage can drastically reduce volume.
    • Command: seqtk subseq input.fq id_list.txt > output.fq

Protocol 3.2: Scalable Homology Search and Alignment

Objective: To perform sensitive homology searches against massive protein databases within a feasible timeframe.

  • Database Preparation:
    • Download and format custom databases (UniRef90, Pfam-A) in a high-performance format (e.g., use hmmpress for HMMER databases).
    • Store databases on a high-speed local SSD or RAM disk (/dev/shm) for repeated queries.
  • Distributed HMMER Search:
    • Split the query FASTA file into multiple chunks using faSplit.
    • Execute hmmsearch or hmmscan in parallel across a compute cluster (SLURM, SGE) or multi-core server.
    • Script Core:

  • Result Aggregation:
    • Concatenate results and parse using grep/awk or BioPython scripts, loading data into a Pandas DataFrame for subsequent feature extraction for CATNIP.
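
The chunk-and-dispatch pattern described in this protocol can be sketched as follows. This is an illustrative outline, not the original script: the file names are hypothetical, the splitter works on in-memory text rather than calling faSplit, and `run_chunks` assumes an `hmmsearch` binary is on the PATH. The options used (`--cpu`, `--domtblout`, `-o`) are standard HMMER options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_fasta(text, n_chunks):
    """Split FASTA text into up to n_chunks lists of records (round-robin)."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def hmmsearch_command(hmm_db, chunk_path, out_path, cpus=4):
    """Build an hmmsearch call that writes a per-domain table for one chunk."""
    return ["hmmsearch", "--cpu", str(cpus), "--domtblout", out_path,
            "-o", "/dev/null", hmm_db, chunk_path]


def run_chunks(hmm_db, chunk_paths, workers=4):
    """Run one hmmsearch process per chunk; the threads only wait on subprocesses."""
    commands = [hmmsearch_command(hmm_db, p, p + ".domtblout") for p in chunk_paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))


demo = ">a\nMKT\n>b\nMAA\n>c\nMGG\n"
print(split_fasta(demo, 2))
print(hmmsearch_command("Pfam-A.hmm", "chunk_0.faa", "chunk_0.domtblout"))
```

On a cluster, the same command list would instead be submitted as a job array through the scheduler; the per-chunk `.domtblout` files can then be concatenated for the aggregation step.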

Protocol 3.3: CATNIP Model Training on Large Feature Sets

Objective: To train deep learning models on high-dimensional genomic feature data without memory overflow.

  • Data Loading Strategy:
    • Use TensorFlow tf.data.Dataset API or PyTorch DataLoader with custom iterators to stream data from HDF5 files on disk, rather than loading entire datasets into RAM.
  • Mixed Precision Training:
    • Enable mixed precision (tf.keras.mixed_precision.set_global_policy('mixed_float16')) to accelerate training on GPUs that support it and to reduce memory usage (float16 activations take half the space of float32), allowing for larger batch sizes.
  • Gradient Accumulation:
    • For very large models or feature vectors, simulate a larger effective batch size by accumulating gradients over several forward/backward passes before updating weights.
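
Gradient accumulation is easiest to trust once its equivalence to a large batch is seen numerically. The framework-free sketch below uses a toy one-parameter linear model with made-up data, not the CATNIP network. It shows that size-weighted averaging of micro-batch gradients reproduces the full-batch gradient, so a single weight update after several passes behaves like one large-batch step.

```python
def gradient(w, batch):
    """Mean-squared-error gradient dL/dw for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, micro_batch_size):
    """Average gradients over micro-batches, weighting each by its size."""
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        total += gradient(w, micro) * len(micro)
    return total / len(batch)


# Toy (x, y) pairs standing in for feature/label batches.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]
w = 0.5
full = gradient(w, data)
accumulated = accumulated_gradient(w, data, micro_batch_size=2)
print(full, accumulated)

# One weight update after accumulating over three micro-batches.
w -= 0.01 * accumulated
```

In a deep learning framework the same idea applies per parameter tensor: only one micro-batch of activations is held in memory at a time, while the effective batch size stays large.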

Visualization of Workflows

[Diagram: Raw Genomic Data (SRA/FASTQ) → (TB-scale) → Parallel QC & Trimming (fastp, seqtk) → (Filtered Reads) → Distributed Gene Calling (Prodigal) → (Protein Sequences) → Parallel HMM Search (HMMER3) → (Domain Hits) → Feature Extraction (Physicochemical, Evolutionary) → (Feature Matrix) → CATNIP Prediction Model (Neural Network) → Novel αKG/FeII Enzyme Predictions & Rankings]

Diagram Title: CATNIP Large-Scale Data Processing Pipeline

Diagram Title: Computational Resource Allocation Map

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools & Resources for CATNIP-Based Research

Item/Category | Specific Solution/Product | Function in CATNIP Workflow
High-Performance Compute | AWS EC2 (p3.2xlarge), GCP A2 VMs, or local server with NVIDIA A100/A40 GPU. | Provides the necessary parallel CPUs and high-memory GPU for model training and large-scale alignments.
Job Scheduler | SLURM, Apache Airflow, or Nextflow. | Orchestrates and automates the multi-step CATNIP pipeline across compute clusters, managing dependencies.
Data Storage | Lustre parallel filesystem, AWS S3/Google Cloud Storage with lifecycle policies. | High-speed storage for active projects and cost-effective archival for raw genomic datasets.
Containerization | Docker/Singularity images with Conda environments. | Ensures reproducibility of the CATNIP software stack (Python, HMMER, Prodigal, etc.) across different systems.
Database Subscription | UniProt, Pfam, and custom HMM databases. | Curated sources of enzyme families for homology searches and training data generation.
Monitoring | Grafana & Prometheus, htop, nvidia-smi. | Real-time monitoring of cluster/node resource utilization (CPU, RAM, GPU, I/O) to identify bottlenecks.
Programming Library | Biopython, Pandas, NumPy, TensorFlow/PyTorch, Dask. | Core libraries for data parsing, feature engineering, and building the CATNIP prediction algorithm.

Within the broader thesis on the CATNIP (Computational Analysis Tool for Non-heme Iron Protein) prediction pipeline, a critical challenge is the high rate of false positive assignments for alpha-ketoglutarate (AKG)/Fe(II)-dependent enzymes. These oxygenases are pivotal in diverse biological processes, including hypoxia sensing, collagen biosynthesis, and epigenetic regulation, making their accurate identification essential for functional genomics and drug discovery. This document provides detailed application notes and protocols for the manual validation of candidate enzymes predicted by CATNIP, differentiating true positives from false positives through rigorous experimental and bioinformatic techniques.

Core Validation Strategy

The validation framework is built on a multi-tiered approach, converging evidence from sequence, structure, and biochemical function.

Table 1: Multi-Tier Validation Framework for AKG/FeII Enzymes

Tier | Validation Aspect | Key Techniques | Expected Outcome for True Positives
1 | Sequence & Motif Analysis | HMM profiling, Residue co-occurrence check | Presence of HxD...H iron-binding motif and other conserved active-site residues (e.g., R/K for AKG binding).
2 | Structural Assessment | Homology modeling, Active site cavity analysis | Prediction of a double-stranded beta-helix (DSBH) or jelly-roll fold with a 2-His-1-carboxylate iron coordination.
3 | Functional Biochemistry | In vitro activity assay, Mass spectrometry | AKG-dependent consumption of O₂ and substrate, coupled with succinate production.
4 | Cellular Context | Gene co-expression, Metabolic pathway mapping | Co-expression with known pathway components or relevant substrate biosynthetic genes.
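
The Tier 1 motif check lends itself to a quick first-pass screen. The sketch below scans a protein sequence for an H-x-D/E ... H arrangement with a regular expression. The spacer bounds (`min_gap`, `max_gap`) are illustrative assumptions, not values from this framework, and both test sequences are synthetic. A regex hit is only a coarse filter: it says nothing about whether the residues are positioned to coordinate iron.

```python
import re


def find_facial_triad(sequence, min_gap=40, max_gap=200):
    """Return (start, end) spans matching H-x-[D/E] ... H in a protein sequence.

    min_gap and max_gap bound the spacer between the carboxylate residue and
    the distal histidine; both defaults are illustrative assumptions.
    The lookahead lets overlapping candidates be reported, and the lazy
    quantifier pairs each H-x-[D/E] with the nearest permissible histidine.
    """
    pattern = re.compile(r"(?=(H.[DE].{%d,%d}?H))" % (min_gap, max_gap))
    return [(m.start(1), m.end(1)) for m in pattern.finditer(sequence)]


# Synthetic sequences for demonstration only.
with_motif = "MK" + "HTD" + "A" * 50 + "H" + "GG"
without_motif = "MK" + "HTA" + "A" * 50 + "H"

print(find_facial_triad(with_motif))
print(find_facial_triad(without_motif))
```

Candidates that pass this screen would still proceed to HMM profiling and the structural tier before being accepted.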

Detailed Experimental Protocols

Protocol: In Vitro Radiometric Activity Assay for AKG/FeII Enzymes

This protocol measures the conversion of [1-¹⁴C]-labeled AKG to ¹⁴CO₂, a direct product of the decarboxylation reaction.

Materials:

  • Purified recombinant candidate enzyme.
  • Assay Buffer: 50 mM HEPES (pH 7.0), 150 mM NaCl.
  • Cofactor Solution: 100 µM Fe(II)(NH₄)₂(SO₄)₂, 2 mM L-ascorbate (freshly prepared).
  • Substrate Mix: 100 µM [1-¹⁴C]-AKG (specific activity ~0.1 µCi/µmol), 1 mM putative native substrate.
  • Stop Solution: 2 M H₂SO₄.
  • CO₂ Trapping System: Hyamine hydroxide-soaked filter paper in a suspended center well (Kontes).

Procedure:

  • Reaction Setup: In a sealed, rubber-stoppered reaction vial, combine 50 µL Assay Buffer, 10 µL Cofactor Solution, and 10 µL Substrate Mix. Pre-incubate at 25°C for 2 minutes.
  • Initiation & Incubation: Start the reaction by injecting 20 µL of purified enzyme (or buffer for the negative control) through the stopper. Incubate at 25°C for 30 minutes with gentle agitation.
  • Termination & Capture: Inject 100 µL of Stop Solution through the stopper. Continue incubation for 60 minutes to allow complete release and capture of ¹⁴CO₂ by the hyamine hydroxide filter.
  • Quantification: Carefully remove the filter paper, place it in a scintillation vial with cocktail, and measure radioactivity by liquid scintillation counting.
  • Data Analysis: Calculate enzyme activity (nmol CO₂/min/mg). A true positive will show significant activity dependent on both Fe(II) and the putative substrate. Include controls lacking enzyme, Fe(II), or substrate.
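
The conversion from scintillation counts to specific activity can be written out explicitly. The sketch below assumes counts have already been corrected for counting efficiency (so they are in DPM) and uses the fixed relationship 1 µCi = 2.22 × 10⁶ DPM. The count values and enzyme amount in the example are made-up placeholders chosen to stay within the roughly 1 nmol of labeled AKG present in the reaction described above.

```python
DPM_PER_MICROCURIE = 2.22e6  # 1 µCi = 2.22 x 10^6 disintegrations per minute


def specific_activity_nmol_min_mg(sample_dpm, blank_dpm, label_uci_per_umol,
                                  minutes, enzyme_mg):
    """Convert trapped 14CO2 counts into nmol CO2 per minute per mg enzyme."""
    net_dpm = sample_dpm - blank_dpm  # subtract the no-enzyme control
    dpm_per_nmol = label_uci_per_umol * DPM_PER_MICROCURIE / 1000.0
    nmol_co2 = net_dpm / dpm_per_nmol
    return nmol_co2 / minutes / enzyme_mg


# Hypothetical readings: 0.1 µCi/µmol label, 30 min reaction, 5 µg enzyme.
activity = specific_activity_nmol_min_mg(
    sample_dpm=161.0, blank_dpm=50.0, label_uci_per_umol=0.1,
    minutes=30.0, enzyme_mg=0.005)
print(round(activity, 2), "nmol CO2/min/mg")
```

Running the same calculation on the no-Fe(II) and no-substrate controls gives the background against which dependence on each component is judged.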

Protocol: LC-MS/MS-Based Metabolite Profiling

This protocol validates function by directly measuring the consumption of AKG and production of succinate and the hydroxylated product.

Materials:

  • Quench Solution: 80% (v/v) methanol/water at -40°C.
  • LC-MS System: Reversed-phase C18 column (e.g., Zorbax SB-C18), coupled to a high-resolution mass spectrometer.
  • Mobile Phase A: 0.1% Formic acid in H₂O.
  • Mobile Phase B: 0.1% Formic acid in acetonitrile.

Procedure:

  • Reaction & Quenching: Perform a scaled-up reaction similar to the radiometric assay above, but with non-radiolabeled AKG. At each timepoint (0, 5, 15, 30 min), remove a 50 µL aliquot and immediately mix it with 200 µL of cold Quench Solution. Centrifuge (16,000 x g, 10 min, 4°C) to pellet protein.
  • Sample Analysis: Inject supernatant onto the LC-MS/MS system. Use a gradient from 2% to 95% Mobile Phase B over 12 minutes.
  • Mass Spectrometry: Operate in negative electrospray ionization (ESI-) mode. Use targeted Selected Reaction Monitoring (SRM) or high-resolution full-scan modes.
    • Monitor for AKG (m/z 145.01 [M-H]⁻) and succinate (m/z 117.02 [M-H]⁻).
    • Monitor for predicted mass shift of the substrate (+15.9949 Da for one hydroxylation).
  • Validation: A true positive will show time-dependent depletion of AKG and substrate, with concomitant production of succinate and the hydroxylated product. Quantify using standard curves.
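
The monitored m/z values follow directly from elemental composition. As a cross-check, the sketch below computes [M-H]⁻ for α-ketoglutaric acid (C₅H₆O₅) and succinic acid (C₄H₆O₄) from standard monoisotopic masses; the small lookup table covers only the elements needed here.

```python
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462, "N": 14.00307401}
PROTON = 1.00727646  # Da


def mz_negative(formula):
    """[M-H]- m/z for an elemental composition given as {element: count}."""
    neutral = sum(MONOISOTOPIC[element] * count for element, count in formula.items())
    return neutral - PROTON


akg = {"C": 5, "H": 6, "O": 5}        # alpha-ketoglutaric acid, C5H6O5
succinate = {"C": 4, "H": 6, "O": 4}  # succinic acid, C4H6O4
print(round(mz_negative(akg), 2), round(mz_negative(succinate), 2))
```

Rounded to two decimals, these reproduce the 145.01 and 117.02 values listed in the procedure; the same function can be reused for a predicted product once its formula is known.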

Visualization of Workflows and Relationships

[Diagram: CATNIP Prediction List → False Positive Filter → (Candidate Sequence) → Tier 1: Sequence & Motif → Tier 2: Structural Assessment → Tier 3: Functional Biochemistry → Tier 4: Cellular Context → Validated AKG/FeII Enzyme. A candidate advances only on Pass; a Fail at any tier leads to Rejected False Positive]

Diagram Title: Multi-tier Validation Workflow for CATNIP Predictions

[Diagram: Enzyme-Fe(II) binds Substrate (R-H) and α-Ketoglutarate to form the Ternary Complex [Enz-Fe·AKG·Sub]; O₂ activation at the complex yields Succinate, CO₂, and the Hydroxylated Product (R-OH), and turnover regenerates Enzyme-Fe(II)]

Diagram Title: Catalytic Cycle of AKG/FeII Dioxygenases

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Reagents for AKG/FeII Enzyme Validation

Reagent/Material | Function & Role in Validation | Key Considerations
Fe(II)(NH₄)₂(SO₄)₂ | Provides the essential Fe²⁺ cofactor for catalytic activity. | Must be prepared fresh. Anoxia is recommended during stock preparation to prevent oxidation to Fe(III). Use in an ascorbate-containing buffer.
L-Ascorbate | Acts as a reducing agent to maintain iron in the Fe(II) state and may assist in catalysis. | Critical for sustaining activity in in vitro assays. Prepare fresh daily.
[1-¹⁴C]-α-Ketoglutarate | Radiolabeled substrate enabling highly sensitive, direct measurement of the core decarboxylation reaction. | The 1-¹⁴C label is released as ¹⁴CO₂, providing unambiguous evidence of enzymatic turnover.
High-Resolution Mass Spectrometer (e.g., Q-TOF) | Enables untargeted discovery and targeted quantification of substrates and products (succinate, hydroxylated compound). | Essential for confirming the exact chemical transformation, especially for novel substrates.
Anaerobic Chamber/Cuvette | Maintains an oxygen-free environment for handling Fe(II) stocks and setting up sensitive reactions. | Prevents rapid autoxidation of Fe(II) and allows precise control of O₂ introduction for kinetics.
Stable His-tag Purification System (Ni-NTA/Co²⁺) | Standardized purification of recombinant candidate enzymes for biochemical assays. | Ensures high yield and purity of protein required for reliable kinetic characterization.
Homology Modeling Software (e.g., SWISS-MODEL, AlphaFold2) | Predicts 3D structure to assess the presence of the conserved DSBH fold and active site geometry. | A predicted structure lacking the canonical Fe(II)-binding site is a strong false positive indicator.

The CATNIP (Computational Analysis Toolkit for α-Ketoglutarate Fe(II)-Dependent Enzymes Prediction) framework provides a foundational model for identifying and classifying aKG/Fe(II)-dependent enzymes, a superfamily with profound implications in epigenetics, metabolism, and hypoxia response. This application note details the essential subfamily-specific optimization protocols required to adapt the core CATNIP model for two critical target groups: the Jumonji C (JmjC) domain-containing histone demethylases and the Hypoxia-Inducible Factor Prolyl Hydroxylases (HIF-PHs). These protocols enable researchers to shift from broad classification to targeted prediction and inhibitor design.

Subfamily-Specific Structural & Functional Determinants

Successful model optimization requires retraining on features that distinguish these subfamilies within the broader aKG/Fe(II)-dependent enzyme family. The key quantitative discriminators are summarized below.

Table 1: Comparative Features of JmjC vs. HIF-PH Subfamilies

Feature | JmjC Histone Demethylases | HIF Prolyl Hydroxylases (EGLN1-3) | Data Source (PMID)
Primary Biological Role | Epigenetic regulation via histone lysine demethylation | Oxygen sensing; targeting HIF-α for degradation | 34140339, 20017987
Key Substrate | Methylated histone tails (e.g., H3K9me3, H3K27me3) | Hypoxia-Inducible Factor (HIF-α) proline residues | 34140339, 15882621
Cofactor Requirement | aKG, Fe(II), O₂ | aKG, Fe(II), O₂ | Consistent
Selectivity Determinant | JmjC domain topology; reader domains for histone marks | β2β3 loop conformation; C-terminal helix for HIF binding | 25561780, 35862852
Representative Km for aKG (μM) | 5 - 50 (varies by specific enzyme) | 10 - 25 (for EGLN2) | 22327296, 15882621
Inhibitor Scaffolds | 8-Hydroxyquinolines, pyridine carboxylates | Roxadustat, Molidustat, Vadadustat | 35196442, 32320653

Experimental Protocols for Model Tuning & Validation

Protocol 3.1: Curating a Subfamily-Specific Training Dataset

Objective: To compile high-quality, non-redundant sequence and structural data for JmjC or HIF-PH enzymes. Materials:

  • Public databases (PDB, UniProt, BRENDA).
  • Sequence alignment software (Clustal Omega, MUSCLE).
  • Custom Python scripts for data parsing.

Method:

  • Data Retrieval: From UniProt, retrieve all reviewed human proteins annotated with "JmjC domain" (GO:0036113) or "HIF-PH activity" (GO:0101008). Include orthologs from model organisms (M. musculus, D. rerio) with ≥70% sequence identity.
  • Structure Mapping: Cross-reference with the PDB to obtain all unique X-ray crystallography structures with resolution ≤2.5 Å, bound to cofactor (aKG/Fe(II)) or inhibitors.
  • Negative Set Curation: For the CATNIP model, compile a negative set from other aKG-dependent subfamilies (e.g., collagen prolyl hydroxylases, TET enzymes) to teach discrimination.
  • Feature Extraction: For each entry, extract:
    • Sequence Features: Position-specific scoring matrix (PSSM) profiles, conserved motif (e.g., HXD/E...H iron-binding motif) variations.
    • Structural Features (if available): Active site volume (calculated with CASTp), β2β3 loop dihedral angles (for HIF-PHs), JmjC domain twist angle.
    • Functional Annotations: Substrate specificity, reported inhibitory constants (Ki).

Protocol 3.2: Active Site Pocket Pharmacophore Mapping

Objective: To define the 3D chemical feature constraints for virtual screening against a specific subfamily.

Materials:

  • Selected PDB structures (e.g., 4BIS for KDM4A, 5L9B for EGLN2).
  • Molecular modeling suite (Schrödinger Maestro, OpenEye).
  • Pharmacophore generation software (Phase, MOE).

Method:

  • Structure Preparation: Align 3-5 representative co-crystal structures (enzyme-inhibitor complex) of the target subfamily. Remove ligands and water, standardize protonation states.
  • Consensus Pocket Analysis: Superimpose active sites. Define a consensus binding pocket using the alpha spheres method.
  • Pharmacophore Feature Derivation: From the superimposed inhibitors, identify conserved features:
    • JmjC: Metal (Fe²⁺) coordination vector (H-bond acceptor), aKG-mimetic carboxylate, aromatic cage planar group.
    • HIF-PH: Fe²⁺ coordination site, aKG 5-carboxylate binding zone (ionic), hydrophobic tunnel for HIF peptide.
  • Model Validation: Use a decoy set to calculate enrichment factor (EF₁%). Tune feature tolerances to maximize EF₁%.
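
The enrichment factor used in the validation step has a simple closed form: the active rate in the top-ranked fraction divided by the active rate in the whole library. The sketch below computes it from a list of scores and active/decoy labels; the 1,000-compound library is synthetic and arranged so that the arithmetic can be verified by hand.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at a given fraction of the ranked library.

    scores -- model scores (higher = more likely active)
    labels -- 1 for known actives, 0 for decoys
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    actives_top = sum(label for _, label in ranked[:n_top])
    actives_total = sum(labels)
    return (actives_top / n_top) / (actives_total / len(ranked))


# Synthetic library: 1,000 compounds, 10 actives, 5 of them in the top 10.
labels = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 985
scores = list(range(1000, 0, -1))  # already in descending order

ef_1_percent = enrichment_factor(scores, labels, fraction=0.01)
print(ef_1_percent)
```

Here half of the top 1% are actives against a 1% base rate, giving a 50-fold enrichment. Tuning feature tolerances then amounts to re-scoring the library and recomputing this value.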

Protocol 3.3: Biochemical Validation of Predicted Inhibitors

Objective: To experimentally validate computational hits from the optimized CATNIP model.

Materials:

  • Purified recombinant JmjC (e.g., KDM4A) or HIF-PH (e.g., EGLN2) enzyme.
  • Substrates: Methylated histone peptide (for JmjC), HIF-α peptide (for HIF-PH).
  • Detection reagents: Antibody specific for the demethylated histone mark (for JmjC), AlphaLISA or HPLC-MS assay kit.

Method (JmjC Demethylase Assay):

  • In a 50 μL reaction buffer (50 mM HEPES pH 7.5, 50 μM (NH₄)₂Fe(SO₄)₂, 1 mM aKG, 0.01% BSA), mix enzyme (10 nM) with test compound (0-100 μM) for 15 min.
  • Initiate reaction by adding substrate peptide (1 μM). Incubate at 25°C for 30 min.
  • Quench with 10 μL of 1% formic acid.
  • Detection (Option A - Immunoassay): Transfer quenched reaction to AlphaLISA plate. Follow manufacturer's protocol for detection of the demethylated product.
  • Detection (Option B - LC-MS): Analyze by LC-MS to directly quantify substrate consumption and product formation (demethylated peptide).
  • Calculate IC₅₀ values using non-linear regression (GraphPad Prism).
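
Before running a full curve fit, a quick IC₅₀ estimate is useful as a sanity check. The sketch below uses log-linear interpolation across the 50% activity crossing. It is a deliberately simple stand-in for the non-linear regression named above, not a replacement for it, and the dose-response values are invented for illustration. The zero-compound control is excluded because its logarithm is undefined.

```python
import math


def ic50_by_interpolation(concentrations_um, percent_activity):
    """Estimate IC50 by log-linear interpolation across the 50% crossing.

    Expects concentrations in ascending order (all > 0) and activity as a
    percentage of the no-inhibitor control. Returns None if the curve never
    crosses 50%.
    """
    points = list(zip(concentrations_um, percent_activity))
    for (c_lo, a_lo), (c_hi, a_hi) in zip(points, points[1:]):
        if a_lo >= 50.0 >= a_hi and a_lo != a_hi:
            frac = (a_lo - 50.0) / (a_lo - a_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None


# Hypothetical dose-response data (µM, % activity).
estimate = ic50_by_interpolation([0.1, 1.0, 10.0, 100.0], [95.0, 75.0, 25.0, 5.0])
print(round(estimate, 2), "µM")
```

A fitted four-parameter curve should land close to this estimate; a large discrepancy usually points to a poorly defined top or bottom plateau.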

Visualization of Workflows and Pathways

[Diagram: Core CATNIP Model, JmjC Data (Sequences, Structures), and HIF-PH Data (Sequences, Structures) → Feature Extraction & Alignment → Optimized JmjC Model and Optimized HIF-PH Model → Predicted JmjC Inhibitors and Predicted HIF-PH Inhibitors → Biochemical Validation (IC50, Ki)]

Diagram 1: Model Optimization Workflow

[Diagram: Normoxia (High O2) with cofactors (aKG, Fe2+) → HIF-PH Active → hydroxylates HIF-α (Prolyl) → HIF-α (Hydroxylated) → VHL Recognition → Proteasomal Degradation. Hypoxia (Low O2) or a pharmacological inhibitor → HIF-PH Inhibited → no hydroxylation of HIF-α → HIF-α Stable → dimerizes with HIF-β and translocates → Gene Transcription (EPO, VEGF)]

Diagram 2: HIF-PH Oxygen Sensing Pathway

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Reagents for aKG/Fe(II) Enzyme Subfamily Studies

Reagent / Material | Vendor Examples (Catalog #) | Function in Protocol
Recombinant Human Enzymes (KDM4A, EGLN2) | BPS Bioscience (50100, 50110), R&D Systems | Source of purified enzyme for biochemical assays and crystallography.
α-Ketoglutarate (Cell-Permeable) | Sigma-Aldrich (349631), Cayman Chemical (15217) | Cell-based studies to support enzyme cofactor levels.
Active Site-Directed Probe (e.g., JIB-04, IOX1) | Tocris (5660, 5750) | Positive control inhibitors for JmjC demethylase assays.
HIF-PH Clinical Inhibitors (Roxadustat) | MedChemExpress (HY-13426) | Positive control for HIF-PH inhibition assays.
Anti-Hydroxylated HIF-1α Antibody (Pro564) | Novus Biologicals (NB100-139) | Detects HIF-PH activity in cellular lysates via Western blot.
Demethylase Activity Assay Kit (Fluorometric) | Epigentek (P-3075) | Homogeneous assay for JmjC demethylase high-throughput screening.
HIF-PH Activity Assay Kit (AlphaLISA) | PerkinElmer (ALSU-FHG-2) | Bead-based, no-wash assay for HIF-PH inhibition profiling.
Crystallography Screen (Ammonium Sulfate, PEG) | Hampton Research (HR2-144) | Sparse matrix screen for obtaining enzyme-inhibitor co-crystals.
Fe(II) Chelator (2,2'-Bipyridyl) | Sigma-Aldrich (D216305) | Negative control to chelate active site iron and abolish activity.

Integrating CATNIP with Complementary Tools (e.g., antiSMASH, Pfam)

This Application Note details protocols for integrating the CATNIP tool into a comprehensive bioinformatics workflow for the discovery and characterization of alpha-ketoglutarate (αKG) Fe(II)-dependent enzymes. Within the broader thesis research, CATNIP serves as the critical, high-specificity filter for identifying these non-heme iron oxygenases from genomic and metagenomic data. Its predictions gain biological context when combined with tools for domain analysis (Pfam) and biosynthetic gene cluster mining (antiSMASH). This combined approach links primary sequence prediction with functional annotation and ecological insight.

The Scientist's Toolkit: Research Reagent Solutions

The following table lists essential digital "reagents" and resources for executing the integrated workflow.

Tool/Resource Name Category Primary Function in Workflow
CATNIP (Catalytic residue-based Non-heme Iron Protein predictor) Specialized Classifier Predicts αKG/Fe(II)-dependent enzymes using a machine-learning model trained on the 2-His-1-carboxylate facial triad motif.
AntiSMASH (v7.0+) BGC Miner Identifies Biosynthetic Gene Clusters (BGCs) in genomic data, providing context for candidate enzymes (e.g., within NRPS, PKS, or RiPP clusters).
Pfam Database (v35.0+) Domain Database Annotates protein domains using HMMs, confirming the presence of Dioxygenase_N (PF14226) or other oxygenase-related domains.
HMMER (v3.3+) Sequence Search Scans protein sequences against Pfam HMM profiles to obtain domain architecture.
NCBI BLAST+ (v2.13+) Sequence Similarity Performs homology searches against non-redundant databases for preliminary functional clues.
Biopython Programming Library Enables automation of data parsing, tool interoperability, and batch processing.
Local High-Performance Compute Cluster or Cloud Instance (e.g., AWS, GCP) Compute Infrastructure Provides necessary computational power for running genome-scale analyses with AntiSMASH and bulk predictions.

Application Notes & Integrated Protocols

Protocol: Genome-to-Function Pipeline for Novel Oxygenase Discovery

Objective: To identify and preliminarily characterize putative αKG/Fe(II)-dependent enzymes from a novel bacterial genome assembly.

Input: Assembled genome in FASTA format (genome.fna).

Output: An annotated list of high-confidence candidates with genomic context and domain support.

Step 1: Primary Catalytic Residue Prediction with CATNIP

  • Prepare the proteome file. Use prodigal or your preferred gene caller on genome.fna to generate proteome.faa.
  • Run CATNIP prediction.

  • Filter results. Retain entries with prediction probability >0.95 for high-confidence candidates. Extract their sequences into candidates.faa.
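The filtering step above can be sketched in Python. This is a minimal sketch under stated assumptions: the source does not describe CATNIP's output format, so the two-column tab-separated layout (sequence ID, probability) and all function names here are hypothetical.

```python
import io

def parse_fasta(handle):
    """Yield (seq_id, sequence) pairs from a FASTA-formatted handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

def high_confidence_ids(pred_handle, threshold=0.95):
    """Collect IDs with probability above the threshold.

    Assumes a hypothetical TSV layout: seq_id <tab> probability.
    """
    keep = set()
    for line in pred_handle:
        if not line.strip() or line.startswith("#"):
            continue
        seq_id, prob = line.rstrip("\n").split("\t")[:2]
        if float(prob) > threshold:
            keep.add(seq_id)
    return keep

def filter_fasta(fasta_handle, keep_ids):
    """Return FASTA text containing only the retained sequences."""
    return "".join(
        f">{seq_id}\n{seq}\n"
        for seq_id, seq in parse_fasta(fasta_handle)
        if seq_id in keep_ids
    )

# In-memory stand-ins for the prediction table and proteome.faa
predictions = io.StringIO("prot1\t0.99\nprot2\t0.80\nprot3\t0.97\n")
proteome = io.StringIO(">prot1 desc\nMHAD\n>prot2\nMKK\n>prot3\nMHH\nAA\n")
keep = high_confidence_ids(predictions)
candidates_faa = filter_fasta(proteome, keep)
```

In practice the two `StringIO` objects would be replaced by open file handles, and `candidates_faa` written to candidates.faa.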

Step 2: Contextual Genomic Mining with AntiSMASH

  • Run AntiSMASH on the input genome to identify all BGCs.

  • Cross-Reference: Parse the AntiSMASH results (antismash_results/index.html or .json output). Create a mapping of which candidate proteins from candidates.faa are located within any annotated BGC. This strongly suggests a role in natural product biosynthesis.
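Once gene and cluster coordinates have been extracted, the cross-referencing step reduces to an interval-overlap test. The sketch below assumes the coordinates are already available as plain Python structures; extracting them from the AntiSMASH output is not shown, and the input shapes and names are hypothetical.

```python
def genes_in_bgcs(gene_coords, bgc_regions):
    """Map each gene to the BGC regions it overlaps.

    gene_coords: {gene_id: (contig, start, end)}
    bgc_regions: list of (region_id, contig, start, end)
    Coordinates are treated as inclusive.
    """
    hits = {}
    for gene_id, (contig, g_start, g_end) in gene_coords.items():
        for region_id, r_contig, r_start, r_end in bgc_regions:
            same_contig = contig == r_contig
            overlaps = g_start <= r_end and g_end >= r_start
            if same_contig and overlaps:
                hits.setdefault(gene_id, []).append(region_id)
    return hits

# Illustrative coordinates only
gene_coords = {
    "prot1": ("contig1", 1200, 2100),
    "prot3": ("contig1", 90000, 91000),
    "prot7": ("contig2", 500, 1500),
}
bgc_regions = [
    ("region1.1", "contig1", 1000, 25000),
    ("region2.1", "contig2", 4000, 30000),
]
bgc_map = genes_in_bgcs(gene_coords, bgc_regions)
```

Candidates absent from `bgc_map` are not inside any annotated cluster.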

Step 3: Domain Architecture Validation with Pfam

  • Use hmmscan from the HMMER suite to scan candidate sequences against the Pfam database.

  • Parse the domain table output. High-confidence hits to the Dioxygenase_N (PF14226) and/or 2OG-FeII_Oxy (PF03171) domains provide orthogonal validation of CATNIP's structural prediction.
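Parsing the domain table can be sketched as follows. The column positions follow the standard HMMER `--domtblout` layout (target accession in column 2, query name in column 4, independent E-value in column 13). The E-value cutoff of 1e-5 is an assumption for illustration, since the source does not define "high-confidence".

```python
TARGET_ACCESSIONS = {"PF14226", "PF03171"}

def parse_domtblout(lines, max_ievalue=1e-5):
    """Return {query_id: set of matching Pfam accessions}.

    max_ievalue is an assumed cutoff, not a value from the protocol.
    """
    hits = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        # 22 whitespace-separated columns, then a free-text description
        fields = line.split(maxsplit=22)
        accession = fields[1].split(".")[0]  # drop the version suffix
        query = fields[3]
        i_evalue = float(fields[12])
        if accession in TARGET_ACCESSIONS and i_evalue <= max_ievalue:
            hits.setdefault(query, set()).add(accession)
    return hits

# Illustrative rows in domtblout layout (values are made up)
demo_rows = [
    "# target name accession ...",
    "2OG-FeII_Oxy PF03171.23 101 prot1 - 320 1.2e-25 88.1 0.0 1 1 "
    "3.0e-29 1.5e-25 87.8 0.0 2 100 170 270 168 272 0.95 2OG-Fe(II) oxygenase superfamily",
    "2OG-FeII_Oxy PF03171.23 101 prot2 - 300 0.5 5.0 0.0 1 1 "
    "0.1 0.4 4.8 0.0 1 50 10 60 8 62 0.70 2OG-Fe(II) oxygenase superfamily",
    "ABC_tran PF00005.30 137 prot3 - 250 1e-30 100.0 0.0 1 1 "
    "1e-33 2e-30 99.0 0.0 1 137 20 160 20 160 0.98 ABC transporter",
]
pfam_hits = parse_domtblout(demo_rows)
```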

Step 4: Data Integration & Prioritization

Manually or programmatically (using Biopython) integrate the three data streams. Prioritize candidates based on the following hierarchy:

  • Tier 1: CATNIP probability >0.95, located within a BGC, and possesses relevant Pfam domains.
  • Tier 2: CATNIP probability >0.95 and has relevant Pfam domains, but is not in a BGC.
  • Tier 3: CATNIP probability >0.95 but lacks both BGC context and Pfam support (requires further validation).
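The tier hierarchy can be expressed as a small function. This is a sketch with hypothetical names. The stated hierarchy does not cover a candidate that sits in a BGC but lacks Pfam support, so that case is returned as unassigned rather than guessed.

```python
def assign_tier(probability, in_bgc, has_pfam_domain, threshold=0.95):
    """Return 1, 2, or 3 per the stated hierarchy, or None if unassigned."""
    if probability <= threshold:
        return None  # below the high-confidence cutoff
    if in_bgc and has_pfam_domain:
        return 1
    if has_pfam_domain:
        return 2
    if not in_bgc:
        return 3
    # In a BGC without Pfam support: not covered by the three tiers
    return None

tiers = {
    "prot1": assign_tier(0.99, in_bgc=True, has_pfam_domain=True),
    "prot3": assign_tier(0.97, in_bgc=False, has_pfam_domain=True),
    "prot9": assign_tier(0.96, in_bgc=False, has_pfam_domain=False),
    "prot2": assign_tier(0.80, in_bgc=True, has_pfam_domain=True),
}
```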

Table 1: Performance Metrics of Integrated vs. Standalone CATNIP Analysis on a Test Genome (Streptomyces coelicolor A3(2))

Analysis Method Total Proteins Screened Raw CATNIP Hits (P>0.8) Hits After Integration Filter (BGC+Pfam) Final Validation Rate (Confirmed Enzymes / Hits)
CATNIP (Standalone) 7,965 47 N/A ~72% (34/47)*
CATNIP + AntiSMASH + Pfam 7,965 47 18 ~94% (17/18)

*Based on literature curation. The integrated filter reduced the candidate pool by 62% while increasing the precision of the final prediction set.

Workflow & Pathway Visualizations

Figure (workflow diagram): Input: Genome Assembly (genome.fna) → 1. Gene Calling (e.g., Prodigal) → Proteome File (proteome.faa) → 2. CATNIP Prediction (Facial Triad Detection) → High-Confidence Candidate List (P > 0.95) → 3. Genomic Context (AntiSMASH Analysis) and 4. Domain Validation (Pfam HMM Scan), run in parallel → Integrated Results → Prioritized Candidates for Experimental Validation

Workflow for Integrated CATNIP Analysis

Figure (pathway diagram): AntiSMASH: Biosynthetic Gene Cluster → CATNIP Candidate: αKG Fe(II) Enzyme (genomic localization). The candidate binds and activates the BGC-Specific Precursor (e.g., Amino Acid) together with αKG + O₂, and the reaction cycle (Fe(IV)=O intermediate) yields the Modified Natural Product (e.g., Hydroxylated).

Enzyme Role in BGC Pathway

Benchmarking CATNIP: Accuracy, Limitations, and Comparison to Alternative Tools

Within the broader thesis on the development of the CATNIP (Computational Assessment Tool for Non-heme Iron Proteins) platform for the prediction and characterization of Fe(II)/α-ketoglutarate-dependent enzymes, rigorous validation is paramount. This document details the application notes and experimental protocols for assessing the sensitivity and specificity of CATNIP against established biochemical and structural datasets. These validation studies are critical for establishing the tool's reliability for researchers, scientists, and drug development professionals targeting this enzyme class for therapeutic intervention.

Fe(II)/αKG-dependent enzymes are a broad superfamily involved in diverse biological processes, including hypoxia sensing, collagen biosynthesis, epigenetic regulation, and DNA repair. Their central role in disease makes them attractive drug targets. The CATNIP tool aims to predict novel family members, annotate potential function, and identify inhibitor binding pockets from sequence and structural data. This protocol outlines the systematic evaluation of CATNIP's core predictive algorithms to quantify its performance metrics—sensitivity (true positive rate) and specificity (true negative rate)—against gold-standard curated databases.

Quantitative Performance Metrics & Data Tables

Validation was performed against two independent benchmarks: (1) a manually curated set of confirmed Fe(II)/αKG enzymes from the BRENDA and UniProt databases, and (2) a negative set comprising structurally similar but mechanistically distinct enzymes (e.g., other 2-oxoacid-dependent dioxygenases, non-αKG dependent hydroxylases).

Table 1: CATNIP Performance Against Primary Validation Set (n=287 confirmed enzymes)

Metric Calculation Result
True Positives (TP) Correctly identified αKG enzymes 263
False Negatives (FN) Missed αKG enzymes 24
Sensitivity (Recall) TP / (TP + FN) 91.6%
False Positives (FP) Non-αKG enzymes incorrectly identified 19
True Negatives (TN) Non-αKG enzymes correctly rejected 245
Specificity TN / (TN + FP) 92.8%
Precision TP / (TP + FP) 93.3%
F1-Score 2 * (Precision * Recall)/(Precision + Recall) 92.4%
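As a worked check, the metrics in Table 1 can be recomputed from the four counts using the formulas in the Calculation column. The function name is illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the Table 1 metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Counts taken from Table 1
metrics = classification_metrics(tp=263, fn=24, fp=19, tn=245)
as_percent = {name: round(100 * value, 1) for name, value in metrics.items()}
```

Rounded to one decimal place, these reproduce the table's 91.6%, 92.8%, 93.3%, and 92.4%.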

Table 2: Performance Across Major Enzyme Subfamilies

Enzyme Subfamily Examples TP FN Subfamily Sensitivity
Prolyl Hydroxylases PHD2 (EGLN1) 45 2 95.7%
Histone Demethylases KDM4A, KDM6B 67 5 93.1%
Nucleic Acid Demethylases ALKBH2, ALKBH5 38 3 92.7%
Collagen Hydroxylases P4HA1, P4HA2 22 1 95.7%
Other / Putative ASPH, TET2, etc. 91 13 87.5%

Experimental Protocols for Validation

Protocol 3.1: In Silico Validation Using Curated Datasets

Objective: To compute sensitivity and specificity using known positive and negative sequence sets.

Materials: CATNIP software (v2.1), curated FASTA files (Positive_Set.fasta, Negative_Set.fasta), high-performance computing cluster or workstation.

Procedure:

  • Data Preparation: Prepare two multi-FASTA files.
    • Positive_Set.fasta: Contains amino acid sequences for all 287 confirmed human and bacterial Fe(II)/αKG-dependent enzymes.
    • Negative_Set.fasta: Contains 264 sequences from related oxygenase families (e.g., pterin-dependent, flavin-dependent) and inert structural homologs.
  • Batch Submission: Execute CATNIP in batch prediction mode:

  • Result Analysis: The output TSV files contain a Prediction_Confidence score (0-1). A threshold of ≥0.65 is used for a positive call. Tabulate TP, FN, TN, FP.
  • Metric Calculation: Calculate Sensitivity, Specificity, Precision, and F1-score as defined in Table 1.
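Tabulating calls at the ≥0.65 threshold can be sketched as below. The score lists are illustrative stand-ins for the Prediction_Confidence column of each output file; reading the TSV files themselves is not shown because their full layout is not given here.

```python
def tabulate_calls(positive_scores, negative_scores, threshold=0.65):
    """Count TP/FN from the positive set and FP/TN from the negative set."""
    tp = sum(score >= threshold for score in positive_scores)
    fn = len(positive_scores) - tp
    fp = sum(score >= threshold for score in negative_scores)
    tn = len(negative_scores) - fp
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# Made-up confidence scores for demonstration
positive_scores = [0.91, 0.65, 0.40]
negative_scores = [0.10, 0.70, 0.64]
counts = tabulate_calls(positive_scores, negative_scores)
```

A score exactly at the threshold counts as a positive call, matching the "≥0.65" wording.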

Protocol 3.2: Structural Validation via Active Site Profiling

Objective: To validate CATNIP's active site prediction module against solved crystal structures.

Materials: CATNIP software, PyMOL or ChimeraX, PDB files for 50 representative enzymes (e.g., 3OUJ, 4BJ0, 6CH4).

Procedure:

  • Structure Submission: Input each PDB ID into CATNIP's structural analysis module.
  • Prediction Retrieval: Record the predicted Fe(II) and αKG binding residue positions and the coordinating HXD/E...H motif.
  • Ground Truth Comparison: In PyMOL, load the co-crystallized structure with Fe(II) and αKG (or analog). Manually identify residues within 4Å of both cofactors.
  • Accuracy Calculation: For each structure, compute the Jaccard index (Intersection/Union) between predicted and observed binding residues. An average index >0.85 across the set is considered a successful prediction.
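The accuracy calculation can be sketched as a set comparison. Residue labels below are hypothetical examples, not values from any of the listed structures.

```python
def jaccard_index(predicted, observed):
    """Intersection over union of two residue sets (0.0 if both are empty)."""
    predicted, observed = set(predicted), set(observed)
    union = predicted | observed
    if not union:
        return 0.0
    return len(predicted & observed) / len(union)

def mean_jaccard(pairs):
    """Average the index over (predicted, observed) pairs, one per structure."""
    scores = [jaccard_index(pred, obs) for pred, obs in pairs]
    return sum(scores) / len(scores)

# Hypothetical residue labels for two structures
structure_a = ({"H199", "D201", "H279", "R294"},
               {"H199", "D201", "H279", "R294", "K214"})
structure_b = ({"H188", "E190", "H276"},
               {"H188", "E190", "H276"})
score_a = jaccard_index(*structure_a)
average = mean_jaccard([structure_a, structure_b])
passes = average > 0.85
```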

Visualization of Validation Workflow and Core Algorithm

Figure (validation workflow): Validation Inputs → Curated Positive Set (287 αKG enzymes) and Curated Negative Set (264 non-αKG enzymes) → CATNIP Prediction Engine → Result Analysis & Threshold Application → Performance Matrix → True Positives (TP), False Negatives (FN), True Negatives (TN), False Positives (FP) → Calculate Metrics: Sensitivity, Specificity

CATNIP Validation Protocol Workflow

Figure (core algorithm): Input Sequence → three parallel analyses: 1. Motif Scan (HXD/E...H); 2. Fold Recognition (JMJ-C domain); 3. Substrate Binding Pocket Prediction → 4. Consensus Scoring Algorithm → Output: Prediction & Confidence

CATNIP Core Prediction Algorithm

The Scientist's Toolkit: Key Research Reagents & Solutions

Item Function in Validation/Research Example Supplier/Catalog
Recombinant Fe(II)/αKG Enzyme Positive control for biochemical assay validation of CATNIP-predicted function. Novoprotein (custom expression)
α-Ketoglutarate (Sodium Salt) Essential co-substrate for enzyme activity assays. Sigma-Aldrich, 75890
Ascorbic Acid Reducing agent to maintain Fe(II) in its active state in vitro. Thermo Fisher, AAJ61330MC
Ferrous Ammonium Sulfate Source of Fe(II) cofactor for reconstitution of apoenzymes. MilliporeSigma, 215406
Succinate Detection Kit Measures reaction product (succinate) to quantify enzyme activity. Abcam, ab204718
Modified Histone/Peptide Substrates Substrates for validating activity of predicted histone demethylases. Active Motif (custom peptides)
JIB-04 (Broad-Spectrum Inhibitor) Pan-inhibitor control for enzyme inhibition studies following prediction. Tocris, 5759
Protease Inhibitor Cocktail Preserves enzyme integrity during purification and assay. Roche, 4693132001
Chelex 100 Resin Removes trace metals from buffers to control experimental conditions. Bio-Rad, 1422842
Anaerobic Chamber (Coy Labs) For handling oxygen-sensitive Fe(II) enzymes to prevent oxidation. Coy Laboratory Products

Within the broader research on alpha-ketoglutarate (αKG)/Fe(II)-dependent dioxygenases, accurate enzyme prediction is critical for functional annotation and drug target discovery. This analysis compares the CATNIP tool against established bioinformatics methods—BLAST, HMMER, and structure-based predictions—evaluating their performance in identifying and characterizing these enzymes.

Quantitative Performance Comparison

Table 1: Benchmarking of Prediction Tools for αKG/Fe(II)-Dependent Enzymes

Metric CATNIP BLAST (PSI-BLAST) HMMER (Pfam) Structure-Based (e.g., Phyre2)
Primary Principle Motif & chemical context recognition Local sequence similarity Profile hidden Markov models Homology modeling & threading
Sensitivity (%) 98.2 85.5 92.1 88.7
Specificity (%) 99.1 79.8 95.4 93.3
Avg. Runtime (sec/query) 45 12 25 1800+
Key Strength High specificity for functional state Speed, ease of use Detects remote homology Provides 3D structural insights
Key Limitation Limited to known motif families High false positives for distant relatives Dependent on alignment quality Computationally intensive

Table 2: Feature Detection Capability in αKG/Fe(II) Enzymes

Tool HxD...H Motif Fe(II) Binding Site αKG Cofactor Binding Substrate Specificity Prediction
CATNIP Yes (Primary) Indirect (via motif) Indirect (via context) No
BLAST Possible if high similarity No No No
HMMER Yes (via Pfam models) No No Limited (clan membership)
Structure-Based Yes (3D coordinates) Yes (pocket geometry) Yes (pocket geometry) Yes (docking simulations)

Detailed Experimental Protocols

Protocol 1: Comprehensive Enzyme Prediction Workflow

Objective: To systematically identify and annotate potential αKG/Fe(II)-dependent dioxygenases from a novel microbial genome.

Materials & Reagents:

  • Query Genome: FASTA file of predicted protein sequences.
  • Reference Databases: UniProtKB/Swiss-Prot, Pfam (Pfam-A.hmm), PDB.
  • Software Tools: CATNIP web server/standalone, BLAST+ suite, HMMER 3.3.2, structure prediction server (e.g., Phyre2 or ColabFold).
  • Computing Environment: Linux server with multi-core CPU and ≥16GB RAM.

Procedure:

  • Sequence Pre-processing: Format the protein FASTA file. Remove redundant sequences using cd-hit (95% identity threshold).
  • Parallel Tool Execution:
    • CATNIP: Run python catnip.py -i input.fasta -o catnip_results.xml using the default enzyme model.
    • BLAST: Create a local BLAST database of known αKG enzymes. Run psiblast -query input.fasta -db akg_db -out blast_results.txt -outfmt 6 -evalue 1e-10.
    • HMMER: Scan each sequence against Pfam-A and look for hits to the Dioxygenase superfamily model (PF14226). Run hmmscan --cpu 8 --domtblout hmmer_results.dt Pfam-A.hmm input.fasta.
    • Structure-Based: Submit the top 100 unknown sequences (by length) to a batch structure prediction server.
  • Results Integration: Compile all positive hits into a master list. Resolve conflicts where tools disagree by prioritizing CATNIP annotation if supported by at least one other method.
  • Validation: For a subset (10-20 sequences), perform multiple sequence alignment of predicted hits with confirmed enzymes (e.g., human HIF1AN) to visually inspect conservation of the HxD...H motif.
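The conflict-resolution rule in the Results Integration step can be sketched as follows. Only the first branch (CATNIP positive and supported by at least one other method) comes from the protocol; the "manual_review" and "reject" outcomes are assumptions added to make the function total.

```python
def consensus_call(calls):
    """Combine per-tool boolean calls into one decision.

    calls: {"CATNIP": bool, "BLAST": bool, "HMMER": bool, "Structure": bool}
    """
    catnip = bool(calls.get("CATNIP", False))
    others = sum(bool(v) for tool, v in calls.items() if tool != "CATNIP")
    if catnip and others >= 1:
        return "accept"  # rule stated in the protocol
    if catnip or others >= 1:
        return "manual_review"  # assumed handling of unsupported hits
    return "reject"  # assumed handling when no tool reports a hit

decisions = [
    consensus_call({"CATNIP": True, "BLAST": False, "HMMER": True}),
    consensus_call({"CATNIP": True, "BLAST": False, "HMMER": False}),
    consensus_call({"CATNIP": False, "BLAST": True, "HMMER": True}),
    consensus_call({"CATNIP": False, "BLAST": False, "HMMER": False}),
]
```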

Protocol 2: Validation via Multiple Sequence Alignment (MSA) and Motif Logos

Objective: To confirm the prediction of the catalytic Fe(II)-binding motif.

Procedure:

  • Extract the sequence regions surrounding the predicted motif from CATNIP/HMMER hits.
  • Perform MSA using ClustalOmega or MUSCLE.
  • Generate a sequence logo from the MSA using WebLogo. Visually confirm the strong conservation of the H residue, the xD pair, and the second distal H residue spaced ~40-120 residues away.
  • Compare the generated logo to the canonical HxD...H logo from the CATNIP publication to assess prediction quality.
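Before generating a logo, per-column conservation can be checked numerically. This is a minimal sketch that assumes the aligned sequences are already loaded as equal-length strings; the toy alignment is illustrative.

```python
from collections import Counter

def column_conservation(alignment):
    """Return (most common residue, frequency) for each alignment column."""
    n_seqs = len(alignment)
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / n_seqs))
    return result

# Toy alignment covering the first His and the xD pair
alignment = ["HADG", "HVDG", "HLDA"]
conservation = column_conservation(alignment)
```

Fully conserved motif positions appear with a frequency of 1.0, which should correspond to the tallest letters in the WebLogo output.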

Visualizations

Figure (consensus workflow): Query Protein Sequence → Prediction Tool Suite → BLAST (hits), HMMER (domain), CATNIP (motif+), Structure-Based (model) → Consensus Analysis & Priority Scoring → Final Annotation: αKG/Fe(II) Enzyme?

Title: Multi-Tool Consensus Prediction Workflow

Figure (decision tree): Novel αKG Enzyme Research Goal → Rapid initial screening? Yes: BLAST (fast, initial homology check). No → Detect remote homologs? Yes: HMMER (find divergent family members). No → Precision for known motifs? Yes: CATNIP (high-confidence functional prediction). No → 3D mechanism/inhibition? Yes: Structure-Based (study mechanism & drug docking). No: use the combined consensus approach (see Protocol 1).

Title: Tool Selection Decision Tree for Researchers

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Resources for αKG/Fe(II) Enzyme Prediction Research

Item / Resource Function / Purpose Example / Source
Curated Reference Sequence Set Gold-standard positive/negative controls for tool benchmarking. Manually curated from BRENDA, UniProt, and the literature.
Pfam HMM Profile (PF14226) Core model for detecting the Dioxygenase superfamily via HMMER. Pfam database (hosted by EMBL-EBI via InterPro).
CATNIP Model File The trained model defining chemical contexts and motifs for specific prediction. Provided with CATNIP software distribution.
Local BLAST Database Enables fast, customizable sequence similarity searches against relevant enzymes. Compiled using makeblastdb from UniProt references.
Structure Prediction Server Access Generates 3D models for functional site analysis when no crystal structure exists. Phyre2, SWISS-MODEL, or ColabFold.
Multiple Alignment & Logo Software Validates predicted motifs and visualizes residue conservation. ClustalOmega, MUSCLE, WebLogo.
High-Performance Computing (HPC) Cluster Manages resource-intensive parallel runs of multiple tools and structure prediction. Local institutional cluster or cloud compute (AWS, GCP).

Application Notes

This case study demonstrates the application of the Computational Analysis Tool for Novel Iron-dependent Protein (CATNIP) prediction pipeline to successfully identify and characterize novel, functionally diverse α-ketoglutarate (αKG)/Fe(II)-dependent dioxygenases from metagenomic datasets. This work underpins a broader thesis positing that CATNIP enables targeted exploration of understudied enzymatic "dark matter" for applications in biocatalysis and drug discovery.

CATNIP integrates sequence-based hidden Markov model (HMM) profiling with structural homology modeling and conserved motif analysis (His-X-Asp...His) to create a high-specificity screening funnel. Applied to the TARA Oceans metagenomic catalog, the pipeline filtered ~1.2 million candidate ORFs to a high-confidence set of 347 putative novel enzymes.

Table 1: CATNIP Screening Funnel Results from TARA Oceans Metagenome

Screening Stage Number of Sequences Retained Key Filter Criteria
Initial HMM Search (PF03171) ~1,200,000 Match to PF03171 (2OG-FeII_Oxy)
Sequence Quality & Length Filter 582,441 Complete ORF, length 300-400 aa
Catalytic Motif Presence (HxD...H) 15,220 Strict motif conservation
Structural Modeling & Active Site Geometry 2,188 Fe(II) & αKG binding pocket intact
Phylogenetic Divergence (Novel Clades) 347 <40% identity to characterized enzymes

Three novel enzymes (CATNIP-1, -2, -3) were heterologously expressed in E. coli and biochemically characterized. CATNIP-1 showed unprecedented L-arginine hydroxylase activity, while CATNIP-3 exhibited activity on a terpene substrate, indicating functional plasticity.

Table 2: Biochemical Characterization of Selected Novel Enzymes

Enzyme ID Predicted Clade Experimental Substrate Specific Activity (µmol/min/mg) Optimal pH Metal Cofactor Specificity
CATNIP-1 Novel Subclade A L-arginine 0.45 ± 0.03 7.5 Fe(II) (100%), Mn(II) (12%)
CATNIP-2 Novel Subclade B 2-oxoglutarate* 1.20 ± 0.10* 8.0 Fe(II) (100%), Co(II) (5%)
CATNIP-3 Novel Subclade D (S)-limonene 0.08 ± 0.01 6.5 Fe(II) (100%)

*Decarboxylation assay measuring succinate co-product formation.

Protocols

Protocol 1: CATNIP In Silico Prediction Pipeline

Objective: To identify novel αKG/Fe(II)-dependent dioxygenase sequences from complex metagenomic data.

Input: Metagenomic assembly (FASTA format).

  • Primary HMM Search: Use hmmsearch (HMMER v3.3) with PF03171 profile against the metagenomic protein database. E-value threshold: <1e-10.
  • Sequence Curation: Filter sequences for length (300-400 amino acids) using bioawk. Remove fragments and those lacking start/stop codons.
  • Motif Identification: Scan curated sequences for the conserved motif H-X-D-[~50-250 residues]-H using a custom Python regex pattern. Discard sequences without exact motif.
  • Structural Filtering: Submit motif-containing sequences to Phyre2 or RoseTTAFold for 3D model generation. Manually inspect models in PyMOL for conservation of the Fe(II)-binding facial triad (2 His, 1 Asp/Glu) and residues for αKG binding (typically Arg, Lys, Ser).
  • Diversity Selection: Perform multiple sequence alignment (Clustal Omega) with reference enzymes. Build a neighbor-joining tree (MEGA11) and select sequences forming deep-branching, novel clades (<40% identity to known enzymes).
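The length filter and motif scan from the steps above can be sketched with a regular expression. The pattern below is one literal reading of "H-X-D-[~50-250 residues]-H"; the protocol's own custom regex is not given, so this is an assumed equivalent.

```python
import re

# His, any residue, Asp, then 50-250 residues, then the distal His
MOTIF = re.compile(r"H.D.{50,250}H")

def passes_filters(sequence, min_len=300, max_len=400):
    """Apply the 300-400 aa length filter and the exact motif requirement."""
    if not (min_len <= len(sequence) <= max_len):
        return False
    return MOTIF.search(sequence) is not None

# Synthetic test sequences (not real proteins)
seq_ok = "M" * 100 + "HAD" + "G" * 60 + "H" + "A" * 150       # 314 aa, spacing 60
seq_no_motif = "M" * 320                                       # right length, no motif
seq_short = "HAD" + "G" * 60 + "H"                             # motif present, 64 aa
seq_tight = "M" * 100 + "HAD" + "G" * 10 + "H" + "A" * 200     # spacing only 10

results = [passes_filters(s) for s in (seq_ok, seq_no_motif, seq_short, seq_tight)]
```

Because this pattern requires Asp, variants that use Glu in the carboxylate position are discarded at this stage.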

Protocol 2: Expression and Purification of CATNIP Hits

Objective: To produce soluble, active recombinant enzyme for biochemical assays.

  • Cloning: Codon-optimize gene sequences for E. coli and synthesize. Clone into pET-28a(+) vector with N-terminal 6xHis-tag using NdeI and XhoI restriction sites.
  • Expression: Transform into E. coli BL21(DE3) Rosetta2 cells. Grow culture in LB+Kan/Cam at 37°C to OD600 0.6. Induce with 0.5 mM IPTG. Shift temperature to 18°C and incubate for 18 hours.
  • Purification: Lyse cells by sonication in Lysis Buffer (50 mM HEPES pH 7.5, 300 mM NaCl, 10 mM imidazole, 10% glycerol). Clarify lysate and apply to Ni-NTA resin. Wash with 10 column volumes (CV) of Wash Buffer (50 mM HEPES pH 7.5, 300 mM NaCl, 25 mM imidazole). Elute with Elution Buffer (same as lysis but with 250 mM imidazole).
  • Buffer Exchange & Storage: Desalt eluate into Storage Buffer (50 mM HEPES pH 7.5, 100 mM NaCl, 10% glycerol) using a PD-10 column. Flash-freeze in liquid N2 and store at -80°C.

Protocol 3: Standard αKG Decarboxylation Activity Assay

Objective: To quantify αKG turnover as a proxy for dioxygenase activity.

Reaction Mix (200 µL):

  • 50 mM HEPES, pH 7.5
  • 100 µM Fe(II) (as (NH4)2Fe(SO4)2·6H2O, added fresh)
  • 2 mM Sodium Ascorbate
  • 1 mM α-Ketoglutarate
  • 500 µM Putative Substrate (or buffer for control)
  • 1-5 µg Purified Enzyme

Procedure:

  • Pre-incubate all components except enzyme and Fe(II) for 5 min at 25°C.
  • Initiate reaction by sequential addition of Fe(II) and enzyme.
  • Incubate at 25°C for 10 minutes.
  • Quench with 10 µL of 10% (v/v) H2SO4.
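  • Quantify αKG consumption by derivatizing the remaining αKG with 2,4-dinitrophenylhydrazine (DNPH), which reacts with the α-keto group, and measuring A450 against an αKG standard curve. Succinate has no carbonyl for DNPH to react with, so succinate formation is inferred from the αKG consumed.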
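Converting the assay readout to specific activity (µmol/min/mg, the unit used in Table 2) is a unit conversion. The sketch below assumes the standard curve has already given αKG turnover in nmol; the blank-subtraction argument and all input values are illustrative, not data from this study.

```python
def specific_activity(turnover_nmol, blank_nmol, minutes, enzyme_ug):
    """Return specific activity in µmol/min/mg.

    turnover_nmol: αKG turned over in the reaction (nmol)
    blank_nmol:    turnover measured in the control reaction (nmol)
    """
    if minutes <= 0 or enzyme_ug <= 0:
        raise ValueError("minutes and enzyme_ug must be positive")
    net_umol = (turnover_nmol - blank_nmol) / 1000.0  # nmol to µmol
    enzyme_mg = enzyme_ug / 1000.0                    # µg to mg
    return net_umol / minutes / enzyme_mg

# Example: 10 nmol turnover, 2 nmol blank, 10 min, 4 µg enzyme
activity = specific_activity(10.0, 2.0, minutes=10.0, enzyme_ug=4.0)
```

With these inputs the result is 0.2 µmol/min/mg.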

Visualizations

Figure (screening workflow): Metagenomic Database (~10M ORFs) → HMM Search (PF03171), E-value < 1e-10 → ~1.2M seqs → Quality Control & Length Filter → 582K seqs → Catalytic Motif Scan (H-X-D...H) → 15K seqs → Structural Modeling & Active Site Validation → 2.2K seqs → Phylogenetic Analysis & Novelty Selection → 347 seqs → High-Confidence Novel Enzyme Targets

Title: CATNIP Computational Screening Workflow

Figure (reaction scheme): Primary Substrate (e.g., Arginine) + α-Ketoglutarate (Cofactor) + Molecular Oxygen (O₂) + Fe(II) Cofactor → αKG/Fe(II) Enzyme → Hydroxylated Product + Succinate + CO₂

Title: αKG/Fe(II) Dioxygenase Reaction Core

The Scientist's Toolkit: Research Reagent Solutions

Item Function in CATNIP Workflow
PF03171 HMM Profile Core sequence profile for initial identification of 2OG-FeII_Oxy superfamily members.
Fe(II) Stock (Ammonium Iron Sulfate) Source of essential divalent metal cofactor for enzymatic assays; must be prepared fresh.
Sodium Ascorbate Reducing agent to maintain iron in the active Fe(II) state and prevent oxidation during assay.
2,4-Dinitrophenylhydrazine (DNPH) Derivatizing agent for colorimetric quantification of residual αKG (via its α-keto group), used to follow αKG turnover.
pET-28a(+) Vector Standard E. coli expression vector providing a 6xHis-tag for nickel-affinity purification.
BL21(DE3) Rosetta2 Cells Expression host providing tRNA for rare codons, enhancing yield of heterologous metagenomic proteins.
Ni-NTA Resin Immobilized metal affinity chromatography medium for rapid, one-step purification of His-tagged enzymes.
Phyre2 / RoseTTAFold Protein structure prediction servers for in silico validation of active site geometry.

Application Notes

CATNIP (Computational Analysis for Thioredoxin and Non-heme Iron Proteins) has emerged as a valuable in silico tool for predicting and annotating members of the vast and biochemically diverse alpha-ketoglutarate (αKG)/Fe(II)-dependent dioxygenase superfamily. Its primary strength lies in identifying conserved structural motifs, particularly the His-X-Asp...His (HXD...H) iron-binding facial triad. However, reliance on these canonical features means CATNIP may systematically under-predict or misclassify specific subclasses that deviate from the standard model. Awareness of these limitations is critical for accurate genomic mining and functional assignment in drug discovery, where these enzymes are increasingly targeted for conditions like cancer, fibrosis, and hypoxia.

Our analysis, integrating recent literature and benchmarking studies, identifies key enzyme classes with a higher probability of evading standard CATNIP prediction parameters. These classes often involve alterations in the cofactor-binding motif, utilization of alternative cofactors, or structurally distinct active sites.

Table 1: Enzyme Classes with Potential for CATNIP Under-Prediction

Enzyme Class/Subfamily Key Deviation from Canonical αKG/Fe(II) Model Functional Consequence Estimated False Negative Rate*
JmjC-domain Lysine Demethylases (KDM4, KDM5) Variant metal-coordinating residues (e.g., His-X-Glu...His) or additional Zn-finger domains. Epigenetic regulation via histone demethylation. 15-25% for non-canonical variants
Collagen Prolyl 4-Hydroxylases (C-P4H) Requires ascorbate as a stoichiometric reductant; complex (αβ)₂ tetrameric structure. Collagen biosynthesis; a key target in fibrosis. Low prediction for β-subunit function
AlkB Homolog DNA Repair Enzymes (ALKBH2/3) Substrate is nucleic acid (DNA/RNA) vs. protein/small molecule; different binding pocket geometry. Direct reversal of alkylation damage (e.g., 1-meA, 3-meC). High (~30-40%) without specialized training
Hypoxia-Inducible Factor Prolyl Hydroxylases (PHD/EGLN) Strong dependence on molecular oxygen tension; sensitive to oncometabolites (e.g., succinate, fumarate). Oxygen sensing; regulates HIF-1α stability. Low for identification, high for activity prediction
CarC-like β-lactam synthase Performs formal dehydrogenation, not hydroxylation; distinct reaction cycle. Antibiotic biosynthesis. >50% (often mis-annotated)
Trans-acting Viral Enzymes Often highly divergent in sequence; may use alternative structural folds for Fe(II) binding. Viral pathogenesis and host immune evasion. Very High (∼60-80%)

*Estimates based on benchmark against curated experimental datasets.

Experimental Protocols for Validation & Expansion of CATNIP Predictions

Protocol 1: Biochemical Validation of Putative αKG/Fe(II) Enzyme Activity

Objective: To confirm the catalytic function and cofactor dependence of an enzyme identified or missed by CATNIP.

  • Cloning & Expression: Clone the gene of interest into a pET vector with an N-terminal His-tag. Transform into E. coli BL21(DE3). Induce expression with 0.5 mM IPTG at 16°C for 18 hours.
  • Purification: Lyse cells via sonication. Purify the protein using Ni-NTA affinity chromatography, followed by size-exclusion chromatography (Superdex 200) in buffer: 50 mM HEPES pH 7.5, 150 mM NaCl, 5% glycerol.
  • Activity Assay: Set up a 100 µL reaction containing: 50 mM HEPES (pH 7.0), 50 µM (NH₄)₂Fe(SO₄)₂, 1 mM α-ketoglutarate, 2 mM ascorbate, 100 µM substrate (e.g., target peptide, oligonucleotide). Start reaction with 5 µM purified enzyme.
  • Analysis: Incubate at 25°C for 30 min. Quench with 10 µL of 1M HCl. For hydroxylation/demethylation, analyze by LC-MS/MS to detect mass shift of +16 Da or -14 Da, respectively. Quantify succinate co-product using a coupled enzymatic assay monitoring NADH oxidation at 340 nm.
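Interpreting the LC-MS/MS result amounts to matching the observed mass difference to an expected shift. The sketch below uses monoisotopic values for the nominal +16 Da (addition of O) and -14 Da (loss of CH₂) shifts. Those exact masses and the 0.02 Da tolerance are assumptions supplied here, not values from the protocol.

```python
# Monoisotopic shifts corresponding to the nominal +16 / -14 Da changes
EXPECTED_SHIFTS = {
    "hydroxylation": 15.9949,   # gain of one oxygen
    "demethylation": -14.0157,  # loss of CH2
}

def classify_mass_shift(substrate_mass, product_mass, tolerance=0.02):
    """Return the modification name matching the mass difference, or None."""
    delta = product_mass - substrate_mass
    for name, expected in EXPECTED_SHIFTS.items():
        if abs(delta - expected) <= tolerance:
            return name
    return None

# Illustrative masses only
calls = [
    classify_mass_shift(1000.500, 1016.495),
    classify_mass_shift(1000.500, 986.484),
    classify_mass_shift(1000.500, 1001.500),
]
```

The tolerance should be set according to the mass accuracy of the instrument in use.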

Protocol 2: Structural Characterization for Non-Canonical Motif Identification

Objective: To resolve the active site architecture of enzymes with divergent sequences.

  • Crystallization: Concentrate protein to 10 mg/mL. Use sitting-drop vapor diffusion with commercial screens (e.g., Hampton Index). Co-crystallize with Fe(II), αKG, and/or a substrate analog (e.g., N-oxalylglycine).
  • Data Collection & Solution: Collect X-ray diffraction data at a synchrotron source (λ = ~1.0 Å). Solve structure by molecular replacement using a related enzyme model (Phaser in CCP4).
  • Active Site Analysis: In Coot, examine electron density for metal coordination. Map residues coordinating the Fe(II) ion. Document deviations from the HXD...H triad (e.g., HXE...H, HXH...H).

Visualizations

Figure (prediction flow): Genomic/Protein Sequence → CATNIP → Predicted αKG/Fe(II) Enzyme. Classes CATNIP may miss: Class A: Variant Metal Motif (e.g., HXE...H); Class B: Complex Quaternary Structure; Class C: Non-Protein Substrate (e.g., DNA/RNA); Class D: Divergent Viral Enzymes. Each class → Validation → Biochemical Activity Assay and Structural Characterization.

Title: CATNIP Prediction Flow & Key False Negative Classes

Figure (validation workflow): Protein of Interest (CATNIP-negative or ambiguous) → Step 1: Heterologous Expression & Purification (His-tag, SEC) → Step 2: Activity Screen (Fe(II), αKG, Ascorbate, Model Substrate) → LC-MS/MS detects +16 Da / -14 Da / succinate? No: Negative or Alternative Function. Yes → Step 3: Cofactor Dependence (omit Fe(II), use αKG analogs) → Step 4: Structural Analysis (X-ray Crystallography) → Confirmed Novel αKG/Fe(II) Enzyme

Title: Experimental Workflow for Validating Novel Enzymes

The Scientist's Toolkit: Research Reagent Solutions

Reagent / Material Function in αKG/Fe(II) Enzyme Research
N-oxalylglycine (NOG) A stable αKG analog that acts as a competitive inhibitor with respect to αKG. Used in activity assays to inhibit the enzyme and in co-crystallization to trap the enzyme-cofactor complex.
Ferrous Ammonium Sulfate (Fe(II)) Source of the essential Fe(II) cofactor. Must be prepared fresh to prevent oxidation to inactive Fe(III).
Sodium Ascorbate Commonly used reducing agent to maintain iron in the Fe(II) state and, in some enzymes (e.g., collagen P4H), acts as a stoichiometric reductant.
Deuterated α-Ketoglutarate (αKG-⁵⁵) Isotopically labeled substrate. Allows for precise tracking of the reaction via mass spectrometry, confirming succinate production as a signature of αKG turnover.
Hypoxia Mimetics (CoCl₂, DMOG) Dimethyloxalylglycine (DMOG) is a cell-permeable αKG competitor. Used in cellular studies to globally inhibit HIF-PHDs and other αKG-dependent enzymes, simulating hypoxia.
JIB-04 A broad-spectrum, mechanism-based inhibitor of JmjC-domain histone demethylases. Useful as a positive control in epigenetic target validation screens.
Ni-NTA Superflow Resin Standard for immobilised metal affinity chromatography (IMAC) purification of His-tagged recombinant enzymes.
HIF-1α-derived Peptide Substrates Synthetic peptides containing the conserved LXXLAP motif. Essential for specific in vitro activity assays for HIF prolyl hydroxylases (PHDs).

1. Application Notes: CATNIP for Functional Annotation & Drug Discovery

CATNIP (Computed Alpha-Ketoglutarate and Thiol-dependent Non-heme Iron Protein predictor) serves as a specialized tool for the identification and characterization of enzymes within the vast Fe(II)/αKG-dependent dioxygenase superfamily. These enzymes catalyze hydroxylation, demethylation, and other oxidative reactions central to diverse biological processes, including epigenetic regulation, hypoxia sensing, collagen biosynthesis, and DNA repair. Accurate prediction of these enzymes is essential for annotating novel genomes, elucidating metabolic pathways, and identifying novel drug targets, particularly in oncology and metabolic diseases.

Within modern, multi-step bioinformatics pipelines, CATNIP operates as a high-specificity filtering module. It is typically deployed downstream of broader homology search tools (e.g., BLAST, HMMER) to validate and refine candidate sequences. Its integration enhances the accuracy of pathway reconstruction and functional metagenomic analyses by providing confident assignment to this mechanistically distinct enzyme class.

Table 1: Comparison of CATNIP with Broader Enzyme Prediction Tools

| Tool Name | Primary Method | Target Enzyme Class | Key Strength | Typical Position in Pipeline |
| --- | --- | --- | --- | --- |
| CATNIP | Profile Hidden Markov Model (HMM) | Fe(II)/αKG-dependent dioxygenases | High specificity for the 2-His-1-carboxylate facial triad motif | Secondary, validation/refinement |
| BLASTP | Sequence alignment | All protein classes | Fast, broad homology detection | Primary, initial screening |
| HMMER (Pfam) | Profile HMMs | Protein domains/families | Detects remote homology, domain architecture | Primary/Secondary, family assignment |
| ECPred | Machine learning | Enzyme Commission (EC) numbers | General enzyme function prediction | Secondary, functional annotation |

Table 2: Key Catalytic Residues & Motifs Identified by CATNIP

| Motif/Residue | Consensus Pattern | Functional Role in Fe(II)/αKG Enzymes |
| --- | --- | --- |
| Fe(II)-binding motif | H...D/E...H | Forms the 2-His-1-carboxylate facial triad that coordinates the catalytic iron. |
| αKG binding residues | R...S | Anchors the α-ketoglutarate cosubstrate through its C-5 carboxylate; the C-1 carboxylate and C-2 keto group chelate the iron. |
| Substrate-binding cavity | Variable, hydrophobic/aromatic | Determines substrate specificity (e.g., histone lysine, nucleic acid, small molecule). |
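
The iron-binding motif in Table 2 can be used as a quick sequence-level sanity check. The following is a minimal sketch in Python, not CATNIP's actual implementation. The regular expression, the HxD/E spacing, the 30 to 250 residue spacer before the distal His, and the function name `find_facial_triad` are all assumptions made for illustration. A regex match also says nothing about whether the residues coordinate iron in the folded protein.

```python
import re

# Hypothetical pattern: His, any residue, Asp/Glu, then a distal His
# 30-250 residues downstream. The spacer range is an illustrative
# assumption, not a published CATNIP parameter.
TRIAD_PATTERN = re.compile(r"H.[DE].{30,250}?H")


def find_facial_triad(sequence):
    """Return 1-based positions (His1, Asp/Glu, His2) of the first match, or None."""
    match = TRIAD_PATTERN.search(sequence.upper())
    if match is None:
        return None
    start = match.start()  # 0-based index of the first His
    # match.end() is exclusive, so it equals the 1-based position of the distal His
    return (start + 1, start + 3, match.end())
```

A check like this would only serve as a coarse pre-filter; candidates should still be confirmed against the HMM profile and by inspecting the alignment.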

2. Protocols: Integrating CATNIP into a Discovery Pipeline

Protocol 2.1: Identification of Novel Fe(II)/αKG Enzymes from a Metagenomic Assembly

Objective: To identify and annotate putative Fe(II)/αKG-dependent dioxygenases from an assembled metagenomic dataset.

Research Reagent Solutions & Essential Materials:

  • Hardware: High-performance computing cluster or server with multi-core CPUs and sufficient RAM (≥16 GB recommended).
  • Software Environment: Linux/Unix command line, Python 3.7+, Biopython library.
  • Input Data: A FASTA file of predicted protein sequences from a metagenomic assembly (metagenome_proteins.faa).
  • CATNIP Resources: The CATNIP HMM profile file (catnip.hmm), available from the tool's repository.
  • Supporting Tools: HMMER (v3.3+) suite, BLAST+ suite, sequence annotation database (e.g., UniProt/Swiss-Prot).

Methodology:

  • Primary Homology Screening:
    • Run hmmscan from the HMMER suite against the Pfam database to identify proteins containing relevant domains (e.g., Pfam: PF03171, PF13640).
    • Command: hmmscan --cpu 8 --tblout pfam_results.tbl /path/to/pfam_db metagenome_proteins.faa
  • CATNIP-Specific Filtering:
    • Use the CATNIP HMM to scan the initial candidate list or the entire dataset for high-specificity hits.
    • Command: hmmsearch --cpu 8 --tblout catnip_hits.tbl catnip.hmm metagenome_proteins.faa
    • Filter results by a significance threshold (e.g., full-sequence E-value < 1e-10).
  • Sequence Retrieval & Alignment:
    • Extract the sequences of significant hits.
    • Command: seqkit grep -f <(awk '!/^#/ && $5<1e-10 {print $1}' catnip_hits.tbl) metagenome_proteins.faa > catnip_candidates.faa
    • Perform multiple sequence alignment (MSA) using Clustal Omega or MAFFT.
  • Consensus Motif Validation:
    • Visually inspect the MSA (e.g., in Jalview) for conservation of the H...D/E...H iron-binding triad and other key residues.
  • Downstream Analysis:
    • Perform phylogenetic analysis on candidates.
    • Submit candidates to structure prediction servers (e.g., AlphaFold2) to model the active site.
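
The filtering and sequence-retrieval steps above can also be done in Python where seqkit is not available. The sketch below uses only the standard library and assumes the standard `hmmsearch --tblout` layout, in which column 1 is the target name and column 5 is the full-sequence E-value. The function names `read_significant_ids` and `filter_fasta` are hypothetical helpers for illustration, not part of CATNIP or HMMER.

```python
def read_significant_ids(tblout_path, max_evalue=1e-10):
    """Collect target names whose full-sequence E-value is below max_evalue."""
    hits = set()
    with open(tblout_path) as handle:
        for line in handle:
            # Skip comment/header lines and blank lines
            if line.startswith("#") or not line.strip():
                continue
            fields = line.split()
            # Column 1 = target name, column 5 = full-sequence E-value
            if float(fields[4]) < max_evalue:
                hits.add(fields[0])
    return hits


def filter_fasta(fasta_path, keep_ids, out_path):
    """Copy records whose ID (first word after '>') is in keep_ids; return the count."""
    written = 0
    keep = False
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                header = line[1:].split()
                keep = bool(header) and header[0] in keep_ids
                if keep:
                    written += 1
            if keep:
                dst.write(line)
    return written
```

With the file names used in this protocol, calling `read_significant_ids("catnip_hits.tbl")` and passing the result to `filter_fasta("metagenome_proteins.faa", ids, "catnip_candidates.faa")` would mirror the seqkit command above.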

Protocol 2.2: In Vitro Validation of a CATNIP-Predicted Enzyme

Objective: To experimentally confirm the αKG-dependent enzymatic activity of a protein (e.g., a putative histone demethylase) identified via Protocol 2.1.

Research Reagent Solutions & Essential Materials:

  • Protein: Purified recombinant protein (≥95% purity, confirmed by SDS-PAGE).
  • Substrates: Histone H3 peptide trimethylated at lysine 9 (H3K9me3), α-Ketoglutarate (αKG), Ascorbic acid, (NH₄)₂Fe(SO₄)₂·6H₂O.
  • Buffers: HEPES or Tris-HCl buffer (pH 7.0), assay buffer.
  • Equipment: HPLC system with diode array detector (DAD) or mass spectrometer (LC-MS), anaerobic chamber/cuvette for Fe(II) handling, thermomixer.

Methodology:

  • Anaerobic Assay Preparation:
    • Prepare all buffers anaerobically by degassing with argon/nitrogen for 30 minutes. Prepare fresh Fe(II) stock solution in anaerobic 0.1M HCl.
    • In an anaerobic chamber, prepare 100 µL reactions in assay buffer containing: 1-10 µM enzyme, 100 µM H3K9me3 peptide, 100 µM αKG, 100 µM (NH₄)₂Fe(SO₄)₂, 1 mM ascorbate.
  • Reaction Initiation & Incubation:
    • Initiate reactions by adding Fe(II) last. Incubate at the relevant temperature (e.g., 37°C) for 30-60 minutes. Because O₂ is a required cosubstrate, expose the reaction to air during incubation.
  • Reaction Quenching & Analysis:
    • Quench reactions with 10 µL of 10% (v/v) formic acid.
    • Centrifuge at 14,000 × g for 10 minutes to remove precipitated protein.
    • Analyze supernatant by LC-MS to detect the production of succinate (byproduct, m/z = 117.02 in negative mode) and the conversion of H3K9me3 to H3K9me2/me1/me0 (monitored by mass shift).
  • Control Reactions:
    • Include controls lacking enzyme, αKG, or Fe(II). Use a known active enzyme (positive control) if available.
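
The expected LC-MS signals in the analysis step can be estimated from monoisotopic masses. The sketch below is a minimal illustration: it computes the [M-H]⁻ m/z of succinate and the m/z shift expected for each methyl group removed from the peptide (a net loss of CH₂ per demethylation). The helper names and the charge-state handling are assumptions for illustration; they are not part of CATNIP or any vendor software.

```python
# Monoisotopic masses (Da) of the most abundant isotopes
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
PROTON = 1.00727647  # mass of a proton (Da)

SUCCINIC_ACID = {"C": 4, "H": 6, "O": 4}


def monoisotopic_mass(formula):
    """Sum monoisotopic masses for a composition dict such as {"C": 4, "H": 6, "O": 4}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def mz_negative(formula):
    """m/z of the singly deprotonated ion [M-H]- (neutral mass minus one proton)."""
    return monoisotopic_mass(formula) - PROTON


def demethylation_shift(n_methyl_removed, charge=1):
    """m/z shift when n methyl groups are lost (each CH3 -> H, a net loss of CH2)."""
    ch2 = monoisotopic_mass({"C": 1, "H": 2})
    return -n_methyl_removed * ch2 / charge
```

Here `round(mz_negative(SUCCINIC_ACID), 2)` evaluates to 117.02, consistent with the value quoted in the protocol, and `demethylation_shift(1)` is about -14.016 for a singly charged peptide ion. For multiply charged peptide ions, pass the observed charge state so the shift is divided accordingly.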

3. Visualizations

Diagram 1: CATNIP Integrated Prediction Pipeline

  • Metagenomic/Genomic FASTA → Gene Calling (e.g., Prodigal) → Protein Sequence DB
  • Protein Sequence DB → Domain Scan (HMMER vs. Pfam) → Candidate Protein Pool (broad hits)
  • Protein Sequence DB → Homology Search (BLASTP) → Candidate Protein Pool (homologs)
  • Candidate Protein Pool → Specialized Filter (CATNIP HMM) → High-Confidence Fe(II)/αKG Enzymes (E-value < 1e-10)
  • High-Confidence Fe(II)/αKG Enzymes → Downstream Analysis: MSA, Phylogenetics, Structure Modeling
  • Downstream Analysis → Experimental Validation (selected targets)

Diagram 2: Fe(II)/αKG Enzyme Catalytic Cycle

  • Enzyme-Fe(II) (resting state) → Ternary Complex: Enzyme-Fe(II)•αKG•Substrate (substrate and αKG binding)
  • Ternary Complex → O₂ Binding & Decarboxylation (O₂)
  • O₂ Binding & Decarboxylation → High-Valent Fe(IV)=O Intermediate (decarboxylation & O-O cleavage)
  • High-Valent Fe(IV)=O Intermediate → Hydroxylated Product (H-abstraction & rebound)
  • Hydroxylated Product → Succinate Release (product release)
  • Succinate Release → Enzyme-Fe(II) (turnover)

Conclusion

The CATNIP tool advances the *in silico* prediction of biochemically and therapeutically important AKG/FeII-dependent enzymes. By laying out its foundational principles, a clear methodological roadmap, solutions for common analytical challenges, and a validation of its performance, this guide helps researchers navigate this complex enzyme family efficiently. Integrating CATNIP into discovery workflows accelerates the identification of novel drug targets, particularly in epigenetics (e.g., histone demethylase inhibitors) and hypoxia signaling, and enhances the mining of biosynthetic gene clusters for new natural products. Future developments that combine deep learning and AlphaFold structure predictions with CATNIP's logic promise greater precision and could further strengthen its role in biomedical research and next-generation therapeutic development.