🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- HSA sequence saved and poly-G binder generated (Exercise 0)
- AlphaFold Server results with ipTM/pTM/pLDDT recorded (Exercise 1)
- ProteinMPNN redesigned sequence with amino acid composition (Exercise 2)
- Boltz-2 re-prediction with updated metrics (Exercise 3)
- SolubleMPNN refinement in your own Colab notebook (Exercise 4)
- Validation results with improvement plot (Exercise 5)
- Reflection answers (Exercise 6)
File naming: lastname_firstname_R024.ipynb
Exercise 6: Reflection
Goal: Consolidate what you learned from the manual pipeline.
Answer in your notebook (2-3 sentences each):
- What did the poly-G sequence look like structurally after the first prediction? Was it compact or disordered?
- How did the structure change after the first ProteinMPNN redesign?
- At which cycle did you first see improvement in ipTM? If metrics didn't improve, what might explain that?
- What was the biggest challenge — the science, the coding, or wrangling the different platforms?
- Protein Hunter uses the phrase "diffusion hallucination." In your own words, what does it mean for a model to "hallucinate" a protein structure?
- Why is iterative cycling better than a single pass?
What's next? If you want to automate this entire loop, continue to the next route: The Hallucination Ascent (Automated).
Exercise 5: Validation with AlphaFold3
Goal: Feed the SolubleMPNN-refined sequence back into AlphaFold Server to validate improvement.
Step 1: Submit to AlphaFold Server
Return to https://alphafoldserver.com and create a new prediction:
- Chain A: HSA (same as Exercise 1)
- Chain B: Your SolubleMPNN-redesigned sequence
Name the job HSA_binder_cycle2_solMPNN.
Step 2: Record the Improvement
When the results come back, record all metrics. Fill in this table:
| Metric | Cycle 0 (poly-G) | Cycle 1 (MPNN → Boltz-2) | Cycle 2 (SolMPNN → AF3) |
|---|---|---|---|
| ipTM | ___ | ___ | ___ |
| pTM | ___ | ___ | ___ |
| pLDDT | ___ | ___ (complex) | ___ |
Note: Cycle 1 uses Boltz-2 which reports complex-level pLDDT. Cycles 0 and 2 use AlphaFold Server which provides per-chain pLDDT — use chain B's value for those.
Step 3: Visualize the Improvement
import matplotlib.pyplot as plt
cycles = [0, 1, 2]
iptm_scores = [0.0, 0.0, 0.0] # ← Fill in your real values!
plddt_scores = [0.0, 0.0, 0.0] # ← Fill in your real values!
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# ipTM plot
ax1.plot(cycles, iptm_scores, 'o-', color='#d63384', linewidth=2, markersize=10)
ax1.set_xlabel("Cycle", fontsize=13)
ax1.set_ylabel("ipTM", fontsize=13)
ax1.set_title("Interface Confidence (ipTM) Across Cycles", fontsize=14)
ax1.set_ylim(0, 1)
ax1.axhline(y=0.6, color='gray', linestyle='--', alpha=0.5, label='Promising threshold')
ax1.axhline(y=0.8, color='green', linestyle='--', alpha=0.5, label='Strong threshold')
ax1.legend()
ax1.grid(True, alpha=0.3)
# pLDDT plot
ax2.plot(cycles, plddt_scores, 's-', color='#0d6efd', linewidth=2, markersize=10)
ax2.set_xlabel("Cycle", fontsize=13)
ax2.set_ylabel("pLDDT", fontsize=13)
ax2.set_title("Structural Confidence (pLDDT) Across Cycles", fontsize=14)
ax2.set_ylim(0, 100)
ax2.axhline(y=70, color='gray', linestyle='--', alpha=0.5, label='Confident threshold')
ax2.axhline(y=90, color='green', linestyle='--', alpha=0.5, label='Excellent threshold')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("cycle_improvement.png", dpi=150)
plt.show()
Logbook Question
Look at how the ipTM changed from cycle 0 to cycle 2. Why do you think the confidence metrics improve (or don't) with each cycle? Connect your answer to the structure ↔ sequence cycling concept from the introduction.
Success check:
- I submitted the SolMPNN sequence to AlphaFold Server
- ipTM changed from cycle 0: ___ → cycle 2: ___
- I created the improvement plot
Exercise 4: SolubleMPNN Refinement
Goal: Use SolubleMPNN for a second round of inverse folding, biasing toward soluble, expressible proteins.
Why SolubleMPNN?
Standard ProteinMPNN optimizes for foldability alone. SolubleMPNN uses weights trained by excluding transmembrane protein structures, biasing designs toward proteins that are soluble — meaning they won't aggregate or crash out of solution when expressed in the lab. This is critical for real-world protein therapeutics.
The soluble weights ship with the ProteinMPNN repo in the soluble_model_weights/ directory. See the ProteinMPNN repository for full documentation.
Step 1: Set Up Your Own Colab Notebook
This is where it gets real. Unlike the previous steps, there's no pre-made Colab for SolubleMPNN — you will build one yourself.
Connect to a GPU in Colab: Go to Runtime → Change runtime type → Hardware accelerator → GPU (T4)
Create a new Google Colab notebook and run the following setup:
# Cell 1: Clone ProteinMPNN
!git clone -q https://github.com/dauparas/ProteinMPNN.git
%cd ProteinMPNN
import os
print("Soluble weights available:", os.listdir("soluble_model_weights"))
# Cell 2: Upload your JSON from Boltz-2 and extract the mmCIF structure
from google.colab import files
import json
uploaded = files.upload() # Upload the JSON file from Exercise 3
json_filename = list(uploaded.keys())[0]
print(f"Uploaded: {json_filename}")
# Load the Boltz-2 JSON output
with open(json_filename) as f:
    boltz_data = json.load(f)
# Extract the embedded mmCIF structure and save it
# The structure is stored as a string inside the JSON
cif_content = boltz_data["structure"] # or check the actual key name in your JSON
cif_filename = "boltz2_structure.cif"
with open(cif_filename, "w") as f:
    f.write(cif_content)
print(f"Extracted structure to: {cif_filename}")
# Cell 3: Convert CIF to PDB (ProteinMPNN expects PDB format)
!pip install biopython -q
from Bio.PDB import MMCIFParser, PDBIO
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("complex", cif_filename)
io = PDBIO()
io.set_structure(structure)
io.save("complex.pdb")
print("Converted to complex.pdb")
# Cell 4: Run SolubleMPNN on chain B only
!python protein_mpnn_run.py \
--pdb_path "complex.pdb" \
--pdb_path_chains "B" \
--out_folder "solmpnn_output" \
--num_seq_per_target 4 \
--sampling_temp "0.1" \
--use_soluble_model
Fallback: If `--use_soluble_model` isn't recognized, manually point to the weights: `--path_to_model_weights soluble_model_weights/`
# Cell 5: Read the output sequences
import glob
fasta_files = glob.glob("solmpnn_output/seqs/*.fa")
for f in sorted(fasta_files):
    with open(f) as handle:
        print(handle.read())
Step 2: Pick the Best Sequence
From the output sequences, pick the one with the lowest score (most confident). Record it.
# Your SolubleMPNN-redesigned sequence
solmpnn_seq = "YOUR_BEST_SEQUENCE_HERE"
print(f"SolMPNN sequence length: {len(solmpnn_seq)}")
Success check:
- I set up my own Colab notebook for SolubleMPNN
- I converted CIF → PDB using BioPython
- I ran SolubleMPNN with --use_soluble_model flag
- I have a new redesigned sequence of length ___
Exercise 3: Re-Prediction with Boltz-2
Goal: Feed the redesigned sequence back into a structure prediction model (Boltz-2) to refine the backbone.
Why Boltz-2?
The Protein Hunter paper found that Boltz-2 achieved the highest in silico success rate among all diffusion-based models tested. Two key reasons:
- Token-level encoding: Boltz-2 represents each non-canonical residue as a single learned token.
- Rigid-body alignment: Boltz-2 applies rigid-body alignment at every reverse diffusion step.
Step 1: Navigate to NVIDIA's Boltz-2 API
Go to: https://build.nvidia.com/mit/boltz2
This is a free API endpoint hosted by NVIDIA — no GPU required on your end.
Step 2: Set Up the Prediction
You need to create an input with two chains:
Chain A — HSA (same sequence as before)
Chain B — Your redesigned sequence from Exercise 2.
Submit the prediction via the web UI or the API.
Programmatic approach: See the Boltz-2 API Reference for Python snippets using `requests.post()`.
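A minimal sketch of such a request, assuming NIM-style bearer-token authentication. The URL, header, and payload field names below are illustrative assumptions only, not the confirmed schema; check each of them against the Boltz-2 API Reference before running.

```python
# Hypothetical Boltz-2 submission payload: the field names here are
# assumptions, NOT the confirmed API schema -- verify them against the
# Boltz-2 API Reference.
HSA_SEQ = "MKWVTFISLL"     # placeholder: use the full HSA sequence from Exercise 0
BINDER_SEQ = "MKVLDE"      # placeholder: your redesigned chain B from Exercise 2

payload = {
    "polymers": [
        {"id": "A", "molecule_type": "protein", "sequence": HSA_SEQ},
        {"id": "B", "molecule_type": "protein", "sequence": BINDER_SEQ},
    ]
}

def submit_boltz2(api_key, url="https://build.nvidia.com/mit/boltz2"):
    """Send the prediction request. The default URL is the catalog page,
    used here as a placeholder -- substitute the real invocation endpoint
    from the API reference."""
    import requests  # imported here so the payload sketch runs without it
    resp = requests.post(
        url,
        json=payload,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()
```

Whether you submit via the web UI or a script like this, the deliverable is the same: the JSON output file you download in Step 3.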
Step 3: Download the Output
Download the resulting JSON file from Boltz-2. The JSON contains:
- Confidence metrics (ipTM, pLDDT for the complex)
- An embedded mmCIF structure (the predicted 3D coordinates)
Note: Boltz-2 outputs JSON format. Inside the JSON, you'll find an mmCIF-formatted string containing the structure — you'll need to extract and save this as a `.cif` file for use in Exercise 4.
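Field names can vary between Boltz-2 deployments, so inspect the JSON before hard-coding a key. Here is a small helper for locating the structure; the function name and the `data_` heuristic are my own (mmCIF files conventionally begin with a `data_` block header):

```python
import json

def find_cif_key(data: dict):
    """Return the first top-level key whose value looks like an mmCIF
    string (mmCIF files start with a 'data_' block header)."""
    for key, value in data.items():
        if isinstance(value, str) and value.lstrip().startswith("data_"):
            return key
    return None

# Example usage on your downloaded file:
# with open("boltz2_output.json") as f:   # your actual filename
#     data = json.load(f)
# print(sorted(data), "->", find_cif_key(data))
```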
Step 4: Compare Metrics
Record the new confidence metrics:
| Metric | Cycle 0 (AlphaFold3, poly-G) | Cycle 1 (Boltz-2, redesigned) |
|---|---|---|
| ipTM | ___ | ___ |
| pLDDT (complex) | ___ | ___ |
Note: Boltz-2 reports pLDDT for the entire complex, not per-chain. Use the complex-level pLDDT for comparison.
Note: Metrics may not improve monotonically — fluctuations and plateaus are common, especially in early cycles. This is normal. Document the trend.
Success check:
- I submitted redesigned chain B + HSA to Boltz-2
- I downloaded the JSON output file
- I recorded updated metrics: ipTM = ___, pLDDT (complex) = ___
Exercise 2: ProteinMPNN Inverse Folding
Goal: Take the hallucinated backbone of chain B and redesign its sequence using ProteinMPNN.
The Science
ProteinMPNN (Dauparas et al., 2022) is an inverse folding model. Normal folding goes sequence → structure. ProteinMPNN goes structure → sequence: given a backbone, it predicts what amino acid sequence would fold into that shape.
Step 1: Convert CIF → PDB
ProteinMPNN's command-line interface expects PDB format. In a fresh Colab notebook:
!pip -q install biopython
from Bio.PDB import MMCIFParser, PDBIO
cif_path = "YOUR_DOWNLOADED_MODEL.cif" # ← upload from your AlphaFold Server zip
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("complex", cif_path)
io = PDBIO()
io.set_structure(structure)
io.save("cycle0_complex.pdb")
print("Saved PDB → cycle0_complex.pdb")
Step 2: Open a ProteinMPNN Notebook
Option A — ProteinMPNN Quick Demo (Colab): https://colab.research.google.com/github/dauparas/ProteinMPNN/blob/main/colab_notebooks/quickdemo.ipynb
Option B — ColabDesign wrapper (JAX): https://colab.research.google.com/github/sokrypton/ColabDesign/blob/v1.1.1/mpnn/examples/proteinmpnn_in_jax.ipynb
Option C — Run from the cloned repo directly:
!git clone -q https://github.com/dauparas/ProteinMPNN.git
%cd ProteinMPNN
!python protein_mpnn_run.py \
--pdb_path ../cycle0_complex.pdb \
--pdb_path_chains "B" \
--out_folder ../mpnn_cycle0_out \
--num_seq_per_target 8 \
--sampling_temp "0.1"
| Flag | What It Does |
|---|---|
| --pdb_path | Path to the input PDB structure |
| --pdb_path_chains "B" | Which chains to redesign (everything else is held fixed) |
| --num_seq_per_target | How many sequences to sample |
| --sampling_temp "0.1" | Temperature (passed as a string). Lower = more conservative designs |
Step 3: Record the Output
Open the FASTA output in mpnn_cycle0_out/seqs/ and pick the top sequence. Record the output sequence — this is your first real binder candidate!
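ProteinMPNN writes its scores into the FASTA headers (lower score = more confident design). Here is a sketch for picking the lowest-scoring design automatically; the `score=` regex and the assumption that the first record is the fixed input sequence are based on the repo's usual output format, so adjust if your headers differ:

```python
import re

def best_mpnn_sequence(fasta_text: str):
    """Return (score, sequence) for the lowest-scoring (most confident)
    design in a ProteinMPNN output FASTA.

    Assumes headers carry a 'score=<float>' field, as in the ProteinMPNN
    repo's output; adjust the regex if your headers differ."""
    records, header, seq = [], None, []
    for line in fasta_text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line, []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    # The first record is typically the original input sequence -- skip it.
    best = None
    for header, sequence in records[1:]:
        m = re.search(r"\bscore=([0-9.]+)", header)
        if m and (best is None or float(m.group(1)) < best[0]):
            best = (float(m.group(1)), sequence)
    return best
```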
Step 4: Inspect the Sequence
# Paste your redesigned sequence here
redesigned_seq_cycle1 = "YOUR_SEQUENCE_HERE"
# Quick stats
print(f"Length: {len(redesigned_seq_cycle1)} aa")
# Amino acid composition
from collections import Counter
aa_counts = Counter(redesigned_seq_cycle1)
print("Top 5 amino acids:")
for aa, count in aa_counts.most_common(5):
    print(f" {aa}: {count} ({100*count/len(redesigned_seq_cycle1):.1f}%)")
Success check:
- I uploaded the CIF to ProteinMPNN Colab
- I selected Chain B only for redesign
- I have a redesigned amino acid sequence of length ___
- I noted the amino acid composition
Exercise 1: AlphaFold Server Prediction
Goal: Use AlphaFold3 to predict what a poly-G chain might look like when folded alongside HSA.
Step 1: Navigate to AlphaFold Server
Go to https://alphafoldserver.com
Sign in with your Google account. AlphaFold Server is free for academic/non-commercial use.
Plan for server quotas — AlphaFold Server has usage limits. Don't assume you can submit unlimited runs in a single sitting. Space your jobs across sessions.
Step 2: Set Up the Job
- Click "New prediction"
- Create two protein chains:
  - Chain A — Human Serum Albumin (the target):
    MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVATLRETYGEMADCCAKQEPERNECFLQHKDDNPNLPRLVRPEVDVMCTAFHDNEETFLKKYLYEIARRHPYFYAPELLFFAKRYKAAFTECCQAADKAACLLPKLDELRDEGKASSAKQRLKCASLQKFGERAFKAWAVARLSQRFPKAEFAEVSKLVTDLTKVHTECCHGDLLECADDRADLAKYICENQDSISSKLKECCEKPLLEKSHCIAEVENDEMPADLPSLAADFVESKDVCKNYAEAKDVFLGMFLYEYARRHPDYSVVLLLRLAKTYETTLEKCCAAADPHECYAKVFDEFKPLVEEPQNLIKQNCELFEQLGEYKFQNALLVRYTKKVPQVSTPTLVEVSRNLGKVGSKCCKHPEAKRMPCAEDYLSVVLNQLCVLHEKTPVSDRVTKCCTESLVNRRPCFSALEVDETYVPKEFNAETFTFHADICTLSEKERQIKKQTALVELVKHKPKATKEQLKAVMDDFAAFVEKCCKADDKETCFAEEGKKLVAASQAALGL
  - Chain B — Your poly-G binder sequence from Exercise 0
- Name the job HSA_polyG_binder_cycle0
- Click "Submit" and wait (usually 5–30 minutes)
Step 3: Download & Understand the Results
Download the results zip. Inside you'll find:
| File pattern | What It Contains |
|---|---|
| fold_..._model_0.cif | Rank-0 (best) predicted structure |
| fold_..._summary_confidences_0.json | Summary metrics: ipTM, pTM, ranking_score |
| fold_..._full_data_0.json | Per-atom pLDDT, full PAE matrix |
Step 4: Parse the JSON
import json
from pathlib import Path
import numpy as np
def load_af3_summary(path):
    """Load ipTM/pTM from an AlphaFold Server summary JSON."""
    data = json.loads(Path(path).read_text())
    return {
        "iptm": data.get("iptm"),
        "ptm": data.get("ptm"),
        "ranking_score": data.get("ranking_score"),
        "fraction_disordered": data.get("fraction_disordered"),
        "has_clash": data.get("has_clash"),
    }
def mean_plddt_for_chain(full_data_json_path, chain_id="B"):
    """Compute mean pLDDT for a specific chain."""
    data = json.loads(Path(full_data_json_path).read_text())
    atom_plddts = np.array(data["atom_plddts"], dtype=float)
    atom_chain_ids = np.array(data["atom_chain_ids"])
    mask = (atom_chain_ids == chain_id)
    return float(atom_plddts[mask].mean())
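Before your real download arrives, you can sanity-check the chain-masking logic on a tiny mock full-data file; the filename and numbers below are made up:

```python
import json
from pathlib import Path
import numpy as np

# Build a fake full-data file: two atoms in chain A, two in chain B.
mock = {
    "atom_plddts": [90.0, 80.0, 40.0, 60.0],
    "atom_chain_ids": ["A", "A", "B", "B"],
}
Path("mock_full_data.json").write_text(json.dumps(mock))

# Same masking logic as mean_plddt_for_chain, applied to the mock.
data = json.loads(Path("mock_full_data.json").read_text())
plddts = np.array(data["atom_plddts"], dtype=float)
chains = np.array(data["atom_chain_ids"])
mean_b = float(plddts[chains == "B"].mean())
print(f"Mean pLDDT, chain B: {mean_b}")  # (40 + 60) / 2 = 50.0
```

Once this works, swap in your real fold_..._full_data_0.json and chain "B" gives you the binder's confidence.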
Step 5: Logbook Entry
Answer in your notebook:
- What is the ipTM score from cycle 0?
- Does the binder chain (B) look like it contacts HSA, or is it floating away?
- What's your intuition — does a poly-G sequence "deserve" a high confidence score?
Success check:
- I submitted a 2-chain prediction to AlphaFold Server
- I downloaded the results zip
- I recorded ipTM = ___, pTM = ___, pLDDT (chain B) ≈ ___
Exercise 0: Setup and Target
Goal: Prepare your workspace and understand your target protein.
A. Know Your Target: Human Serum Albumin (HSA)
HSA is a ~66.5 kDa transport protein and the most abundant protein in blood plasma. It ferries fatty acids, hormones, drugs, and metal ions through the bloodstream. Designing a protein that binds HSA is therapeutically relevant — HSA-binding peptides are used to extend the half-life of drugs in the body.
Here is the HSA sequence (UniProt P02768), wrapped here for readability. When you paste it into a tool, join the lines into a single continuous string with no spaces or line breaks:
MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFE
DHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVATLRETYGEMADCCAKQEPER
NECFLQHKDDNPNLPRLVRPEVDVMCTAFHDNEETFLKKYLYEIARRHPYFYAPELLFFA
KRYKAAFTECCQAADKAACLLPKLDELRDEGKASSAKQRLKCASLQKFGERAFKAWAVARL
SQRFPKAEFAEVSKLVTDLTKVHTECCHGDLLECADDRADLAKYICENQDSISSKLKECCE
KPLLEKSHCIAEVENDEMPADLPSLAADFVESKDVCKNYAEAKDVFLGMFLYEYARRHPDY
SVVLLLRLAKTYETTLEKCCAAADPHECYAKVFDEFKPLVEEPQNLIKQNCELFEQLGEYK
FQNALLVRYTKKVPQVSTPTLVEVSRNLGKVGSKCCKHPEAKRMPCAEDYLSVVLNQLCV
LHEKTPVSDRVTKCCTESLVNRRPCFSALEVDETYVPKEFNAETFTFHADICTLSEKERQI
KKQTALVELVKHKPKATKEQLKAVMDDFAAFVEKCCKADDKETCFAEEGKKLVAASQAALGL
Save this sequence — you'll use it many times.
B. Generate Your Binder Seed Sequence
Our binder will start as a stretch of Glycine (G) residues — a simple placeholder. Because AlphaFold Server does not accept the "X" token, we use poly-G as a low-information stand-in.
import random
# Choose a binder length between 70 and 150 residues
binder_length = random.randint(70, 150)
binder_seq = "G" * binder_length
print(f"Binder length: {binder_length} residues")
print(f"Binder sequence: {binder_seq}")
Write down your binder_length — you'll need it later.
Success check:
- I have the HSA sequence saved
- I generated a poly-G binder sequence of length ___
- I understand the cycle: Hallucinate → Redesign → Re-predict → Repeat
Key Metrics
| Metric | What It Measures | Good Values |
|---|---|---|
| pLDDT | Per-residue confidence in the predicted structure (0–100) | > 70 is confident; > 90 is excellent |
| ipTM | Predicted quality of the interface between two chains (0–1) | > 0.6 is promising; > 0.8 is strong |
| pTM | Predicted quality of the overall fold (0–1) | Higher is better |
| PAE | Predicted Aligned Error between residue pairs (lower = better) | < 5 Å at the interface |
These are model confidence signals, not experimental measurements. Use them as relative indicators across your cycles, not as absolute proof of binding.
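The bands in the table can be encoded as small helpers for your logbook; the function names are mine, the thresholds come from the table above:

```python
def interpret_iptm(iptm):
    """Qualitative band for an ipTM value (0-1), per the table above."""
    if iptm > 0.8:
        return "strong"
    if iptm > 0.6:
        return "promising"
    return "weak"

def interpret_plddt(plddt):
    """Qualitative band for a pLDDT value (0-100), per the table above."""
    if plddt > 90:
        return "excellent"
    if plddt > 70:
        return "confident"
    return "low"

print(interpret_iptm(0.72), interpret_plddt(85))  # promising confident
```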
The Science
Protein Hunter's insight is beautifully simple:
┌─────────────────────────────┐
│ │
▼ │
Placeholder ┌──────────────┐ Hallucinated │
Sequence ───▶ │ Diffusion │ ──▶ Backbone ───┤
(all-X/G) │ Model │ │
│ (AF3/Boltz) │ │
└──────────────┘ │
│
┌──────────────┐ Redesigned │
│ ProteinMPNN │ ◀── Backbone │
│ (Inverse │ │
│ Folding) │ │
└──────┬───────┘ │
│ │
▼ │
New Sequence ──────────────────┘
- Hallucination: Feed the model a placeholder sequence alongside a real target. The diffusion model's learned priors force it to hallucinate a well-folded backbone.
- Inverse Folding: ProteinMPNN looks at the hallucinated backbone and asks: "What amino acid sequence would fold into this shape?"
- Re-prediction: Feed the new sequence back into a structure-prediction model. The backbone improves.
- Iterate: Each cycle refines both structure and sequence. Confidence scores tend to climb.
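The four steps above can be sketched as a loop. Everything here is placeholder pseudocode: the two inner functions stand in for the manual web/Colab steps of this route, and the ipTM trend they produce is invented for illustration.

```python
import random

# Pseudocode sketch of the structure <-> sequence loop. The two inner
# functions are stubs for manual steps in this route, NOT real APIs.
def predict_structure(target_seq, binder_seq, cycle):
    """Stub for the AlphaFold Server / Boltz-2 prediction step."""
    backbone = f"backbone_cycle{cycle}"          # placeholder object
    metrics = {"iptm": round(min(0.2 + 0.2 * cycle, 1.0), 2)}  # invented
    return backbone, metrics

def redesign_sequence(backbone, chain="B"):
    """Stub for the ProteinMPNN / SolubleMPNN inverse-folding step."""
    return "".join(random.choice("ACDEFGHIKLMNPQRSTVWY") for _ in range(100))

def hallucination_ascent(target_seq, n_cycles=3):
    binder = "G" * 100                           # Exercise 0: poly-G seed
    iptm_history = []
    for cycle in range(n_cycles):
        backbone, metrics = predict_structure(target_seq, binder, cycle)
        iptm_history.append(metrics["iptm"])
        binder = redesign_sequence(backbone, chain="B")
    return binder, iptm_history

binder, history = hallucination_ascent("TARGETSEQ")
print(history)  # invented upward trend from the stub
```

In this route you play the role of the loop body yourself, carrying sequences and structures between four platforms by hand.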
Citation: Cho, Y., Rangel, G., Bhardwaj, G., & Ovchinnikov, S. (2025). Protein Hunter: exploiting structure hallucination within diffusion for protein design. bioRxiv. https://doi.org/10.1101/2025.10.10.681530
Why this route exists
In earlier routes you learned to wrangle data, write functions, and explore protein sequences. Now you step into the world those tools were built for: computational protein design.
A recent preprint called Protein Hunter (Cho et al., 2025) showed something remarkable: you can start from literally nothing — a string of unknown amino acids — and coax a diffusion-based structure-prediction model into hallucinating a well-folded protein backbone. Then, by cycling between sequence redesign and structure re-prediction, you can iteratively improve the design.
In this route, you'll recreate a class-friendly version of this pipeline manually — using web interfaces and Colab notebooks to understand each step. You'll design a novel binder for Human Serum Albumin (HSA).
Success in this route = you can run the loop, keep clean logs, and explain metric changes — not that you "made a real binder." ipTM may stay low or fluctuate. That's normal. Document it and explain why.
By the end, you can:
- Use AlphaFold Server to predict protein complex structures
- Convert between CIF and PDB formats using BioPython
- Run ProteinMPNN and SolubleMPNN for inverse folding
- Use the Boltz-2 API for structure prediction
- Track confidence metrics (ipTM, pLDDT) across design cycles
- Explain the structure ↔ sequence cycling approach
Route: The Hallucination Ascent (Manual)
- RouteID: R024
- Wall: Protein Design (W07)
- Grade: 5.11c
- Routesetter: Abhiram
- Time: ~2-3 hours (multi-session recommended)
- Target Protein: Human Serum Albumin (HSA)
- Key Paper: Protein Hunter — Cho et al., 2025
This is a hard route. You will use four different platforms, handle real structural biology files (CIF/PDB), and execute a cutting-edge protein design workflow. Take your time.
🧗 Base Camp
Start here and climb your way up!