Submission

Submit your notebook here

Deliverables

Submit your completed notebook (.ipynb) with:

HSA sequence saved and poly-G binder generated (Exercise 0)
AlphaFold Server results with ipTM/pTM/pLDDT recorded (Exercise 1)
ProteinMPNN redesigned sequence with amino acid composition (Exercise 2)
Boltz-2 re-prediction with updated metrics (Exercise 3)
SolubleMPNN refinement in your own Colab notebook (Exercise 4)
Validation results with improvement plot (Exercise 5)
Reflection answers (Exercise 6)

File naming: lastname_firstname_R024.ipynb

Exercise 6: Reflection

Goal: Consolidate what you learned from the manual pipeline.

Answer in your notebook (2-3 sentences each):

What did the poly-G sequence look like structurally after the first prediction? Was it compact or disordered?
How did the structure change after the first ProteinMPNN redesign?
At which cycle did you first see improvement in ipTM? If metrics didn't improve, what might explain that?
What was the biggest challenge — the science, the coding, or wrangling the different platforms?
Protein Hunter uses the phrase "diffusion hallucination." In your own words, what does it mean for a model to "hallucinate" a protein structure?
Why is iterative cycling better than a single pass?

What's next? If you want to automate this entire loop, continue to the next route: The Hallucination Ascent (Automated).

Exercise 5: Validation with AlphaFold3

Goal: Feed the SolubleMPNN-refined sequence back into AlphaFold Server to validate improvement.

Step 1: Submit to AlphaFold Server

Return to https://alphafoldserver.com and create a new prediction:

Chain A: HSA (same as Exercise 1)
Chain B: Your SolubleMPNN-redesigned sequence

Name the job HSA_binder_cycle2_solMPNN.

Step 2: Record the Improvement

When the results come back, record all metrics. Fill in this table:

Metric	Cycle 0 (poly-G)	Cycle 1 (MPNN → Boltz-2)	Cycle 2 (SolMPNN → AF3)
ipTM	___	___	___
pTM	___	___	___
pLDDT	___	___ (complex)	___

Note: Cycle 1 uses Boltz-2 which reports complex-level pLDDT. Cycles 0 and 2 use AlphaFold Server which provides per-chain pLDDT — use chain B's value for those.
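If you would rather keep this table as a file than retype it, a small CSV log is one option. The following is a minimal sketch, not part of the original route: the names `record_cycle` and `write_log`, the column names, and the default file name `metrics_log.csv` are all assumptions chosen for illustration. The `plddt_scope` column is there to record whether a pLDDT value is per-chain or complex-level, as the note above describes.

```python
import csv

# Column names are illustrative assumptions, not a required schema.
FIELDS = ["cycle", "model", "iptm", "ptm", "plddt", "plddt_scope"]

def record_cycle(log, cycle, model, iptm, ptm, plddt, plddt_scope):
    """Append one row of metrics to the in-memory log (a list of dicts)."""
    log.append({
        "cycle": cycle,
        "model": model,
        "iptm": iptm,
        "ptm": ptm,
        "plddt": plddt,
        "plddt_scope": plddt_scope,  # e.g. "chain B" or "complex"
    })
    return log

def write_log(log, path="metrics_log.csv"):
    """Write the log to CSV so it survives a Colab session reset."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(log)
    return path
```

Call `record_cycle` once per cycle with your own values, then `write_log(log)` and download the CSV alongside your notebook.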

Step 3: Visualize the Improvement

import matplotlib.pyplot as plt

cycles = [0, 1, 2]
iptm_scores = [0.0, 0.0, 0.0]   # ← Fill in your real values!
plddt_scores = [0.0, 0.0, 0.0]  # ← Fill in your real values!

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# ipTM plot
ax1.plot(cycles, iptm_scores, 'o-', color='#d63384', linewidth=2, markersize=10)
ax1.set_xlabel("Cycle", fontsize=13)
ax1.set_ylabel("ipTM", fontsize=13)
ax1.set_title("Interface Confidence (ipTM) Across Cycles", fontsize=14)
ax1.set_ylim(0, 1)
ax1.axhline(y=0.6, color='gray', linestyle='--', alpha=0.5, label='Promising threshold')
ax1.axhline(y=0.8, color='green', linestyle='--', alpha=0.5, label='Strong threshold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# pLDDT plot
ax2.plot(cycles, plddt_scores, 's-', color='#0d6efd', linewidth=2, markersize=10)
ax2.set_xlabel("Cycle", fontsize=13)
ax2.set_ylabel("pLDDT", fontsize=13)
ax2.set_title("Structural Confidence (pLDDT) Across Cycles", fontsize=14)
ax2.set_ylim(0, 100)
ax2.axhline(y=70, color='gray', linestyle='--', alpha=0.5, label='Confident threshold')
ax2.axhline(y=90, color='green', linestyle='--', alpha=0.5, label='Excellent threshold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("cycle_improvement.png", dpi=150)
plt.show()

Logbook Question

Look at how the ipTM changed from cycle 0 to cycle 2. Why do you think the confidence metrics improve (or don't) with each cycle? Connect your answer to the structure ↔ sequence cycling concept from the introduction.

Success check:

I submitted the SolMPNN sequence to AlphaFold Server
ipTM changed from cycle 0: ___ → cycle 2: ___
I created the improvement plot

Exercise 4: SolubleMPNN Refinement

Goal: Use SolubleMPNN for a second round of inverse folding, biasing toward soluble, expressible proteins.

Why SolubleMPNN?

Standard ProteinMPNN has no explicit preference for solubility. SolubleMPNN uses weights trained on a dataset that excludes transmembrane protein structures, which biases designs toward soluble proteins: ones that are less likely to aggregate or crash out of solution when expressed in the lab. This matters for real-world protein therapeutics.

The soluble weights ship with the ProteinMPNN repo in the soluble_model_weights/ directory. See the ProteinMPNN repository for full documentation.

Step 1: Set Up Your Own Colab Notebook

This is where it gets real. Unlike the previous steps, there's no pre-made Colab for SolubleMPNN — you will build one yourself.

Create a new Google Colab notebook and connect it to a GPU: go to Runtime → Change runtime type → Hardware accelerator → GPU (T4).

Then run the following setup:

# Cell 1: Clone ProteinMPNN
!git clone -q https://github.com/dauparas/ProteinMPNN.git
%cd ProteinMPNN

import os
print("Soluble weights available:", os.listdir("soluble_model_weights"))

# Cell 2: Upload your JSON from Boltz-2 and extract the mmCIF structure
from google.colab import files
import json

uploaded = files.upload()  # Upload the JSON file from Exercise 3

json_filename = list(uploaded.keys())[0]
print(f"Uploaded: {json_filename}")

# Load the Boltz-2 JSON output
with open(json_filename) as f:
    boltz_data = json.load(f)

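# Extract the embedded mmCIF structure and save it.
# The structure is stored as a string inside the JSON, but the key name can
# differ between API versions. This tries a top-level "structure" string first,
# then (as an assumption) a "structures" list of records with a "structure" field.
if isinstance(boltz_data.get("structure"), str):
    cif_content = boltz_data["structure"]
elif boltz_data.get("structures"):
    cif_content = boltz_data["structures"][0]["structure"]
else:
    raise KeyError(f"No structure found; available keys: {list(boltz_data.keys())}")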
cif_filename = "boltz2_structure.cif"
with open(cif_filename, "w") as f:
    f.write(cif_content)
print(f"Extracted structure to: {cif_filename}")

# Cell 3: Convert CIF to PDB (ProteinMPNN expects PDB format)
!pip install biopython -q

from Bio.PDB import MMCIFParser, PDBIO

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("complex", cif_filename)

io = PDBIO()
io.set_structure(structure)
io.save("complex.pdb")
print("Converted to complex.pdb")
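Before running inverse folding, it can help to confirm that the converted PDB still contains both chains at the lengths you expect. The helper below is a minimal sketch using only the fixed-column PDB layout (atom name in columns 13-16, chain ID in column 22). The function name `count_residues_per_chain` is an assumption for illustration, and counting C-alpha atoms as residues assumes a predicted structure with no alternate locations.

```python
def count_residues_per_chain(pdb_path):
    """Count C-alpha atoms per chain ID in a PDB file (one CA per residue)."""
    counts = {}
    with open(pdb_path) as handle:
        for line in handle:
            # Only ATOM records carry polymer coordinates; skip everything else.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                chain_id = line[21]
                counts[chain_id] = counts.get(chain_id, 0) + 1
    return counts
```

In your notebook, `print(count_residues_per_chain("complex.pdb"))` should show one entry for HSA and one for your binder, with the binder count matching your binder_length. If the chain IDs are not A and B, adjust `--pdb_path_chains` in the next cell to match.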

# Cell 4: Run SolubleMPNN on chain B only
!python protein_mpnn_run.py \
    --pdb_path "complex.pdb" \
    --pdb_path_chains "B" \
    --out_folder "solmpnn_output" \
    --num_seq_per_target 4 \
    --sampling_temp "0.1" \
    --use_soluble_model

Fallback: If --use_soluble_model isn't recognized, manually point to the weights: --path_to_model_weights soluble_model_weights/

# Cell 5: Read the output sequences
import glob

fasta_files = glob.glob("solmpnn_output/seqs/*.fa")
for f in sorted(fasta_files):
    with open(f) as handle:
        print(handle.read())

Step 2: Pick the Best Sequence

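From the output sequences, pick the sampled design with the lowest score. The score is an average negative log-probability, so lower means the model is more confident. The first record in the FASTA file is the input sequence, not a design, so choose among the records that follow it. Record your choice.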

# Your SolubleMPNN-redesigned sequence
solmpnn_seq = "YOUR_BEST_SEQUENCE_HERE"
print(f"SolMPNN sequence length: {len(solmpnn_seq)}")
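If you prefer not to compare scores by eye, the selection can be sketched in a few lines. This is a minimal sketch under a stated assumption: that sampled designs have headers of the form `>T=0.1, sample=1, score=0.7291, global_score=0.9330, seq_recovery=0.5736` and that each sequence sits on a single line. The function names `parse_mpnn_fasta` and `best_design` are hypothetical. Check one of your own headers before relying on it.

```python
import re

def parse_mpnn_fasta(text):
    """Return (score, sequence) pairs for the sampled designs in a FASTA string."""
    designs = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line
        elif line and header is not None:
            # Match "score=" but not "global_score=".
            match = re.search(r"(?<![a-z_])score=([0-9.]+)", header)
            # Records without "sample=" (the input sequence) are skipped.
            if "sample=" in header and match:
                designs.append((float(match.group(1)), line))
            header = None
    return designs

def best_design(text):
    """Pick the sampled design with the lowest score."""
    designs = parse_mpnn_fasta(text)
    if not designs:
        raise ValueError("No sampled designs found; check the header format.")
    return min(designs, key=lambda pair: pair[0])
```

For example, `score, seq = best_design(open(fasta_files[0]).read())` would give you a candidate to paste into `solmpnn_seq`.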

Success check:

I set up my own Colab notebook for SolubleMPNN
I converted CIF → PDB using BioPython
I ran SolubleMPNN with --use_soluble_model flag
I have a new redesigned sequence of length ___

Exercise 3: Re-Prediction with Boltz-2

Goal: Feed the redesigned sequence back into a structure prediction model (Boltz-2) to refine the backbone.

Why Boltz-2?

The Protein Hunter paper found that Boltz-2 achieved the highest in silico success rate among all diffusion-based models tested. Two key reasons:

Token-level encoding: Boltz-2 represents each non-canonical residue as a single learned token.
Rigid-body alignment: Boltz-2 applies rigid-body alignment at every reverse diffusion step.

Step 1: Navigate to NVIDIA's Boltz-2 API

Go to: https://build.nvidia.com/mit/boltz2

This is a free API endpoint hosted by NVIDIA — no GPU required on your end.

Step 2: Set Up the Prediction

You need to create an input with two chains:

Chain A — HSA (same sequence as before)

Chain B — Your redesigned sequence from Exercise 2.

Submit the prediction via the web UI or the API.

Programmatic approach: See the Boltz-2 API Reference for Python snippets using requests.post().

Step 3: Download the Output

Download the resulting JSON file from Boltz-2. The JSON contains:

Confidence metrics (ipTM, pLDDT for the complex)
An embedded mmCIF structure (the predicted 3D coordinates)

Note: Boltz-2 outputs JSON format. Inside the JSON, you'll find an mmCIF-formatted string containing the structure — you'll need to extract and save this as a .cif file for use in Exercise 4.
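Because the exact key names in the JSON are not listed here, one cautious way to find the confidence metrics is to search the parsed JSON for keys that look like them. The sketch below makes no assumption about the schema beyond "it is nested dicts and lists". The function name `find_keys` and the default search terms are illustrative choices.

```python
def find_keys(obj, needles=("iptm", "ptm", "plddt"), path=""):
    """Recursively list (path, value) pairs whose key contains any needle."""
    hits = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            here = f"{path}.{key}" if path else str(key)
            if any(needle in str(key).lower() for needle in needles):
                hits.append((here, value))
            else:
                # Keep descending only where the key itself did not match.
                hits.extend(find_keys(value, needles, here))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            hits.extend(find_keys(item, needles, f"{path}[{index}]"))
    return hits
```

After `json.load`, loop over `find_keys(your_data)` and print each path to see where ipTM and pLDDT live in your file, then record those values in the table in Step 4.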

Step 4: Compare Metrics

Record the new confidence metrics:

Metric	Cycle 0 (AlphaFold3, poly-G)	Cycle 1 (Boltz-2, redesigned)
ipTM	___	___
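pLDDT	___ (chain B)	___ (complex)

Note: Boltz-2 reports pLDDT for the entire complex, not per-chain, so the Cycle 1 value is complex-level while the Cycle 0 value from Exercise 1 is for chain B. Keep that difference in mind when you compare them.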

Note: Metrics may not improve monotonically — fluctuations and plateaus are common, especially in early cycles. This is normal. Document the trend.
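To document the trend rather than just the endpoints, you could summarize the per-cycle changes. This is a minimal sketch: the function name `summarize_trend` and the fields it returns are assumptions for illustration, and it works on any list of scores indexed by cycle number.

```python
def summarize_trend(scores):
    """Summarize how a metric moved across cycles (list index = cycle number)."""
    if len(scores) < 2:
        raise ValueError("Need at least two cycles to describe a trend.")
    # Change from each cycle to the next one.
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return {
        "deltas": deltas,
        "best_cycle": max(range(len(scores)), key=lambda i: scores[i]),
        "monotonic_increase": all(delta > 0 for delta in deltas),
        "net_change": scores[-1] - scores[0],
    }
```

A result with `monotonic_increase` set to False is not a failure; it is exactly the kind of fluctuation the note above tells you to expect and explain.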

Success check:

I submitted redesigned chain B + HSA to Boltz-2
I downloaded the JSON output file
I recorded updated metrics: ipTM = ___, pLDDT (complex) = ___

Exercise 2: ProteinMPNN Inverse Folding

Goal: Take the hallucinated backbone of chain B and redesign its sequence using ProteinMPNN.

The Science

ProteinMPNN (Dauparas et al., 2022) is an inverse folding model. Normal folding goes sequence → structure. ProteinMPNN goes structure → sequence: given a backbone, it predicts what amino acid sequence would fold into that shape.

Step 1: Convert CIF → PDB

ProteinMPNN's command-line interface expects PDB format. In a fresh Colab notebook:

!pip -q install biopython

from Bio.PDB import MMCIFParser, PDBIO

cif_path = "YOUR_DOWNLOADED_MODEL.cif"   # ← upload from your AlphaFold Server zip
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("complex", cif_path)

io = PDBIO()
io.set_structure(structure)
io.save("cycle0_complex.pdb")
print("Saved PDB → cycle0_complex.pdb")

Step 2: Open a ProteinMPNN Notebook

Option A — ProteinMPNN Quick Demo (Colab): https://colab.research.google.com/github/dauparas/ProteinMPNN/blob/main/colab_notebooks/quickdemo.ipynb

Option B — ColabDesign wrapper (JAX): https://colab.research.google.com/github/sokrypton/ColabDesign/blob/v1.1.1/mpnn/examples/proteinmpnn_in_jax.ipynb

Option C — Run from the cloned repo directly:

!git clone -q https://github.com/dauparas/ProteinMPNN.git
%cd ProteinMPNN

!python protein_mpnn_run.py \
    --pdb_path ../cycle0_complex.pdb \
    --pdb_path_chains "B" \
    --out_folder ../mpnn_cycle0_out \
    --num_seq_per_target 8 \
    --sampling_temp "0.1"

Flag	What It Does
`--pdb_path`	Path to the input PDB structure
`--pdb_path_chains "B"`	Which chains to redesign (everything else is held fixed)
`--num_seq_per_target`	How many sequences to sample
`--sampling_temp "0.1"`	Temperature (string). Lower = more conservative designs

Step 3: Record the Output

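Open the FASTA output in mpnn_cycle0_out/seqs/ and pick the sampled design with the lowest score (lower means the model is more confident). The first record in the file is your poly-G input, not a design, so skip it. Record the output sequence: this is your first real binder candidate!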

Step 4: Inspect the Sequence

# Paste your redesigned sequence here
redesigned_seq_cycle1 = "YOUR_SEQUENCE_HERE"

# Quick stats
print(f"Length: {len(redesigned_seq_cycle1)} aa")

# Amino acid composition
from collections import Counter
aa_counts = Counter(redesigned_seq_cycle1)
print("Top 5 amino acids:")
for aa, count in aa_counts.most_common(5):
    print(f"  {aa}: {count} ({100*count/len(redesigned_seq_cycle1):.1f}%)")

Success check:

I converted the CIF to PDB and loaded it into ProteinMPNN
I selected Chain B only for redesign
I have a redesigned amino acid sequence of length ___
I noted the amino acid composition

Exercise 1: AlphaFold Server Prediction

Goal: Use AlphaFold3 to predict what a poly-G chain might look like when folded alongside HSA.

Step 1: Navigate to AlphaFold Server

Go to https://alphafoldserver.com

Plan for server quotas — AlphaFold Server has usage limits. Don't assume you can submit unlimited runs in a single sitting. Space your jobs across sessions.

Step 2: Set Up the Job

Click "New prediction"

Create two protein chains:

Chain A — Human Serum Albumin (the target):

MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVATLRETYGEMADCCAKQEPERNECFLQHKDDNPNLPRLVRPEVDVMCTAFHDNEETFLKKYLYEIARRHPYFYAPELLFFAKRYKAAFTECCQAADKAACLLPKLDELRDEGKASSAKQRLKCASLQKFGERAFKAWAVARLSQRFPKAEFAEVSKLVTDLTKVHTECCHGDLLECADDRADLAKYICENQDSISSKLKECCEKPLLEKSHCIAEVENDEMPADLPSLAADFVESKDVCKNYAEAKDVFLGMFLYEYARRHPDYSVVLLLRLAKTYETTLEKCCAAADPHECYAKVFDEFKPLVEEPQNLIKQNCELFEQLGEYKFQNALLVRYTKKVPQVSTPTLVEVSRNLGKVGSKCCKHPEAKRMPCAEDYLSVVLNQLCVLHEKTPVSDRVTKCCTESLVNRRPCFSALEVDETYVPKEFNAETFTFHADICTLSEKERQIKKQTALVELVKHKPKATKEQLKAVMDDFAAFVEKCCKADDKETCFAEEGKKLVAASQAALGL

Chain B — Your poly-G binder sequence from Exercise 0

Name the job HSA_polyG_binder_cycle0
Click "Submit" and wait (usually 5–30 minutes)

Step 3: Download & Understand the Results

Download the results zip. Inside you'll find:

File pattern	What It Contains
`fold_..._model_0.cif`	Rank-0 (best) predicted structure
`fold_..._summary_confidences_0.json`	Summary metrics: ipTM, pTM, ranking_score
`fold_..._full_data_0.json`	Per-atom pLDDT, full PAE matrix
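Unpacking the zip and locating these three files can be scripted. The following is a minimal sketch that relies only on the file-name patterns in the table above. The function name `extract_rank0_files` and the output folder `af3_results` are assumptions for illustration.

```python
import zipfile
from pathlib import Path

def extract_rank0_files(zip_path, out_dir="af3_results"):
    """Unzip a results archive and locate the rank-0 model and JSON files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)

    patterns = {
        "model": "*_model_0.cif",
        "summary": "*_summary_confidences_0.json",
        "full_data": "*_full_data_0.json",
    }
    found = {}
    for label, pattern in patterns.items():
        # rglob also searches subfolders, in case the zip contains a directory.
        matches = sorted(out.rglob(pattern))
        found[label] = matches[0] if matches else None
    return found
```

The returned paths can be passed straight to the JSON-parsing functions in Step 4, and the `model` path is the CIF you will convert in Exercise 2.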

Step 4: Parse the JSON

import json
from pathlib import Path
import numpy as np

def load_af3_summary(path):
    """Load ipTM/pTM from an AlphaFold Server summary JSON."""
    data = json.loads(Path(path).read_text())
    return {
        "iptm":                data.get("iptm"),
        "ptm":                 data.get("ptm"),
        "ranking_score":       data.get("ranking_score"),
        "fraction_disordered": data.get("fraction_disordered"),
        "has_clash":           data.get("has_clash"),
    }

def mean_plddt_for_chain(full_data_json_path, chain_id="B"):
    """Compute mean pLDDT for a specific chain."""
    data = json.loads(Path(full_data_json_path).read_text())
    atom_plddts   = np.array(data["atom_plddts"], dtype=float)
    atom_chain_ids = np.array(data["atom_chain_ids"])
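    mask = (atom_chain_ids == chain_id)
    # Fail loudly instead of returning NaN when the chain ID is not present.
    if not mask.any():
        raise ValueError(f"Chain {chain_id!r} not found in atom_chain_ids")
    return float(atom_plddts[mask].mean())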
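The full-data JSON also holds the PAE matrix, which the Key Metrics table lists as an interface signal. Here is a minimal sketch of a mean inter-chain PAE, assuming the file contains a square `pae` matrix and a matching `token_chain_ids` list. The function name `mean_interface_pae` is hypothetical, and averaging over every A-B token pair (rather than only residues in contact) is a simplification.

```python
import json
from pathlib import Path

def mean_interface_pae(full_data_json_path, chain_a="A", chain_b="B"):
    """Mean PAE over all token pairs that span the two chains, both directions."""
    data = json.loads(Path(full_data_json_path).read_text())
    pae = data["pae"]                  # assumed: token x token matrix
    chains = data["token_chain_ids"]   # assumed: one chain ID per token
    idx_a = [i for i, chain in enumerate(chains) if chain == chain_a]
    idx_b = [i for i, chain in enumerate(chains) if chain == chain_b]
    if not idx_a or not idx_b:
        raise ValueError(f"Chain {chain_a!r} or {chain_b!r} not found")
    # PAE is not symmetric, so include both A->B and B->A entries.
    values = [pae[i][j] for i in idx_a for j in idx_b]
    values += [pae[j][i] for i in idx_a for j in idx_b]
    return sum(values) / len(values)
```

Because this averages over the whole of both chains, expect it to be higher than the PAE of the residues that actually touch; treat it as a relative indicator across cycles.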

Step 5: Logbook Entry

Answer in your notebook:

What is the ipTM score from cycle 0?
Does the binder chain (B) look like it contacts HSA, or is it floating away?
What's your intuition — does a poly-G sequence "deserve" a high confidence score?

Success check:

I submitted a 2-chain prediction to AlphaFold Server
I downloaded the results zip
I recorded ipTM = ___, pTM = ___, pLDDT (chain B) ≈ ___

Exercise 0: Setup and Target

Goal: Prepare your workspace and understand your target protein.

A. Know Your Target: Human Serum Albumin (HSA)

HSA is a ~66.5 kDa transport protein and the most abundant protein in blood plasma. It ferries fatty acids, hormones, drugs, and metal ions through the bloodstream. Designing a protein that binds HSA is therapeutically relevant — HSA-binding peptides are used to extend the half-life of drugs in the body.

Here is the HSA sequence (UniProt P02768), wrapped across lines for display. When you use it, paste it as a single continuous string with no spaces or line breaks:

MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFE
DHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVATLRETYGEMADCCAKQEPER
NECFLQHKDDNPNLPRLVRPEVDVMCTAFHDNEETFLKKYLYEIARRHPYFYAPELLFFA
KRYKAAFTECCQAADKAACLLPKLDELRDEGKASSAKQRLKCASLQKFGERAFKAWAVARL
SQRFPKAEFAEVSKLVTDLTKVHTECCHGDLLECADDRADLAKYICENQDSISSKLKECCE
KPLLEKSHCIAEVENDEMPADLPSLAADFVESKDVCKNYAEAKDVFLGMFLYEYARRHPDY
SVVLLLRLAKTYETTLEKCCAAADPHECYAKVFDEFKPLVEEPQNLIKQNCELFEQLGEYK
FQNALLVRYTKKVPQVSTPTLVEVSRNLGKVGSKCCKHPEAKRMPCAEDYLSVVLNQLCV
LHEKTPVSDRVTKCCTESLVNRRPCFSALEVDETYVPKEFNAETFTFHADICTLSEKERQI
KKQTALVELVKHKPKATKEQLKAVMDDFAAFVEKCCKADDKETCFAEEGKKLVAASQAALGL

Save this sequence — you'll use it many times.
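If you copy the wrapped sequence above, the line breaks need to come out before you paste it anywhere. A minimal sketch of that cleanup follows; the function name `clean_sequence` is an assumption for illustration, and the check against the 20 standard one-letter codes is there to catch stray characters such as "X".

```python
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")

def clean_sequence(raw):
    """Join a wrapped sequence into one string and reject unexpected letters."""
    # str.split() with no arguments drops spaces, tabs, and line breaks.
    seq = "".join(raw.split()).upper()
    unexpected = sorted(set(seq) - CANONICAL_AA)
    if unexpected:
        raise ValueError(f"Unexpected characters in sequence: {unexpected}")
    return seq
```

For example, assign the wrapped block to a triple-quoted string, then run `hsa_seq = clean_sequence(wrapped)` and `print(len(hsa_seq))` to confirm you have one continuous string.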

B. Generate Your Binder Seed Sequence

Our binder will start as a stretch of glycine (G) residues, a simple placeholder. Because AlphaFold Server does not accept the "X" token, we use poly-G as a low-information stand-in.

import random

# Choose a binder length between 70 and 150 residues
binder_length = random.randint(70, 150)
binder_seq = "G" * binder_length

print(f"Binder length: {binder_length} residues")
print(f"Binder sequence: {binder_seq}")

Write down your binder_length — you'll need it later.

Success check:

I have the HSA sequence saved
I generated a poly-G binder sequence of length ___
I understand the cycle: Hallucinate → Redesign → Re-predict → Repeat

Key Metrics

Metric	What It Measures	Good Values
pLDDT	Per-residue confidence in the predicted structure (0–100)	> 70 is confident; > 90 is excellent
ipTM	Predicted quality of the interface between two chains (0–1)	> 0.6 is promising; > 0.8 is strong
pTM	Predicted quality of the overall fold (0–1)	Higher is better
PAE	Predicted Aligned Error between residue pairs (lower = better)	< 5 Å at the interface

These are model confidence signals, not experimental measurements. Use them as relative indicators across your cycles, not as absolute proof of binding.
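The thresholds in the table can be written down as a small lookup, which is handy when annotating your log. This is a minimal sketch: the function names and the labels "below threshold" and "low" are assumptions for illustration, while the cutoffs (0.6 and 0.8 for ipTM, 70 and 90 for pLDDT) come from the table above.

```python
def label_iptm(iptm):
    """Map an ipTM value (0-1) to the qualitative bands in the metrics table."""
    if not 0.0 <= iptm <= 1.0:
        raise ValueError("ipTM should be between 0 and 1.")
    if iptm > 0.8:
        return "strong"
    if iptm > 0.6:
        return "promising"
    return "below threshold"

def label_plddt(plddt):
    """Map a pLDDT value (0-100) to the qualitative bands in the metrics table."""
    if not 0.0 <= plddt <= 100.0:
        raise ValueError("pLDDT should be between 0 and 100.")
    if plddt > 90:
        return "excellent"
    if plddt > 70:
        return "confident"
    return "low"
```

The labels are only shorthand for your notes; as stated above, none of them is evidence of binding.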

The Science

Protein Hunter's insight is beautifully simple:

                   ┌─────────────────────────────┐
                   │                             │
                   ▼                             │
  Placeholder   ┌──────────────┐   Hallucinated  │
  Sequence ───▶ │  Diffusion   │ ──▶ Backbone ───┤
  (all-X/G)     │  Model       │                 │
                │ (AF3/Boltz)  │                 │
                └──────────────┘                 │
                                                 │
                ┌──────────────┐   Redesigned    │
                │ ProteinMPNN  │ ◀── Backbone    │
                │ (Inverse     │                 │
                │  Folding)    │                 │
                └──────┬───────┘                 │
                       │                         │
                       ▼                         │
                  New Sequence ──────────────────┘

Hallucination: Feed the model a placeholder sequence alongside a real target. The diffusion model's learned priors force it to hallucinate a well-folded backbone.
Inverse Folding: ProteinMPNN looks at the hallucinated backbone and asks: "What amino acid sequence would fold into this shape?"
Re-prediction: Feed the new sequence back into a structure-prediction model. The backbone improves.
Iterate: Each cycle refines both structure and sequence. Confidence scores tend to climb.
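The four steps above can be sketched as a short loop. This is a schematic only, not the Protein Hunter implementation: `design_loop`, `predict`, and `redesign` are hypothetical names, and in this manual route the two callables stand in for work you do by hand (a structure-prediction job and an inverse-folding run).

```python
def design_loop(target_seq, seed_binder, predict, redesign, n_cycles=2):
    """Alternate structure prediction and inverse folding, logging each cycle.

    predict(target_seq, binder_seq) -> (structure, metrics)
    redesign(structure) -> new binder sequence
    """
    binder = seed_binder
    history = []
    for cycle in range(n_cycles + 1):
        # Hallucinate (cycle 0) or re-predict (later cycles) the complex.
        structure, metrics = predict(target_seq, binder)
        history.append({"cycle": cycle, "binder": binder, "metrics": metrics})
        # Redesign the sequence for the next cycle, except after the last one.
        if cycle < n_cycles:
            binder = redesign(structure)
    return history
```

With `n_cycles=2` this gives three predictions (cycles 0, 1, and 2) and two redesigns, which is the shape of the manual pipeline in the exercises.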

Citation: Cho, Y., Rangel, G., Bhardwaj, G., & Ovchinnikov, S. (2025). Protein Hunter: exploiting structure hallucination within diffusion for protein design. bioRxiv. https://doi.org/10.1101/2025.10.10.681530

Why this route exists

In earlier routes you learned to wrangle data, write functions, and explore protein sequences. Now you step into the world those tools were built for: computational protein design.

A recent preprint called Protein Hunter (Cho et al., 2025) showed something remarkable: you can start from literally nothing — a string of unknown amino acids — and coax a diffusion-based structure-prediction model into hallucinating a well-folded protein backbone. Then, by cycling between sequence redesign and structure re-prediction, you can iteratively improve the design.

In this route, you'll recreate a class-friendly version of this pipeline manually — using web interfaces and Colab notebooks to understand each step. You'll design a novel binder for Human Serum Albumin (HSA).

Success in this route = you can run the loop, keep clean logs, and explain metric changes — not that you "made a real binder." ipTM may stay low or fluctuate. That's normal. Document it and explain why.

By the end, you can:

Use AlphaFold Server to predict protein complex structures
Convert between CIF and PDB formats using BioPython
Run ProteinMPNN and SolubleMPNN for inverse folding
Use the Boltz-2 API for structure prediction
Track confidence metrics (ipTM, pLDDT) across design cycles
Explain the structure ↔ sequence cycling approach

Route: The Hallucination Ascent (Manual)

RouteID: R024
Wall: Protein Design (W07)
Grade: 5.11c
Routesetter: Abhiram
Time: ~2-3 hours (multi-session recommended)
Target Protein: Human Serum Albumin (HSA)
Key Paper: Protein Hunter — Cho et al., 2025

This is a hard route. You will use four different platforms, handle real structural biology files (CIF/PDB), and execute a cutting-edge protein design workflow. Take your time.