Submission

Deliverables

Submit your completed notebook (.ipynb) with:

Automated pipeline script that runs at least 3 cycles, ideally 5 (Exercise 1)
ipTM and pLDDT plots across all cycles (Exercise 1)
Final metrics table (Exercise 2)
Amino acid composition shift plot (Exercise 2)
Reflection answers (Exercise 2)

File naming: lastname_firstname_R025.ipynb

Exercise 2: Summit Log and Reflection

Goal: Tie it all together. Present your results and reflect on what you built.

A. Final Metrics Table

Fill in your complete results:

Cycle	Model Used	Sequence Changed?
0	Initial (poly-G)	N/A
1	MPNN → Boltz	Yes
2	SolMPNN → Boltz	Yes
3	...	Yes
4	...	Yes
5	...	Yes
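
The "Sequence Changed?" column can be filled in from your cycle log rather than by eye. The following is a minimal sketch, assuming each log record is a dict with the hypothetical keys `cycle`, `model`, and `sequence`; the function names and the toy sequences are illustrative placeholders, not real results.

```python
def build_summit_table(records):
    """Turn per-cycle log records into rows for the final metrics table.

    Assumes each record has the (hypothetical) keys "cycle", "model", "sequence".
    """
    rows = []
    previous = None
    for rec in sorted(records, key=lambda r: r["cycle"]):
        if previous is None:
            changed = "N/A"  # cycle 0 has no earlier sequence to compare against
        else:
            changed = "Yes" if rec["sequence"] != previous else "No"
        rows.append({
            "Cycle": rec["cycle"],
            "Model Used": rec["model"],
            "Sequence Changed?": changed,
        })
        previous = rec["sequence"]
    return rows


def format_table(rows):
    """Render the rows as tab-separated text, header line first."""
    header = list(rows[0].keys())
    lines = ["\t".join(header)]
    lines += ["\t".join(str(row[col]) for col in header) for row in rows]
    return "\n".join(lines)


# Toy placeholder records, only to show the expected shape of the input.
example = [
    {"cycle": 0, "model": "Initial (poly-G)", "sequence": "GGGG"},
    {"cycle": 1, "model": "MPNN -> Boltz", "sequence": "MKVL"},
    {"cycle": 2, "model": "SolMPNN -> Boltz", "sequence": "MKVL"},
]
print(format_table(build_summit_table(example)))
```

If your log also stores ipTM and pLDDT, adding them as extra keys in each row carries them into the same table.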

B. Amino Acid Composition Shift

Create your own plot showing how amino acid composition changed across cycles.

Minimum requirements:

Include at least cycle 0, cycle 1, and your final cycle
Normalize by sequence length (percent or fraction)
Label axes and cycle names clearly
Save the figure as aa_composition_shift.png

You may use any plotting style (stacked bars, grouped bars, heatmap, etc.) as long as it is readable.
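
One possible way to meet the minimum requirements is sketched below: count residues per cycle, divide by sequence length, and draw grouped bars. The function names and the `sequences_by_cycle` mapping are assumptions made for illustration, and the plotting function assumes matplotlib is installed (it is on Colab and Kaggle at the time of writing).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aa_fractions(sequence):
    """Return {residue: fraction of the sequence} for the 20 standard residues.

    Non-standard characters (e.g. "X") count toward the length but get no bar.
    """
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(sequence.upper())
    return {aa: counts.get(aa, 0) / len(sequence) for aa in AMINO_ACIDS}


def plot_composition_shift(sequences_by_cycle, out_path="aa_composition_shift.png"):
    """Grouped bars: one group per residue, one bar per cycle label.

    `sequences_by_cycle` maps a label such as "cycle 0" to a binder sequence.
    """
    import matplotlib.pyplot as plt  # imported lazily so aa_fractions works without it

    labels = list(sequences_by_cycle)
    width = 0.8 / len(labels)
    fig, ax = plt.subplots(figsize=(10, 4))
    for i, label in enumerate(labels):
        fracs = aa_fractions(sequences_by_cycle[label])
        xs = [j + i * width for j in range(len(AMINO_ACIDS))]
        ax.bar(xs, [fracs[aa] for aa in AMINO_ACIDS], width=width, label=label)
    centers = [j + width * (len(labels) - 1) / 2 for j in range(len(AMINO_ACIDS))]
    ax.set_xticks(centers)
    ax.set_xticklabels(list(AMINO_ACIDS))
    ax.set_xlabel("Amino acid")
    ax.set_ylabel("Fraction of binder sequence")
    ax.legend(title="Cycle")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

Calling `plot_composition_shift({"cycle 0": seq0, "cycle 1": seq1, "final": seq_final})` with your own logged sequences would write the required file. A heatmap built from the same `aa_fractions` output is an equally valid choice.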

C. Reflection Questions

Answer in your notebook (~10–15 sentences total):

How did the automated pipeline compare to the manual process? What was easier/harder?
Did metrics improve monotonically, or did you see fluctuations? What might explain the pattern?
What was the biggest engineering challenge in automating the loop?
If you had more compute time, what would you change about your pipeline?
Protein Hunter uses the phrase "diffusion hallucination." In your own words, what does it mean for a model to "hallucinate" a protein structure?
How does this connect to what you learned in earlier routes (functions, data manipulation, pandas)?

Success check:

Final metrics table is complete
Amino acid composition plot shows changes across cycles
Reflection questions are answered

Exercise 1: Build the Automated Pipeline

Goal: Build a fully automated Protein Hunter–style pipeline using open-source tools on free GPU compute.

The Big Picture

In the manual route, you did:

poly-G → AF3 → ProteinMPNN → Boltz-2 → SolubleMPNN → AF3 → ...

Now automate it. Your pipeline should:

Start from a poly-G binder sequence of a chosen length
Predict the complex structure using an open-source diffusion model
Redesign the binder sequence using ProteinMPNN/SolubleMPNN
Re-predict the structure with the new sequence
Repeat for N cycles (aim for 5)
Log ipTM, pTM, pLDDT, and the sequence at each cycle
Visualize the improvement trajectory
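
For the redesign step in the list above, ProteinMPNN is usually driven through its `protein_mpnn_run.py` script. The sketch below only builds the command and parses the FASTA text the script writes; it does not run anything. The flags shown are the ones documented in the ProteinMPNN repository as far as I know, but the script location, the chain ID `B` for the binder, and the helper names are assumptions, so check them against the README of the version you install.

```python
import sys


def mpnn_command(pdb_path, out_dir, binder_chain="B", num_seqs=8,
                 temperature=0.1, soluble=False,
                 script="ProteinMPNN/protein_mpnn_run.py"):
    """Build the argument list for one redesign run (nothing is executed here).

    Assumption: the repo is cloned into ./ProteinMPNN and the binder is chain B.
    Chains not listed in --pdb_path_chains are kept fixed as context.
    """
    cmd = [
        sys.executable, script,
        "--pdb_path", str(pdb_path),
        "--pdb_path_chains", binder_chain,
        "--out_folder", str(out_dir),
        "--num_seq_per_target", str(num_seqs),
        "--sampling_temp", str(temperature),
        "--batch_size", "1",
    ]
    if soluble:
        cmd.append("--use_soluble_model")  # switches to the SolubleMPNN weights
    return cmd


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

In the outputs I have seen, the first FASTA record repeats the input sequence and the later records are the sampled designs, so `parse_fasta(text)[1:]` would be the candidate list; verify this on your own first run. ProteinMPNN reads PDB files, so the predictor's output may need to be written or converted to PDB format first.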

Important: For the first prediction step, you must use a diffusion-based structure prediction model. Non-diffusion models like ESMFold lack the diffusion denoising process that enables hallucination from placeholder tokens.

Step 1: Choose Your Platform

Platform	Free GPU	Time Limit	Best For
Google Colab	T4 (15 GB)	~12 hrs/day	Quick prototyping
Kaggle	P100 / T4×2 (16 GB)	30 hrs/week	Longer runs
Lightning AI	Various	22 hrs/month free	Studio environment

Connect to a GPU in Colab: Go to Runtime → Change runtime type → Hardware accelerator → GPU (T4)
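
Before starting a long run, it can help to confirm from inside the notebook that a GPU is actually attached. This is a small sketch that shells out to `nvidia-smi`; the helper name `gpu_summary` is made up for this example.

```python
import shutil
import subprocess


def gpu_summary():
    """Return nvidia-smi's "name, total memory" line(s), or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver tools not installed: almost certainly a CPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


print(gpu_summary() or "No GPU detected; check the runtime type before a long run.")
```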

Step 2: Open-Source Diffusion-Based Predictors

Model	Install	Notes
Boltz-1	`pip install boltz` — GitHub	MIT license; open-source
OpenFold 3	GitHub	Apache 2.0; AF3 reproduction
Chai-1	GitHub	Academic license
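
As an illustration for Boltz, the sketch below writes a two-chain YAML input (target as chain A, poly-G binder as chain B), builds a `boltz predict` command, and pulls the metrics out of a confidence JSON. It reflects my reading of the Boltz documentation: the YAML schema, the `--use_msa_server`, `--out_dir`, and `--output_format` options, and the JSON keys `iptm`, `ptm`, and `complex_plddt` should all be treated as assumptions to verify against the version you install. The helper names are hypothetical.

```python
import json


def polyg(length):
    """Placeholder binder: a run of glycines of the chosen length."""
    return "G" * length


def boltz_yaml(target_seq, binder_seq, single_sequence_binder=True):
    """Build YAML text with the target as chain A and the binder as chain B."""
    lines = [
        "version: 1",
        "sequences:",
        "  - protein:",
        "      id: A",
        f"      sequence: {target_seq}",
        "  - protein:",
        "      id: B",
        f"      sequence: {binder_seq}",
    ]
    if single_sequence_binder:
        # Assumption: a designed binder has no homologs, so skip its MSA.
        lines.append("      msa: empty")
    return "\n".join(lines) + "\n"


def boltz_command(yaml_path, out_dir):
    """Argument list for one prediction; PDB output keeps ProteinMPNN happy."""
    return ["boltz", "predict", str(yaml_path),
            "--use_msa_server", "--out_dir", str(out_dir),
            "--output_format", "pdb"]


def read_confidence(json_text):
    """Extract the three logged metrics from a Boltz confidence JSON string."""
    data = json.loads(json_text)
    return {"iptm": data["iptm"], "ptm": data["ptm"], "plddt": data["complex_plddt"]}
```

Check the scale of the pLDDT value your predictor reports (0 to 1 versus 0 to 100) before comparing it with numbers from the manual route.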

Step 3: The Pipeline Script

Build this yourself from scratch. Do not copy full scripts from the route page.

Required structure (pseudocode level):

initialize poly-G binder
for each cycle:
  predict structure for target + current binder (diffusion model)
  redesign binder sequence from predicted structure (MPNN/SolMPNN)
  choose next binder sequence
  log cycle metrics (ipTM, pTM, pLDDT, sequence)
generate summary plots + final table
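
To show the shape of the loop without giving away the stages, here is a deliberately hollow skeleton: the four stage functions are passed in as arguments, so you still have to write `predict`, `redesign`, `choose`, and `log` yourself. All names and signatures are assumptions chosen for this sketch, not part of any tool's API.

```python
def run_cycles(target_seq, binder_length, n_cycles, predict, redesign, choose, log):
    """Run the structure -> sequence -> structure loop with injected stage functions.

    Expected (hypothetical) signatures:
      predict(target_seq, binder_seq) -> (structure, metrics_dict)
      redesign(structure)             -> list of candidate binder sequences
      choose(candidates)              -> one binder sequence
      log(record)                     -> None (e.g. append to a CSV file)
    """
    binder = "G" * binder_length  # cycle 0 starts from the poly-G placeholder
    history = []

    structure, metrics = predict(target_seq, binder)
    record = {"cycle": 0, "sequence": binder, **metrics}
    log(record)
    history.append(record)

    for cycle in range(1, n_cycles + 1):
        candidates = redesign(structure)
        binder = choose(candidates)
        structure, metrics = predict(target_seq, binder)
        record = {"cycle": cycle, "sequence": binder, **metrics}
        log(record)  # write every cycle so a crash loses at most one step
        history.append(record)

    return history
```

Passing the stages in as functions makes it straightforward to test the loop with cheap stand-ins before spending GPU time, and to swap ProteinMPNN for SolubleMPNN between cycles.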

Your code must include:

A reusable function for each stage (prediction, redesign, logging)
A loop that runs at least 3 cycles (target 5)
A persistent log written to disk (.csv or .json)
Clear error handling for at least one failure mode (missing file, timeout, OOM, etc.)
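
Two of these requirements, the persistent log and the error handling, can be sketched with the standard library alone. The column names in `LOG_FIELDS`, the one-hour default timeout, and both function names are assumptions; adapt them to your own pipeline.

```python
import csv
import subprocess
from pathlib import Path

LOG_FIELDS = ["cycle", "sequence", "iptm", "ptm", "plddt", "status"]


def append_log(log_path, record):
    """Append one cycle record to a CSV file, writing the header on first use."""
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerow(record)  # missing keys are written as empty cells


def run_step(cmd, timeout_s=3600):
    """Run one external command and return (ok, message) instead of raising."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except FileNotFoundError:
        return False, f"command not found: {cmd[0]}"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s} s"
    if result.returncode != 0:
        tail = result.stderr.strip().splitlines()[-1:]  # last stderr line, if any
        return False, "failed: " + (tail[0] if tail else f"exit code {result.returncode}")
    return True, "ok"
```

Recording the returned message in the `status` column leaves a trace of which cycles failed and why, which also feeds the debugging log you keep in the notebook.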

Step 4: Implementation Tips

Refer to each tool's official documentation for installation and command syntax.

Recommended workflow:

Ask your chatbot to generate a first-draft pipeline from your own pseudocode
Run one cycle first and debug it
Scale to 3–5 cycles only after the single-cycle run works
Keep a short debugging log in your notebook

Memory issues? HSA (human serum albumin) is a large protein (609 aa). If you run into out-of-memory errors:

Reduce BINDER_LENGTH to 70–80 residues

Use ESMFold for intermediate validation (faster, less memory)

Split the work across sessions
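
Splitting the work across sessions is easier if the pipeline can pick up where the log left off. The sketch below assumes a CSV log with `cycle` and `sequence` columns (hypothetical names) and adds a simple text check for out-of-memory failures; the matched phrase is the one PyTorch normally prints, but treat that as an assumption for other tools.

```python
import csv
from pathlib import Path


def resume_state(log_path, binder_length):
    """Return (next_cycle, binder_sequence) from the log, or a fresh poly-G start."""
    path = Path(log_path)
    if not path.exists():
        return 0, "G" * binder_length
    with path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        return 0, "G" * binder_length
    last = rows[-1]
    return int(last["cycle"]) + 1, last["sequence"]


def looks_like_oom(stderr_text):
    """Heuristic: does this error output describe a GPU out-of-memory failure?"""
    return "out of memory" in stderr_text.lower()
```

After resuming, the predicted structure for the last logged sequence is needed again for the next redesign, so either save structure files alongside the log or re-predict once at the start of the new session. When `looks_like_oom` fires, retrying with a shorter binder is one reasonable response.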

Success check:

I automated the structure → sequence → structure loop
I ran at least 3 full cycles (target: 5)
I plotted ipTM and pLDDT across all cycles
I documented my metric trends (improving, flat, or fluctuating)
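
For the last two checks, a rough way to label a metric series and plot both metrics is sketched below. The `tolerance` of 0.01 is an arbitrary assumption (steps smaller than that are treated as noise), the record keys `cycle`, `iptm`, and `plddt` are hypothetical, and the plotting function assumes matplotlib is available.

```python
def classify_trend(values, tolerance=0.01):
    """Label a metric series as improving, declining, flat, or fluctuating.

    Compares consecutive cycles only, so a slow drift made of sub-tolerance
    steps is reported as "flat".
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    if all(abs(d) <= tolerance for d in deltas):
        return "flat"  # also covers series with fewer than two points
    if all(d >= -tolerance for d in deltas):
        return "improving"
    if all(d <= tolerance for d in deltas):
        return "declining"
    return "fluctuating"


def plot_trajectory(records, out_path="metric_trajectory.png"):
    """Plot ipTM and pLDDT against cycle number, one panel per metric."""
    import matplotlib.pyplot as plt  # imported lazily so classify_trend works without it

    cycles = [r["cycle"] for r in records]
    fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))
    for ax, key, label in zip(axes, ("iptm", "plddt"), ("ipTM", "pLDDT")):
        ax.plot(cycles, [r[key] for r in records], marker="o")
        ax.set_xlabel("Cycle")
        ax.set_ylabel(label)
        ax.set_xticks(cycles)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

The label is only a starting point for your written explanation; with three to five cycles, a single noisy prediction can flip "improving" to "fluctuating".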

References

Cho, Y., Rangel, G., Bhardwaj, G., & Ovchinnikov, S. (2025). Protein Hunter. bioRxiv. https://doi.org/10.1101/2025.10.10.681530
Dauparas, J. et al. (2022). ProteinMPNN. Science, 378(6615), 49–56. https://doi.org/10.1126/science.add2187
ProteinMPNN code: https://github.com/dauparas/ProteinMPNN
Boltz-1 code: https://github.com/jwohlwend/boltz
Protein Hunter code: https://github.com/yehlincho/Protein-Hunter

Why this route exists

In The Hallucination Ascent (Manual), you ran each step of the Protein Hunter pipeline by hand — AlphaFold Server, ProteinMPNN, Boltz-2, SolubleMPNN. You understand what each tool does and why the cycling works.

Now it's time to automate. Real protein design campaigns don't stop at 2-3 manual cycles. They run dozens or hundreds of iterations, testing different starting points, temperatures, and redesign strategies. That requires automation.

This route is about engineering: taking the conceptual loop you understand and turning it into code that runs end-to-end without manual intervention.

By the end, you can:

Build an automated structure ↔ sequence cycling pipeline
Use open-source structure prediction tools (Boltz-1, OpenFold 3)
Run ProteinMPNN programmatically from Python
Log and visualize design metrics across many cycles
Debug GPU memory and API rate-limit issues

Route: The Hallucination Ascent (Automated)

RouteID: R025
Wall: Protein Design (W07)
Grade: 5.12a
Routesetter: Abhiram
Time: ~2-3 hours
Prerequisites: The Hallucination Ascent (Manual)

This is a project-level route. You will write real automation code, deal with GPU memory limits, and debug API calls. Multi-session work is expected.