The Hallucination Ascent (Automated)

Route ID: R025 • Wall: W07 • Released: Feb 24, 2026

5.12a
ready

🎉 Sent!

You made it to the top. Submit your work above!

Submission

Submit your notebook here


Deliverables

Submit your completed notebook (.ipynb) with:

  1. Automated pipeline script run for at least 3 cycles, ideally 5 (Exercise 1)
  2. ipTM and pLDDT plots across all cycles (Exercise 1)
  3. Final metrics table (Exercise 2)
  4. Amino acid composition shift plot (Exercise 2)
  5. Reflection answers (Exercise 2)

File naming: lastname_firstname_R025.ipynb


Exercise 2: Summit Log and Reflection

Goal: Tie it all together. Present your results and reflect on what you built.

A. Final Metrics Table

Fill in your complete results:

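| Cycle | Model Used | ipTM | pTM | pLDDT (B) | Sequence Changed? |
|---|---|---|---|---|---|
| 0 | Initial (poly-G) | | | | N/A |
| 1 | MPNN → Boltz | | | | Yes |
| 2 | SolMPNN → Boltz | | | | Yes |
| 3 | ... | | | | Yes |
| 4 | ... | | | | Yes |
| 5 | ... | | | | Yes |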
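
One way to fill in this table is to derive it from per-cycle records rather than typing it by hand. The following is a minimal sketch, assuming each cycle's results are available as a dict; the key names (`cycle`, `model`, `iptm`, `ptm`, `plddt_b`, `sequence`) and both function names are hypothetical, not part of any tool's API. The "Sequence Changed?" column is computed by comparing each sequence with the previous cycle's.

```python
def build_metrics_table(records):
    """Turn per-cycle records (list of dicts) into rows for the final table.

    The dict keys used here are hypothetical; adapt them to your own log.
    """
    rows = []
    prev_seq = None
    for rec in sorted(records, key=lambda r: r["cycle"]):
        if prev_seq is None:
            changed = "N/A"  # cycle 0 has no earlier sequence to compare with
        else:
            changed = "Yes" if rec["sequence"] != prev_seq else "No"
        rows.append({
            "Cycle": rec["cycle"],
            "Model Used": rec["model"],
            "ipTM": rec["iptm"],
            "pTM": rec["ptm"],
            "pLDDT (B)": rec["plddt_b"],
            "Sequence Changed?": changed,
        })
        prev_seq = rec["sequence"]
    return rows


def format_table(rows):
    """Render the rows as pipe-separated text for display in a notebook."""
    headers = list(rows[0].keys())
    lines = [" | ".join(headers)]
    for row in rows:
        lines.append(" | ".join(str(row[h]) for h in headers))
    return "\n".join(lines)
```

If you prefer pandas, passing the same list of row dicts to `pandas.DataFrame` gives an equivalent table.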

B. Amino Acid Composition Shift

Create your own plot showing how amino acid composition changed across cycles.

Minimum requirements:

  1. Include at least cycle 0, cycle 1, and your final cycle
  2. Normalize by sequence length (percent or fraction)
  3. Label axes and cycle names clearly
  4. Save the figure as aa_composition_shift.png

You may use any plotting style (stacked bars, grouped bars, heatmap, etc.) as long as it is readable.
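
As a starting point, the requirements above could be met with something like the sketch below. It assumes sequences use the 20 standard one-letter codes and that matplotlib is installed in your notebook environment; the function names and the grouped-bar layout are illustrative choices, not a required design. Any other readable style is equally valid.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def aa_composition(sequence):
    """Return {residue: fraction}, normalized by sequence length."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    return {aa: counts.get(aa, 0) / len(sequence) for aa in AMINO_ACIDS}


def plot_composition_shift(sequences_by_cycle, out_path="aa_composition_shift.png"):
    """Grouped bar chart of percent composition, one bar group per residue.

    sequences_by_cycle maps a cycle label (e.g. "Cycle 0") to its sequence.
    """
    import matplotlib.pyplot as plt  # lazy import; assumes matplotlib is available

    fig, ax = plt.subplots(figsize=(10, 4))
    width = 0.8 / len(sequences_by_cycle)  # bars in a group share 0.8 units
    for i, (label, seq) in enumerate(sequences_by_cycle.items()):
        comp = aa_composition(seq)
        xs = [j + i * width for j in range(len(AMINO_ACIDS))]
        ax.bar(xs, [100 * comp[aa] for aa in AMINO_ACIDS], width=width, label=label)
    # Put each tick at the center of its bar group
    ax.set_xticks([j + 0.4 - width / 2 for j in range(len(AMINO_ACIDS))])
    ax.set_xticklabels(list(AMINO_ACIDS))
    ax.set_xlabel("Amino acid")
    ax.set_ylabel("Composition (% of sequence length)")
    ax.legend(title="Cycle")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    return fig
```

For a poly-G starting sequence, `aa_composition` returns 1.0 for `G` and 0.0 for every other residue, so the cycle 0 bars make the shift away from glycine easy to see.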

C. Reflection Questions

Answer in your notebook (~10–15 sentences total):

  1. How did the automated pipeline compare to the manual process? What was easier/harder?

  2. Did metrics improve monotonically, or did you see fluctuations? What might explain the pattern?

  3. What was the biggest engineering challenge in automating the loop?

  4. If you had more compute time, what would you change about your pipeline?

  5. Protein Hunter uses the phrase "diffusion hallucination." In your own words, what does it mean for a model to "hallucinate" a protein structure?

  6. How does this connect to what you learned in earlier routes (functions, data manipulation, pandas)?

Success check:

  • Final metrics table is complete
  • Amino acid composition plot shows changes across cycles
  • Reflection questions are answered

Exercise 1: Build the Automated Pipeline

Goal: Build a fully automated Protein Hunter–style pipeline using open-source tools on free GPU compute.

The Big Picture

In the manual route, you did:

poly-G → AF3 → ProteinMPNN → Boltz-2 → SolubleMPNN → AF3 → ...

Now automate it. Your pipeline should:

  1. Start from a poly-G binder sequence of a chosen length
  2. Predict the complex structure using an open-source diffusion model
  3. Redesign the binder sequence using ProteinMPNN/SolubleMPNN
  4. Re-predict the structure with the new sequence
  5. Repeat for N cycles (aim for 5)
  6. Log ipTM, pLDDT, and sequence at each cycle
  7. Visualize the improvement trajectory

Important: For the first prediction step, you must use a diffusion-based structure prediction model. Non-diffusion models like ESMFold lack the diffusion denoising process that enables hallucination from placeholder tokens.
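
Step 1 and the input to step 2 can be sketched as follows. This is a minimal illustration, assuming Boltz is your predictor: the two function names are hypothetical, and the YAML layout reflects our understanding of the Boltz input schema (a `version` field plus a `sequences` list of `protein` entries with `id` and `sequence`), so check it against the Boltz documentation before relying on it. MSA settings are deliberately left out here.

```python
def make_poly_g(length):
    """Return a poly-glycine placeholder binder of the given length."""
    if length <= 0:
        raise ValueError("binder length must be positive")
    return "G" * length


def boltz_input_yaml(target_seq, binder_seq, target_id="A", binder_id="B"):
    """Build YAML text describing a two-chain complex (target + binder).

    Assumption: this layout matches the Boltz input schema; verify it
    against the official docs. MSA options are not set here.
    """
    lines = ["version: 1", "sequences:"]
    for chain_id, seq in ((target_id, target_seq), (binder_id, binder_seq)):
        lines += [
            "  - protein:",
            f"      id: {chain_id}",
            f"      sequence: {seq}",
        ]
    return "\n".join(lines) + "\n"
```

In later cycles, the same helper can be reused by passing the redesigned binder sequence in place of the poly-G placeholder.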

Step 1: Choose Your Platform

| Platform | Free GPU | Time Limit | Best For |
|---|---|---|---|
| Google Colab | T4 (15 GB) | ~12 hrs/day | Quick prototyping |
| Kaggle | P100 / T4×2 (16 GB) | 30 hrs/week | Longer runs |
| Lightning AI | Various | 22 hrs/month free | Studio environment |

Connect to a GPU in Colab: Go to Runtime → Change runtime type → Hardware accelerator → GPU (T4)

Step 2: Open-Source Diffusion-Based Predictors

| Model | Install | Notes |
|---|---|---|
| Boltz-1 | pip install boltz (GitHub) | MIT license; open-source |
| OpenFold 3 | GitHub | Apache 2.0; AF3 reproduction |
| Chai-1 | GitHub | Academic license |

Step 3: The Pipeline Script

Build this yourself from scratch. Do not copy full scripts from the route page.

Required structure (pseudocode level):

initialize poly-G binder
for each cycle:
  predict structure for target + current binder (diffusion model)
  redesign binder sequence from predicted structure (MPNN/SolMPNN)
  choose next binder sequence
  log cycle metrics (ipTM, pTM, pLDDT, sequence)
generate summary plots + final table

Your code must include:

  1. A reusable function for each stage (prediction, redesign, logging)
  2. A loop that runs at least 3 cycles (target 5)
  3. A persistent log written to disk (.csv or .json)
  4. Clear error handling for at least one failure mode (missing file, timeout, OOM, etc.)
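
Requirement 3 (a persistent log) can be handled with the standard library alone. The sketch below appends one row per cycle to a CSV file so that results survive a crashed or disconnected session; the field names and function names are hypothetical, and a JSON log would work just as well.

```python
import csv
from pathlib import Path

# Hypothetical column names; rename to match the metrics you record
LOG_FIELDS = ["cycle", "model", "iptm", "ptm", "plddt", "sequence"]


def append_cycle_log(log_path, record):
    """Append one cycle's metrics to a CSV file, writing the header once."""
    path = Path(log_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(record)


def read_cycle_log(log_path):
    """Read the log back as a list of dicts (all values are strings)."""
    with open(log_path, newline="") as fh:
        return list(csv.DictReader(fh))
```

Writing after every cycle, rather than once at the end, means a session timeout costs you at most the cycle in progress.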

Step 4: Implementation Tips

Refer to each tool's official documentation (see the code repositories under References) for installation and command syntax.

Recommended workflow:

  1. Ask your chatbot to generate a first draft pipeline from your own pseudocode
  2. Run one cycle first and debug it
  3. Scale to 3–5 cycles only after the single-cycle run works
  4. Keep a short debugging log in your notebook

Memory issues? HSA is a large protein (609 aa). If you run into out-of-memory errors:

  • Reduce BINDER_LENGTH to 70–80 residues
  • Use ESMFold for intermediate validation (faster, less memory)
  • Split the work across sessions
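
Failures such as these are easier to handle if every external command goes through one wrapper. The sketch below is one possible shape, assuming your prediction and redesign tools are invoked as command-line programs; the function name is hypothetical, and matching on the text "out of memory" in stderr is an assumption based on the usual PyTorch CUDA error message, so adjust it to what your tool actually prints.

```python
import subprocess


def run_with_handling(cmd, timeout_s=3600):
    """Run an external command and return (ok, message) instead of raising.

    cmd is a list of arguments, e.g. the command line for your predictor.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except FileNotFoundError:
        return False, f"executable not found: {cmd[0]}"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s} s"
    except subprocess.CalledProcessError as err:
        # Assumption: OOM failures mention "out of memory" in stderr
        if "out of memory" in (err.stderr or "").lower():
            return False, "GPU out of memory: consider a shorter binder"
        return False, f"command failed with exit code {err.returncode}"
    return True, result.stdout
```

Returning a status instead of raising lets the loop decide what to do next, for example logging the failure and retrying the cycle with a reduced BINDER_LENGTH.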

Success check:

  • I automated the structure → sequence → structure loop
  • I ran at least 3 full cycles (target 5)
  • I plotted ipTM and pLDDT across all cycles
  • I documented my metric trends (improving, flat, or fluctuating)
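
The last two checks can be supported with a small plotting helper and a rough trend label. This is a sketch under stated assumptions: matplotlib is installed, the function names and output file name are hypothetical, and the tolerance-based classification is a simple heuristic for describing your results, not a statistical test.

```python
def classify_trend(values, tol=0.02):
    """Label a metric series as improving, flat, declining, or fluctuating.

    tol is the largest step-to-step change still treated as noise.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles to describe a trend")
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(abs(d) <= tol for d in diffs):
        return "flat"
    if all(d >= -tol for d in diffs):
        return "improving"
    if all(d <= tol for d in diffs):
        return "declining"
    return "fluctuating"


def plot_metric_trajectories(cycles, iptm, plddt, out_path="metrics_by_cycle.png"):
    """Plot ipTM and pLDDT against cycle number in side-by-side panels."""
    import matplotlib.pyplot as plt  # lazy import; assumes matplotlib is available

    fig, (ax_iptm, ax_plddt) = plt.subplots(1, 2, figsize=(9, 3.5))
    ax_iptm.plot(cycles, iptm, marker="o")
    ax_iptm.set_xlabel("Cycle")
    ax_iptm.set_ylabel("ipTM")
    ax_plddt.plot(cycles, plddt, marker="o")
    ax_plddt.set_xlabel("Cycle")
    ax_plddt.set_ylabel("pLDDT")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    return fig
```

Separate panels are used because the two metrics may be on different scales depending on the predictor (0 to 1 or 0 to 100 for pLDDT).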

References

  1. Cho, Y., Rangel, G., Bhardwaj, G., & Ovchinnikov, S. (2025). Protein Hunter. bioRxiv. https://doi.org/10.1101/2025.10.10.681530

  2. Dauparas, J. et al. (2022). ProteinMPNN. Science, 378(6615), 49–56. https://doi.org/10.1126/science.add2187

  3. ProteinMPNN code: https://github.com/dauparas/ProteinMPNN

  4. Boltz-1 code: https://github.com/jwohlwend/boltz

  5. Protein Hunter code: https://github.com/yehlincho/Protein-Hunter


Why this route exists

In The Hallucination Ascent (Manual), you ran each step of the Protein Hunter pipeline by hand: AlphaFold Server, ProteinMPNN, Boltz-2, SolubleMPNN. You understand what each tool does and why the cycling works.

Now it's time to automate. Real protein design campaigns don't stop at 2-3 manual cycles. They run dozens or hundreds of iterations, testing different starting points, temperatures, and redesign strategies. That requires automation.

This route is about engineering: taking the conceptual loop you understand and turning it into code that runs end-to-end without manual intervention.

By the end, you can:

  • Build an automated structure ↔ sequence cycling pipeline
  • Use open-source structure prediction tools (Boltz-1, OpenFold 3)
  • Run ProteinMPNN programmatically from Python
  • Log and visualize design metrics across many cycles
  • Debug GPU memory and API rate-limit issues

Route: The Hallucination Ascent (Automated)

  • RouteID: R025
  • Wall: Protein Design (W07)
  • Grade: 5.12a
  • Routesetter: Abhiram
  • Time: ~2-3 hours
  • Prerequisites: The Hallucination Ascent (Manual)

This is a project-level route. You will write real automation code, deal with GPU memory limits, and debug API calls. Multi-session work is expected.

🧗 Base Camp

Start here and climb your way up!