
Practice Draft 37A - Pairwise AF3 Baseline (Control Set)

Route ID: R037A • Wall: W10 • Released: Mar 5, 2026

5.11a
published

🎉 Sent!

You made it to the top. Submit your work above!

Submission

Submit your notebook here


Deliverable

Submit a Colab notebook containing your work for this route. Your notebook will likely include preprocessing, JSON generation for AF3 batch upload, score parsing, and more. That's all great.

At minimum, your notebook must have:

  1. Your pairwise_scores.csv loaded and displayed (sorted by ipTM, descending)
  2. Two visualizations comparing positive vs decoy pairs (one for ipTM, one for PAE)
  3. Answers to the interpretation questions from Exercise 5 (in a markdown cell)

Your CSV should have 12 rows (6 positive + 6 decoy pairs) with columns: pair_id, type, iptm, ptm, pae_off_diag_avg.
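
A minimal sketch of deliverable item 1, using only the standard library: it checks that the five expected columns are present and sorts the rows by ipTM, highest first. The helper name load_scores and the two demo rows (made-up numbers) are illustrative assumptions, not part of the route files.

```python
import csv
import io

REQUIRED_COLUMNS = ["pair_id", "type", "iptm", "ptm", "pae_off_diag_avg"]

def load_scores(handle):
    """Read score rows from an open CSV file and sort them by ipTM, highest first."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    rows = []
    for row in reader:
        # The csv module returns strings; convert numeric columns so the sort is numeric.
        for col in ("iptm", "ptm", "pae_off_diag_avg"):
            row[col] = float(row[col])
        rows.append(row)
    return sorted(rows, key=lambda r: r["iptm"], reverse=True)

# Tiny in-memory stand-in for pairwise_scores.csv (made-up numbers, illustration only).
demo_csv = io.StringIO(
    "pair_id,type,iptm,ptm,pae_off_diag_avg\n"
    "DEC_01,decoy,0.21,0.55,24.3\n"
    "POS_01,positive,0.84,0.88,3.1\n"
)
demo_rows = load_scores(demo_csv)
for row in demo_rows:
    print(row["pair_id"], row["type"], row["iptm"], row["pae_off_diag_avg"])
```

With your real file, open it with open("pairwise_scores.csv", newline="") and pass the handle to load_scores. If you prefer pandas in Colab, pd.read_csv followed by sort_values("iptm", ascending=False) does the same job.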


Exercise 5: Visualization and Interpretation

Visualizations

Create two plots comparing positive vs decoy pairs:

  1. ipTM comparison: bar chart or box plot showing ipTM by pair type
  2. PAE comparison: bar chart or box plot showing PAE (off-diagonal) by pair type

Your chatbot can help with the plotting code (seaborn or matplotlib work great).
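
As a starting point, here is one possible matplotlib sketch for the two box plots. It assumes your rows are dictionaries with the CSV column names from the Deliverable section. The helper names group_by_type and plot_metric and the demo numbers are made up for illustration.

```python
from collections import defaultdict

def group_by_type(rows, metric):
    """Collect one metric into {pair type: [values]} so it can be plotted per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["type"]].append(float(row[metric]))
    return dict(groups)

def plot_metric(rows, metric, ylabel):
    """Draw a box plot of one metric split by pair type and return the figure."""
    import matplotlib.pyplot as plt  # imported lazily: only needed when plotting

    groups = group_by_type(rows, metric)
    labels = sorted(groups)
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.boxplot([groups[label] for label in labels])
    # Box plots are drawn at x = 1, 2, ...; label those positions with the pair types.
    ax.set_xticks(list(range(1, len(labels) + 1)))
    ax.set_xticklabels(labels)
    ax.set_ylabel(ylabel)
    ax.set_title(f"{ylabel} by pair type")
    return fig

# Made-up rows, for illustration only.
demo_rows = [
    {"pair_id": "POS_01", "type": "positive", "iptm": 0.84, "pae_off_diag_avg": 3.1},
    {"pair_id": "POS_02", "type": "positive", "iptm": 0.79, "pae_off_diag_avg": 4.0},
    {"pair_id": "DEC_01", "type": "decoy", "iptm": 0.21, "pae_off_diag_avg": 24.3},
]
demo_groups = group_by_type(demo_rows, "iptm")

# In a notebook cell you would then call, for example:
#   plot_metric(rows, "iptm", "ipTM")
#   plot_metric(rows, "pae_off_diag_avg", "Off-diagonal PAE")
```

A box plot is used here because it shows the spread within each group, which matters with only 6 pairs per type. A bar chart of group means would also satisfy the exercise.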

Interpretation

In a markdown cell, answer:

  1. What's the average ipTM for positive pairs vs decoy pairs?
  2. What's the average PAE (off-diagonal) for positive pairs vs decoy pairs?
  3. Is there clear separation between the two groups on both metrics?
  4. Which pair scored best (high ipTM, low PAE)? Which scored worst?
  5. Based on these results, do ipTM and PAE seem useful for distinguishing real interactions from random pairs?
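
The numbers behind questions 1 to 4 can be computed with a few lines of Python. This is a sketch under stated assumptions: rows are dictionaries with the CSV column names, and "clear separation" is read strictly as "no overlap between the two groups", which is only one possible definition. The helper names and demo values are made up.

```python
from statistics import mean

def values(rows, pair_type, metric):
    """All values of one metric for one pair type."""
    return [float(r[metric]) for r in rows if r["type"] == pair_type]

def group_means(rows, metric):
    """Average of a metric for positive and decoy pairs."""
    return {t: mean(values(rows, t, metric)) for t in ("positive", "decoy")}

def no_overlap(rows, metric, higher_is_better):
    """True when every positive pair beats every decoy pair on this metric."""
    pos = values(rows, "positive", metric)
    dec = values(rows, "decoy", metric)
    if higher_is_better:
        return min(pos) > max(dec)
    return max(pos) < min(dec)

# Made-up rows, for illustration only.
demo_rows = [
    {"pair_id": "POS_01", "type": "positive", "iptm": 0.84, "pae_off_diag_avg": 3.0},
    {"pair_id": "POS_02", "type": "positive", "iptm": 0.40, "pae_off_diag_avg": 12.0},
    {"pair_id": "DEC_01", "type": "decoy", "iptm": 0.20, "pae_off_diag_avg": 24.0},
    {"pair_id": "DEC_02", "type": "decoy", "iptm": 0.45, "pae_off_diag_avg": 20.0},
]
iptm_means = group_means(demo_rows, "iptm")
pae_means = group_means(demo_rows, "pae_off_diag_avg")
best_pair = max(demo_rows, key=lambda r: r["iptm"])["pair_id"]

print("mean ipTM:", iptm_means)
print("mean PAE:", pae_means)
print("ipTM groups do not overlap:", no_overlap(demo_rows, "iptm", higher_is_better=True))
print("PAE groups do not overlap:", no_overlap(demo_rows, "pae_off_diag_avg", higher_is_better=False))
print("highest ipTM:", best_pair)
```

In the made-up data the groups separate on PAE but not on ipTM, because one decoy outscores one positive. Cases like that are exactly what question 5 asks you to reason about.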

Exercise 4: Parse All Scores

Now that you have all 12 job outputs, build your pairwise score table.

Your goal: Create a CSV with one row per pair containing:

  • pair_id (extracted from filename)
  • type (positive or decoy)
  • iptm
  • ptm
  • pae_off_diag_avg (average of the off-diagonal values in chain_pair_pae_min)

Starter code for parsing one JSON file:

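import json

# Load one summary file (swap in the path to one of your *_summary_confidences_0.json files)
with open('pos_01_summary.json') as f:
    d = json.load(f)

iptm = d['iptm']                      # interface predicted TM-score (0-1)
ptm = d['ptm']                        # predicted TM-score
pae_matrix = d['chain_pair_pae_min']  # 2x2 matrix for a two-chain job
# Average the two inter-chain (off-diagonal) entries
pae_off_diag_avg = (pae_matrix[0][1] + pae_matrix[1][0]) / 2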

Your job: loop through all 12 files and build a table.


🅰️ Path A (Colab): Upload your summary JSON files to Colab. Write Python code to loop through them, parse each one, and build a DataFrame. Your chatbot can help.


🅱️ Path B (Local agent): This is where coding agents really shine. Point Claude Code (or similar) at your extracted folder and ask it to find all *_summary_confidences_0.json files, extract the scores, and output a merged CSV. Done in seconds, with no Colab upload needed.


Save your final table as pairwise_scores.csv.
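
One way the loop could look, as a sketch rather than a reference solution. It assumes every filename contains the pair ID in a form like pos_01 or dec_03 (upper or lower case) and that all summary files sit in one folder. The helper names and the demo filename are hypothetical.

```python
import csv
import json
import re
from pathlib import Path

FIELDS = ["pair_id", "type", "iptm", "ptm", "pae_off_diag_avg"]

def parse_pair_id(filename):
    """Pull an ID like POS_01 or DEC_03 out of a filename (assumed naming pattern)."""
    match = re.search(r"(pos|dec)_(\d+)", filename, re.IGNORECASE)
    if match is None:
        raise ValueError(f"No POS_/DEC_ pair ID found in {filename!r}")
    kind, number = match.group(1).upper(), match.group(2)
    pair_type = "positive" if kind == "POS" else "decoy"
    return f"{kind}_{number}", pair_type

def score_row(filename, summary):
    """Turn one parsed summary JSON (a dict) into one row of the score table."""
    pair_id, pair_type = parse_pair_id(filename)
    pae = summary["chain_pair_pae_min"]
    return {
        "pair_id": pair_id,
        "type": pair_type,
        "iptm": summary["iptm"],
        "ptm": summary["ptm"],
        "pae_off_diag_avg": (pae[0][1] + pae[1][0]) / 2,
    }

def build_table(folder):
    """Parse every model-0 summary file in a folder into a list of rows."""
    rows = []
    for path in sorted(Path(folder).glob("*_summary_confidences_0.json")):
        with open(path) as f:
            rows.append(score_row(path.name, json.load(f)))
    return rows

def write_table(rows, out_path="pairwise_scores.csv"):
    """Write the rows to CSV with the column order the deliverable asks for."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

# In-memory check with a made-up summary and a hypothetical filename.
demo_summary = {"iptm": 0.84, "ptm": 0.88, "chain_pair_pae_min": [[0.8, 3.0], [4.0, 0.8]]}
demo_row = score_row("fold_pw_001_pos_01_demo_summary_confidences_0.json", demo_summary)
print(demo_row)
```

With your files in a folder called, say, summaries, write_table(build_table("summaries")) produces pairwise_scores.csv. Check that you get 12 rows before moving on.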


Exercise 3: Batch Submission

You've submitted 2 jobs by hand. Now let's submit the remaining 10 more efficiently.

AlphaFold Server lets you upload multiple job requests at once using a JSON file. Instead of manually entering sequences one by one, you'll:

  1. Convert your CSV data into the JSON format AF3 expects
  2. Upload that JSON file to create all 10 jobs as drafts
  3. Submit each draft (still requires clicking, but no more copy-pasting sequences)

📺 Video walkthrough: Coming soon. Follow the written steps below (see the batch upload section).

Step 1: Build the JSON file

Write Python code to convert your CSV into a JSON file that AF3 Server accepts.

Resources:

Each job in the JSON needs: a name, two protein chains with sequences, and the required dialect/version fields. Use your chatbot to help write the conversion code.

Important: Exclude the 2 pairs you already submitted manually.
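
A hedged sketch of the conversion. The field names (name, modelSeeds, sequences, proteinChain, count, dialect, version) reflect the AlphaFold Server batch format as commonly documented, but treat them as an assumption and compare against a *_job_request.json from one of your own manual jobs. The pair_id column name, the helper names, and the placeholder sequences are also assumptions.

```python
import json

def make_job(name, seq_a, seq_b):
    """One AF3 Server job with two protein chains (field names are an assumption)."""
    return {
        "name": name,
        "modelSeeds": [],  # empty list: let the server pick the seed
        "sequences": [
            {"proteinChain": {"sequence": seq_a, "count": 1}},
            {"proteinChain": {"sequence": seq_b, "count": 1}},
        ],
        "dialect": "alphafoldserver",
        "version": 1,
    }

def build_batch(rows, wanted, already_submitted):
    """Keep only the wanted pairs that were not already submitted by hand."""
    jobs = []
    for row in rows:
        pair_id = row["pair_id"]  # assumed column name for the POS_*/DEC_* label
        if pair_id in wanted and pair_id not in already_submitted:
            jobs.append(make_job(pair_id, row["sequence_a"], row["sequence_b"]))
    return jobs

# The 12 pairs for this route: POS_01..POS_06 and DEC_01..DEC_06.
wanted = [f"{prefix}_{i:02d}" for prefix in ("POS", "DEC") for i in range(1, 7)]

# Placeholder rows with dummy sequences; real rows come from your CSV.
demo_rows = [{"pair_id": pid, "sequence_a": "MKV", "sequence_b": "MTA"}
             for pid in wanted + ["POS_07"]]
demo_jobs = build_batch(demo_rows, set(wanted), already_submitted={"POS_01", "DEC_01"})
batch_json = json.dumps(demo_jobs, indent=2)
print(len(demo_jobs), "jobs in batch")

# To save the file for upload:
#   with open("batch_jobs.json", "w") as f:
#       f.write(batch_json)
```

Replace the already_submitted set with the two pair IDs you actually ran in Exercise 1.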

Step 2: Submit each draft

Important: Uploading a JSON file creates drafts, not running jobs. You must submit each draft individually:

  1. Upload your JSON file to AF3 Server (this creates 10 drafts)
  2. Click on each draft → Continue → Preview Job → Confirm and Submit
  3. Repeat for all 10 drafts
  4. Monitor progress on the AF3 Server dashboard
  5. Download results when jobs complete

Yes, this is tedious. Click, click, click, click...

Annoyed yet? Route 37C (worth 8 routes of extra credit) challenges you to build a browser automation tool that handles this clicking for you. If the repetition is driving you crazy, channel that frustration into engineering.

Downloading and organizing your outputs

When you select completed jobs on AF3 Server and click Download, you get one big zip containing all selected jobs. Each job folder inside has MSA files (huge), templates, and multiple model seeds.

The problem: These zips are 50-150 MB each, mostly MSA files you don't need.

Choose your path:


🅰️ Path A: Colab (familiar, guided)

  1. Download the zip to your local machine
  2. Extract it (double-click on Mac/Windows)
  3. You'll see folders like pw_001_pos_01_.../, pw_002_pos_02_.../, etc.
  4. Inside each folder, find *_summary_confidences_0.json and ignore _1, _2, _3, _4 (we only need model 0)
  5. Copy just those 12 small JSON files (~350 bytes each) to a folder
  6. Upload that folder to Colab
  7. Write Python code to parse the filenames and extract scores (your chatbot can help)
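
If copying 12 files by hand sounds tedious, steps 4 and 5 can be scripted on your own machine. The sketch below searches an extracted folder recursively and copies only the model-0 summary files into one destination folder. The function name and the temporary demo folders are illustrative. It also assumes each summary filename is unique, since files with the same name would overwrite each other.

```python
import shutil
import tempfile
from pathlib import Path

PATTERN = "*_summary_confidences_0.json"

def copy_summaries(extracted_root, dest):
    """Copy every model-0 summary file under extracted_root into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(extracted_root).rglob(PATTERN)):
        # Filenames are assumed unique across jobs, so a flat copy is safe.
        copied.append(Path(shutil.copy2(path, dest / path.name)))
    return copied

# Self-contained demo using a throwaway folder with fake, empty JSON files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "extracted"
    job_dir = root / "pw_001_pos_01_demo"
    job_dir.mkdir(parents=True)
    (job_dir / "fold_demo_summary_confidences_0.json").write_text("{}")
    (job_dir / "fold_demo_summary_confidences_1.json").write_text("{}")
    copied = copy_summaries(root, Path(tmp) / "summaries")
    copied_names = [p.name for p in copied]

print(copied_names)
```

On your real download, call copy_summaries with the extracted zip folder and a new folder name, then upload that folder to Colab.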

🅱️ Path B: Local coding agent (powerful, real-world)

Use Claude Code, Cursor, or another AI coding agent to do the extraction locally:

  1. Download and extract the zip
  2. Point your agent at the folder and ask it to:
    • Find all *_summary_confidences_0.json files
    • Extract iptm, ptm, and chain_pair_pae_min from each
    • Parse the pair ID from the filename
    • Output a merged CSV

This is how professionals handle messy data wrangling: the agent does it in seconds.

New to coding agents? Check out the AI-Assisted Coding wall (W11) for setup guides.



Exercise 2: Explore AF3 Outputs

Once your 2 manual jobs complete, download the results and learn what's inside before scaling up.

What's in each job folder?

AF3 produces 5 model seeds (numbered 0–4) per job. These are independent predictions with slightly different results. For this route, we'll use model 0 only to keep things simple.

File pattern                  | What it contains                                  | Size
*_summary_confidences_0.json  | Key metrics: ipTM, pTM, chain_pair_pae_min        | ~350 bytes
*_full_data_0.json            | Detailed data: full PAE matrix, per-residue pLDDT | ~500 KB
*_model_0.cif                 | 3D structure file                                 | ~165 KB
*_job_request.json            | Your input (useful as batch template)             | ~1 KB
msas/ folder                  | Multiple sequence alignments                      | Huge (MB each)
templates/ folder             | Structural templates                              | Large

You only need *_summary_confidences_0.json. Ignore the _1, _2, _3, _4 variants and all the other files.

Get the summary files

  1. Download your completed jobs from AF3 Server (you'll get one zip file)
  2. Extract it locally. You'll see folders for each job
  3. Inside each folder, find *_summary_confidences_0.json (ignore _1, _2, etc.; we only need model 0)
  4. For these 2 test files, you can manually copy them to Colab or explore locally

Find the key metrics

Open each JSON file and locate:

  1. iptm: the interface predicted TM-score (0–1). Higher = more confident the chains interact.
  2. chain_pair_pae_min: a 2×2 matrix of minimum PAE between chains. The off-diagonal values ([0][1] and [1][0]) show inter-chain confidence. Lower = better.

Compare positive vs decoy

Look at your two pairs:

  • Does the positive pair have higher ipTM than the decoy?
  • Does the positive pair have lower off-diagonal PAE?

This comparison gives you intuition before you scale up to all 12 pairs.

Note: The full PAE matrix lives in *_full_data_0.json under the pae key, if you want to dig deeper later.
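
If you do dig deeper, here is a sketch of one thing you could compute from the full matrix: the mean PAE over all residue pairs that sit in different chains. It assumes the full-data JSON also provides a per-token chain label list (called token_chain_ids here; verify the key name in your own file). The tiny 3×3 matrix is made up.

```python
def mean_interchain_pae(pae, token_chain_ids):
    """Average PAE over all (i, j) entries whose tokens belong to different chains."""
    total, count = 0.0, 0
    for i, row in enumerate(pae):
        for j, value in enumerate(row):
            if token_chain_ids[i] != token_chain_ids[j]:
                total += value
                count += 1
    if count == 0:
        raise ValueError("No inter-chain entries: is there only one chain?")
    return total / count

# Made-up example: two tokens in chain A, one token in chain B.
demo_pae = [
    [0.5, 1.0, 10.0],
    [1.0, 0.5, 12.0],
    [8.0, 14.0, 0.5],
]
demo_chain_ids = ["A", "A", "B"]
demo_mean = mean_interchain_pae(demo_pae, demo_chain_ids)
print(demo_mean)
```

This mean is a different statistic from chain_pair_pae_min, which reports the minimum for each chain pair, so expect it to be larger than the value in the summary file.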


Exercise 1: Manual Submission

Submit 2 jobs by hand to learn the AlphaFold Server interface.

📺 Video walkthrough: Coming soon. Follow the written steps below.

Steps

  1. Go to AlphaFold Server
  2. Pick one positive pair (e.g., POS_01) and one decoy pair (e.g., DEC_01) from your CSV
  3. Grab the sequences from sequence_a and sequence_b, either from your Colab notebook or by opening the CSV in Excel
  4. For each pair: create a new job with 2 protein chains and paste the sequences
  5. Name your jobs clearly using UniProt IDs (e.g., POS_01_P9WHU1_P9WHT9)
  6. Submit and wait for results (~5–10 min)

Exercise 0: Setup and Files

You are given one CSV file for this route:

  1. r037A_pairwise_jobs_with_sequences.csv

This is your run sheet: a table of protein pairs to submit to AlphaFold3 in pairwise mode. Each row is one AF3 job: two proteins that you'll predict as a complex. Sequences are included, so there is no need to fetch them from UniProt.

What are Rv IDs? The rv_a and rv_b columns contain Rv numbers (e.g., Rv2109c). These are systematic gene identifiers for Mycobacterium tuberculosis H37Rv, the reference strain used in TB research. Every gene in the genome gets an "Rv" number based on its position. Think of them like street addresses for Mtb genes.

The file contains 22 pairs split into two categories:

  • POS_* rows (10 pairs): Known interactors, i.e. proteins with strong experimental evidence of physical interaction in the STRING database
  • DEC_* rows (12 pairs): Random decoys, i.e. proteins with no known interaction, serving as negative controls

For this route, you will submit 12 pairs: POS_01 through POS_06 and DEC_01 through DEC_06. This keeps the workload manageable while still giving you a mix of positives and negatives to compare.

Your goal is to see whether AF3's ipTM scores can distinguish real interactions from background noise.

📊 How was this dataset built?

The positive pairs come from the STRING database for Mycobacterium tuberculosis (taxid 83332). We selected pairs with combined_score ≥ 900 (STRING scores range 0–1000, so 900+ indicates high-confidence interactions). This gave us ~1,857 unique pairs total, from which we sampled for this exercise.

Decoy pairs are random pairs of proteins that are not expected to interact in real biology. We selected proteins with low experimental evidence (≤100 in STRING) and paired them randomly from the Mtb proteome. These serve as negative controls: if AF3 gives them high scores, that's a false positive.

Dataset curation: Sarah Veskimägi
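
To make the positive-pair filter concrete, here is an illustrative sketch of the combined_score ≥ 900 step. It is not the curator's actual pipeline. It assumes a whitespace-separated STRING-style links table with columns protein1, protein2, combined_score, in which each interaction may be listed in both directions. The identifiers and scores below are placeholders.

```python
def high_confidence_pairs(lines, min_score=900):
    """Return the set of unordered protein pairs with combined_score >= min_score."""
    pairs = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[0] == "protein1":
            continue  # skip the header and malformed lines
        a, b, score = parts
        if int(score) >= min_score:
            # Sort the two IDs so (A, B) and (B, A) count as the same pair.
            pairs.add(tuple(sorted((a, b))))
    return pairs

# Placeholder lines in the assumed format (IDs and scores are made up).
demo_lines = [
    "protein1 protein2 combined_score",
    "83332.geneA 83332.geneB 950",
    "83332.geneB 83332.geneA 950",
    "83332.geneA 83332.geneC 420",
]
demo_pairs = high_confidence_pairs(demo_lines)
print(demo_pairs)
```

Deduplicating on the sorted pair is what turns a list of directed rows into a count of unique pairs.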

What to do right now

  1. Download r037A_pairwise_jobs_with_sequences.csv and load it in a Google Colab notebook
  2. Explore the columns β€” sequences are in sequence_a and sequence_b
  3. Filter to the 12 pairs you'll submit: POS_01–POS_06 and DEC_01–DEC_06
  4. Move on to Exercise 1
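
A small sketch of step 3 using the standard library. The sequence_a, sequence_b, rv_a, and rv_b column names come from this page, but the name of the column holding the POS_*/DEC_* label is an assumption (pair_id here), so check your CSV header first. The demo rows use placeholder values.

```python
import csv
import io

# POS_01..POS_06 and DEC_01..DEC_06
WANTED = {f"{prefix}_{i:02d}" for prefix in ("POS", "DEC") for i in range(1, 7)}

def filter_pairs(handle, id_column="pair_id"):
    """Keep only the rows whose ID is one of the 12 pairs for this route."""
    reader = csv.DictReader(handle)
    return [row for row in reader if row[id_column] in WANTED]

# In-memory stand-in for the run sheet (placeholder Rv IDs and sequences).
demo_csv = io.StringIO(
    "pair_id,rv_a,rv_b,sequence_a,sequence_b\n"
    "POS_01,RvAAAA,RvBBBB,MKV,MTA\n"
    "POS_07,RvCCCC,RvDDDD,MKV,MTA\n"
    "DEC_06,RvEEEE,RvFFFF,MKV,MTA\n"
    "DEC_12,RvGGGG,RvHHHH,MKV,MTA\n"
)
demo_selected = filter_pairs(demo_csv)
print([row["pair_id"] for row in demo_selected])
```

On the real file, open r037A_pairwise_jobs_with_sequences.csv with newline="" and confirm that the filtered list has 12 rows.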

Route goal: Build a clear pairwise baseline of ipTM scores. In Route 37B, you will compare this baseline against pooled-AF3 results.


Intro

In this route, you will explore a core biological question:

Which proteins physically interact with each other?

We have talked a lot about proteins as individual molecules, but in real cells most functions come from protein-protein interactions (PPIs):

  • enzyme + regulator complexes
  • transport and signaling assemblies
  • multi-subunit molecular machines

For this exercise, you will use AlphaFold3 as a computational tool to test candidate PPIs in pairwise mode (one pair at a time).

These two routes (37A/37B) are inspired by:

  • Todor et al. (2026), Predicting the protein interaction landscape of a free-living bacterium with pooled-AlphaFold3 DOI: 10.1038/s44320-026-00189-7

Concept sketch

Protein A + Protein B  --(AF3 prediction)-->  predicted complex
                                         |
                                         v
                              interface confidence (ipTM)

If a pair gets high interface confidence, that is evidence supporting interaction. But pairwise runs can also produce spurious high scores, so this route is intentionally a baseline.

In Route 37B, you will compare this baseline against pooled competition (real pairs mixed with decoys), which is usually more realistic and stricter.

The goal is not to blindly trust one score, but to build scientific judgment under noisy evidence.

Suggested prompt exploration

"I am new to protein-protein interactions. Explain PPIs in simple terms, then explain what AF3 pairwise prediction is measuring, what ipTM means, and why high ipTM does not automatically prove a real biological interaction."

"Teach me the minimum PPI background I need for this route: what true positives and decoys are, why we run both, and what kinds of false positives can appear in pairwise AF3 results."

"I uploaded the pooled-AF3 paper (Todor et al., 2026). Summarize it for a CHEM 169 student, focusing on: (1) pairwise vs pooled strategy, (2) why pooled runs reduce false positives, (3) the size-bias correction formula, and (4) why performance is good but imperfect."


Route 037A: Pairwise AF3 Baseline

  • RouteID: 037A
  • Wall: Protein-Protein Interactions (W10)
  • Grade: 5.11a
  • Routesetter: Adrian + Sarah V.
  • Time: ~40-60 minutes (+ queue time)
  • You'll need: Instructor-provided pair files and AF3 account.

🧗 Base Camp

Start here and climb your way up!