Sent!
You made it to the top. Submit your work above!
Submission
Deliverable
Submit one notebook that includes:
- Parsed AF3 output scores (ipTM, min_PAE) for all 9 pools (405 pairwise observations)
- Applied size correction using the Todor et al. formula
- Pool classification table: which pools contain a PPI?
- For positive pools: identified the specific interacting protein pair
- Decoded the secret message
- Brief interpretation of your analysis approach
Mission checklist
- Loaded protein_info.csv and all 9 pool JSON files
- Extracted chain_pair_iptm and chain_pair_pae_min from each pool
- Computed size-corrected ipTM using Todor et al. formula
- Classified each pool as PPI-positive or PPI-negative with justification
- Identified the interacting pair in each positive pool
- Decoded the hidden message by extracting amino acids from protein sequences
- Wrote brief interpretation
Exercise 4: Decode the Message
If your analysis is correct, the positive pools reveal a hidden message, encoded in the actual protein sequences.
The encoding
The secret message is hidden in the amino acid sequences of the interacting proteins. For each positive pool, you'll extract two letters (one from each protein in the interacting pair).
Step 1: Rank your positive pools from highest to lowest sc_ipTM
Step 2: For each positive pool, identify the interacting pair (protein_i and protein_j)
Step 3: Look up each protein's sequence in protein_info.csv
Step 4: Extract amino acids at these positions:
| Rank | From protein_i | Position | From protein_j | Position |
|---|---|---|---|---|
| 1 (highest sc_ipTM) | → | 35 | → | 8 |
| 2 | → | 31 | → | 4 |
| 3 | → | 55 | → | 10 |
| 4 | → | 6 | → | 114 |
| 5 (lowest sc_ipTM) | → | 5 | → | 16 |
Which protein is protein_i vs protein_j?
When you built your DataFrame, you extracted pairs with chain_i < chain_j. This means:
- protein_i = the protein at the lower chain index (e.g., chain 0)
- protein_j = the protein at the higher chain index (e.g., chain 1)
Sanity check: In protein_info.csv, look up your interacting pair. The protein with chain_idx = 0 is protein_i. The protein with chain_idx = 1 (or higher) is protein_j.
Verification example: If one of your positive pools has interacting pair b2283 + b2286, check protein_info.csv:
- b2283 has chain_idx = 0 → this is protein_i → extract from position in the "protein_i" column
- b2286 has chain_idx = 1 → this is protein_j → extract from position in the "protein_j" column
If you get the proteins swapped, your message will be gibberish, so double-check!
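The orientation rule above can be sketched as a small helper. This is only an illustration: `orient_pair` is a hypothetical name, and the `chain_idx` mapping is assumed to be built by you from protein_info.csv for a single pool.

```python
def orient_pair(id_a, id_b, chain_idx):
    """Return (protein_i, protein_j), ordered by chain index within one pool.

    chain_idx: dict mapping protein_id -> chain index for ONE pool
    (assumed to be built from protein_info.csv).
    """
    if chain_idx[id_a] == chain_idx[id_b]:
        raise ValueError("both proteins map to the same chain index")
    # The lower chain index is protein_i, the higher one is protein_j.
    return tuple(sorted((id_a, id_b), key=chain_idx.__getitem__))


# Verification example from the text: the order you pass the IDs in does not matter.
pair = orient_pair("b2286", "b2283", {"b2283": 0, "b2286": 1})
print(pair)  # ('b2283', 'b2286')
```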
Step 5: Concatenate all 10 amino acids (2 per pool × 5 pools) to reveal the message
Example
If your rank-1 pool's interacting pair is b1234 + b5678:
- Look up b1234's sequence → extract the amino acid at position 35
- Look up b5678's sequence → extract the amino acid at position 8
- These two letters are the first two characters of your message
Repeat for all 5 ranked pools. The 10 amino acids spell a recognizable phrase.
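A minimal sketch of the extraction loop follows. The position table is copied from Step 4; everything else is an assumption for illustration: the function name `decode`, the `sequences` dict (protein_id to sequence string, which you would build from protein_info.csv), and in particular the `one_based` flag. The instructions do not state whether positions count from 1 or from 0, so the sketch defaults to 1-based numbering and lets you switch.

```python
# (position in protein_i, position in protein_j) for ranks 1..5, from the Step 4 table.
POSITIONS = [(35, 8), (31, 4), (55, 10), (6, 114), (5, 16)]


def decode(ranked_pairs, sequences, positions=POSITIONS, one_based=True):
    """Concatenate two letters per ranked pool.

    ranked_pairs: list of (protein_i, protein_j) IDs, highest sc_ipTM first.
    sequences:    dict protein_id -> amino acid sequence (assumed input).
    one_based:    ASSUMPTION that position 1 is the first residue.
    """
    offset = 1 if one_based else 0
    letters = []
    for (prot_i, prot_j), (pos_i, pos_j) in zip(ranked_pairs, positions):
        letters.append(sequences[prot_i][pos_i - offset])
        letters.append(sequences[prot_j][pos_j - offset])
    return "".join(letters)


# Toy sequences (made up, not real data) with known letters at positions 35 and 8.
toy_sequences = {
    "b1234": "A" * 34 + "H" + "A" * 10,
    "b5678": "A" * 7 + "E" + "A" * 10,
}
print(decode([("b1234", "b5678")], toy_sequences))  # HE
```

If the result is gibberish under one indexing convention, trying the other is a cheap check before revisiting the pair assignments.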
Your task
What is the secret message?
Write it in a markdown cell. If it doesn't spell anything recognizable, double-check:
- Did you identify the correct interacting pairs?
- Did you rank the pools correctly by sc_ipTM?
- Did you extract from the right positions?
Exercise 3: Classify Pools and Identify Pairs
Now use your size-corrected scores to classify each pool.
Part 1: Pool Classification
For each of the 9 pools (A through I), determine:
- Does this pool contain a protein-protein interaction?
- What quantitative evidence supports your classification?
Hint: Pool together all 405 sc_ipTM values. The vast majority are non-interacting pairs, and this gives you your background distribution. Look for outliers that clearly separate from the background.
Deliverable: A table with columns:
| Pool | Contains PPI? | Top sc_ipTM | Top min_PAE | Justification |
|---|---|---|---|---|
Part 2: Pair Identification
For each pool you classified as PPI-positive:
- Which specific pair of proteins is interacting?
- What are their protein_ids (e.g., "b1234 + b5678")?
- What are the sc_ipTM and min_PAE values for this pair?
Deliverable: A table for each positive pool:
| Pool | Interacting Pair | sc_ipTM | min_PAE |
|---|---|---|---|
Decision guidance
Look at the distribution of your sc_ipTM values:
- What stands out from the background?
- How does min_PAE complement sc_ipTM?
True interacting pairs should stand out clearly from the background in at least one metric (ideally both).
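One possible way to make "stands out from the background" quantitative is a robust z-score based on the median and the median absolute deviation (MAD), which the few true positives barely influence. The sketch below is an assumption, not the method from Todor et al. or a required approach: the 0.6745 scaling and the cutoff of 3.5 are a common convention for modified z-scores, and the toy values are made up.

```python
from statistics import median


def robust_z(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD."""
    values = list(values)
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the background has no spread")
    return [0.6745 * (v - med) / mad for v in values]


# Toy sc_ipTM values: background near zero plus one clear outlier.
toy_sc_iptm = [-0.02, 0.01, 0.00, 0.03, -0.01, 0.02, 0.45]
CUTOFF = 3.5  # assumed convention; adjust after looking at your histogram
flagged = [k for k, z in enumerate(robust_z(toy_sc_iptm)) if z > CUTOFF]
print(flagged)  # [6]
```

Plotting a histogram of all 405 values next to any cutoff you choose is a good way to confirm the threshold sits in a real gap, and checking min_PAE for the flagged pairs gives a second, independent line of evidence.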
Exercise 2: Size Correction
Bigger proteins tend to get higher ipTM scores just because they're bigger, not because they actually interact. This is a known bias in AlphaFold3 (see Todor et al., 2025, Figure 2B).
The fix is simple: subtract the expected ipTM based on protein size. What's left, the size-corrected ipTM, tells you how much higher a pair scores than you'd expect for proteins of that size.
The formula
Apply this correction from Todor et al. to all 405 pairwise observations:
import numpy as np
size_corrected_iptm = raw_iptm - (0.0044 * np.sqrt(len_a + len_b) - 0.036)
After correction, background (non-interacting) pairs should hover around zero. Real interactions will stand out as clear positive outliers.
Reference: Todor et al. (2025), "Predicting the protein interactome of Mycoplasma genitalium with pooled AlphaFold3," Molecular Systems Biology.
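As a worked example, the formula above can be wrapped in a function that accepts either single values or whole columns. The function name and the example numbers (raw ipTM 0.30, lengths 150 and 250) are made up for illustration; only the coefficients come from the formula given above.

```python
import numpy as np


def size_corrected_iptm(raw_iptm, len_a, len_b):
    """raw ipTM minus the ipTM expected from combined length alone."""
    total_len = np.asarray(len_a) + np.asarray(len_b)
    expected = 0.0044 * np.sqrt(total_len) - 0.036
    return np.asarray(raw_iptm) - expected


# Combined length 400 -> sqrt = 20 -> expected ipTM = 0.0044 * 20 - 0.036 = 0.052
example = size_corrected_iptm(0.30, 150, 250)
print(float(example))  # about 0.248

# The same call works element-wise on lists, arrays, or DataFrame columns.
batch = size_corrected_iptm([0.30, 0.10], [150, 100], [250, 300])
```

Because NumPy broadcasts, passing the iptm, len_i, and len_j columns of your results table should give one corrected value per row.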
Exercise 1: Parse AF3 Outputs
Extract pairwise confidence scores from all 9 pool JSON files.
Your data files
Step 1: Download from Google Drive
Open: F037 Student Materials
You'll see a folder containing:
student_materials/
├── protein_info.csv ← protein metadata (23 KB)
├── README.md ← data description
└── pools/ ← folder with 9 subfolders
    ├── Pool_A/
    │   └── summary_confidences_0.json
    ├── Pool_B/
    │   └── summary_confidences_0.json
    ... (through Pool_I)
To download: Click the folder name student_materials at the top, then click the ⋮ menu (three dots) → Download. Google Drive will zip everything into one file.
Step 2: Upload to Colab
- In Colab, click the folder icon in the left sidebar
- Click the upload icon and upload the zip file
- Unzip with:
!unzip student_materials.zip
You should now have protein_info.csv and the pools/ folder in your Colab environment.
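A quick way to confirm the upload worked is to check that all ten expected files exist. This sketch assumes the layout shown in the folder listing and that the files sit in the current working directory; if the zip extracted into a student_materials/ subfolder instead, pass that path as `root`. The helper name `expected_files` is hypothetical.

```python
from pathlib import Path
from string import ascii_uppercase


def expected_files(root="."):
    """Paths for protein_info.csv plus the 9 pool JSON files (Pool_A..Pool_I)."""
    root = Path(root)
    pool_files = [
        root / "pools" / f"Pool_{letter}" / "summary_confidences_0.json"
        for letter in ascii_uppercase[:9]  # letters A through I
    ]
    return [root / "protein_info.csv", *pool_files]


missing = [str(p) for p in expected_files() if not p.exists()]
print("Missing files:", missing)  # an empty list means everything is in place
```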
What's in these files:
| File | Description |
|---|---|
| protein_info.csv | Pool membership, chain indices, protein IDs, lengths, sequences |
| pools/Pool_X/summary_confidences_0.json | AF3 confidence output for each pool (9 files total) |
What to extract
From each pool's summary_confidences_0.json, you need:
- chain_pair_iptm: a 10×10 matrix of ipTM values (one per chain pair)
- chain_pair_pae_min: a 10×10 matrix of minimum PAE values
For each unique pair (i, j) where i < j, extract the ipTM and take the minimum of the two PAE values, entries [i][j] and [j][i] (the PAE matrix is not symmetric).
Use protein_info.csv to look up protein_id and length for each chain index in each pool.
You did this in Route 37A ā adapt your practice code.
Build your results table
Create a DataFrame with columns: pool, chain_i, chain_j, protein_i, protein_j, len_i, len_j, iptm, min_pae
This is your raw data for size correction and analysis.
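The pair extraction described above could look roughly like the sketch below. The two JSON keys come from the description above; the function names `pair_rows` and `load_pool` are hypothetical, and the toy 3×3 matrices are made up so the example runs without the real files. Protein IDs and lengths are left out here and would still need to be joined in from protein_info.csv.

```python
import json
from itertools import combinations


def pair_rows(pool, summary):
    """One dict per unique chain pair (i < j) from a parsed summary JSON."""
    iptm = summary["chain_pair_iptm"]
    pae = summary["chain_pair_pae_min"]
    rows = []
    for i, j in combinations(range(len(iptm)), 2):  # yields only i < j
        rows.append({
            "pool": pool,
            "chain_i": i,
            "chain_j": j,
            "iptm": iptm[i][j],
            # PAE is directional, so keep the smaller of the two entries.
            "min_pae": min(pae[i][j], pae[j][i]),
        })
    return rows


def load_pool(path, pool):
    """Read one summary_confidences_0.json file and extract its pair rows."""
    with open(path) as fh:
        return pair_rows(pool, json.load(fh))


# Toy 3-chain summary (made-up numbers) to show the shape of the output.
toy = {
    "chain_pair_iptm": [[0.9, 0.2, 0.1], [0.2, 0.8, 0.7], [0.1, 0.7, 0.9]],
    "chain_pair_pae_min": [[0.8, 12.0, 20.0], [15.0, 0.8, 3.0], [22.0, 2.5, 0.8]],
}
toy_rows = pair_rows("Pool_A", toy)
print(len(toy_rows))  # 3
```

With 10 chains per pool, each real file should yield 45 rows, and a list of such dicts can be passed straight to a DataFrame constructor.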
Exercise 0: Setup and Context
The biological question
Which proteins in these pools physically interact?
You have AlphaFold3 predictions for 9 pools of E. coli proteins. Each pool contains 10 proteins co-folded together. Some pools contain a genuinely interacting protein pair; others do not.
Your task:
- Parse and organize the AF3 confidence scores
- Apply size correction to remove length bias
- Classify each pool as PPI-positive or PPI-negative
- For positive pools, identify the specific interacting pair
- Decode the hidden message
Dataset structure
- 9 pools (Pool_A through Pool_I)
- 10 proteins per pool → 10-choose-2 = 45 unique pairs per pool
- 405 total pairwise observations across all 9 pools
- Some pools contain a true interacting pair; others are noise
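The pair counts above are easy to verify, and the same numbers make a handy sanity check on the length of your final results table. The variable names here are illustrative only.

```python
from math import comb

N_POOLS = 9
PROTEINS_PER_POOL = 10

pairs_per_pool = comb(PROTEINS_PER_POOL, 2)  # unordered pairs, i < j
total_pairs = N_POOLS * pairs_per_pool
print(pairs_per_pool, total_pairs)  # 45 405
```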
Key metrics
| Metric | What it measures |
|---|---|
| ipTM | Interface predicted TM-score; higher suggests interaction |
| min_PAE | Minimum predicted aligned error (Å); lower suggests confident geometry |
| sc_ipTM | Size-corrected ipTM; removes length bias |
This route is less scaffolded
You know how to do this from practice routes 37A and 37B. Here you're given the goal and the key formula (Todor et al. size correction). You figure out the implementation.
Suggested chatbot prompt
If you get stuck at any point, try:
"I'm working on CHEM 169 Final Route F037. I have 9 pooled AF3 predictions with 10 E. coli proteins per pool. I need to parse confidence scores, apply size correction, and figure out which pools contain real protein-protein interactions. Help me [specific task]."
Intro
In Routes 37A and 37B, you learned:
- Pairwise AF3 gives one score per protein pair
- Pooled AF3 reduces false positives through competition
- Size correction (Todor et al.) removes bias from protein length
- ipTM and PAE are imperfect but informative metrics
Now you apply all of it to classify 9 pools of E. coli proteins, with less hand-holding.
Why precomputed data?
Running AlphaFold3 on the server can be slow (hours per pool), so we ran the predictions ahead of time. You're working with the output files from those runs ā the same JSON files you'd get if you ran AF3 yourself.
What's different from practice
| Practice | Final |
|---|---|
| Step-by-step starter code | Goal-oriented |
| Small control set (12 pairs) | Full analysis (405 pairs) |
| Interpretation questions | Escape room puzzle |
| Told which pairs are positive/negative | You discover the signal |
The escape room element
The proteins themselves encode a hidden message in their amino acid sequences. Only correct identification of the interacting pairs AND correct ranking by sc_ipTM will let you extract the right amino acids. If your analysis has errors, you'll get gibberish instead of a recognizable phrase.
Final Exam Route 037: Pooled AF3 PPI Analysis
- RouteID: F037
- Wall: Protein-Protein Interactions (W10)
- Grade: 5.11b
- Routesetter: Course Staff + Sarah V.
- Time: 1.5 hours in class + finish by end of day
- You'll need: AF3 output files, Python, scipy, plotting libraries
Base Camp
Start here and climb your way up!