
Final Exam Route 37 - Pooled AF3 PPI Analysis

Route ID: F037 • Wall: W10 • Released: Mar 16, 2026

5.11b
ready

🎉 Sent!

You made it to the top. Submit your work above!

Submission

Submit your notebook here


Deliverable

Submit one notebook that includes:

  1. Parsed AF3 output scores (ipTM, min_PAE) for all 9 pools (405 pairwise observations)
  2. Applied size correction using the Todor et al. formula
  3. Pool classification table: which pools contain a PPI?
  4. For positive pools: identified the specific interacting protein pair
  5. Decoded the secret message
  6. Brief interpretation of your analysis approach

Mission checklist

  • Loaded protein_info.csv and all 9 pool JSON files
  • Extracted chain_pair_iptm and chain_pair_pae_min from each pool
  • Computed size-corrected ipTM using Todor et al. formula
  • Classified each pool as PPI-positive or PPI-negative with justification
  • Identified the interacting pair in each positive pool
  • Decoded the hidden message by extracting amino acids from protein sequences
  • Wrote brief interpretation

Exercise 4: Decode the Message

If your analysis is correct, the positive pools reveal a hidden message — encoded in the actual protein sequences.

The encoding

The secret message is hidden in the amino acid sequences of the interacting proteins. For each positive pool, you'll extract two letters (one from each protein in the interacting pair).

Step 1: Rank your positive pools from highest to lowest sc_ipTM

Step 2: For each positive pool, identify the interacting pair (protein_i and protein_j)

Step 3: Look up each protein's sequence in protein_info.csv

Step 4: Extract amino acids at these positions:

Rank | From protein_i | Position | From protein_j | Position
--- | --- | --- | --- | ---
1 (highest sc_ipTM) | ✓ | 35 | ✓ | 8
2 | ✓ | 31 | ✓ | 4
3 | ✓ | 55 | ✓ | 10
4 | ✓ | 6 | ✓ | 114
5 (lowest sc_ipTM) | ✓ | 5 | ✓ | 16

Which protein is protein_i vs protein_j?

When you built your DataFrame, you extracted pairs with chain_i < chain_j. This means:

  • protein_i = the protein at the lower chain index (e.g., chain 0)
  • protein_j = the protein at the higher chain index (e.g., chain 1)

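Sanity check: In protein_info.csv, look up your interacting pair. Whichever protein has the lower chain_idx is protein_i; the one with the higher chain_idx is protein_j. For example, if the pair sits at chain_idx 0 and 1, chain 0 is protein_i and chain 1 is protein_j.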

Verification example: If one of your positive pools has interacting pair b2283 + b2286, check protein_info.csv:

  • b2283 has chain_idx = 0 → this is protein_i → extract from position in the "protein_i" column
  • b2286 has chain_idx = 1 → this is protein_j → extract from position in the "protein_j" column

If you get the proteins swapped, your message will be gibberish — so double-check!
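One way to avoid a swap is to let the chain indices decide the order for you. This is a minimal sketch: the `chain_idx` dictionary and the `order_pair` helper are hypothetical names, and the index values simply mirror the verification example above.

```python
# Chain indices for one pool, keyed by protein_id (mirrors the example above).
chain_idx = {"b2286": 1, "b2283": 0}


def order_pair(protein_a, protein_b, chain_idx):
    """Return (protein_i, protein_j) with protein_i at the lower chain index."""
    return tuple(sorted((protein_a, protein_b), key=chain_idx.__getitem__))


# The input order does not matter; the lower chain index always comes first.
protein_i, protein_j = order_pair("b2286", "b2283", chain_idx)
print(protein_i, protein_j)  # b2283 b2286
```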

Step 5: Concatenate all 10 amino acids (2 per pool × 5 pools) to reveal the message

Example

If your rank-1 pool's interacting pair is b1234 + b5678:

  • Look up b1234's sequence → extract the amino acid at position 35
  • Look up b5678's sequence → extract the amino acid at position 8
  • These two letters are the first two characters of your message

Repeat for all 5 ranked pools. The 10 amino acids spell a recognizable phrase.
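The whole decoding procedure can be sketched in a few lines. Treat this as an illustration, not the reference solution: the dictionary keys (`sc_iptm`, `seq_i`, `seq_j`) and helper names are hypothetical, the sequences are artificial, and the code assumes positions are 1-based, which you should confirm against your data before trusting the output.

```python
# Positions per rank, copied from the table above: (protein_i position, protein_j position).
POSITIONS = [(35, 8), (31, 4), (55, 10), (6, 114), (5, 16)]


def residue_at(sequence, position):
    """Return the amino acid at a 1-based position (1-based indexing is an assumption)."""
    return sequence[position - 1]


def decode(positive_pools):
    """positive_pools: list of dicts with sc_iptm, seq_i, seq_j (hypothetical keys)."""
    # Rank 1 is the pool with the highest sc_ipTM.
    ranked = sorted(positive_pools, key=lambda p: p["sc_iptm"], reverse=True)
    letters = []
    for (pos_i, pos_j), pool in zip(POSITIONS, ranked):
        letters.append(residue_at(pool["seq_i"], pos_i))
        letters.append(residue_at(pool["seq_j"], pos_j))
    return "".join(letters)


# Two made-up pools with artificial sequences, just to exercise the logic.
toy_pools = [
    {"sc_iptm": 0.41, "seq_i": "G" * 30 + "Y" + "G" * 5, "seq_j": "G" * 3 + "A" + "G" * 5},
    {"sc_iptm": 0.63, "seq_i": "G" * 34 + "H" + "G" * 5, "seq_j": "G" * 7 + "I" + "G" * 5},
]
print(decode(toy_pools))  # HIYA
```

With your real data, `positive_pools` would hold five entries, one per positive pool, with `seq_i` and `seq_j` taken from protein_info.csv after ordering the pair by chain index.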

Your task

What is the secret message?

Write it in a markdown cell. If it doesn't spell anything recognizable, double-check:

  • Did you identify the correct interacting pairs?
  • Did you rank the pools correctly by sc_ipTM?
  • Did you extract from the right positions?

Exercise 3: Classify Pools and Identify Pairs

Now use your size-corrected scores to classify each pool.

Part 1 — Pool Classification

For each of the 9 pools (A through I), determine:

  • Does this pool contain a protein-protein interaction?
  • What quantitative evidence supports your classification?

Hint: Pool together all 405 sc_ipTM values. The vast majority are non-interacting pairs — this gives you your background distribution. Look for outliers that clearly separate from the background.

Deliverable: A table with columns:

Pool | Contains PPI? | Top sc_ipTM | Top min_PAE | Justification

Part 2 — Pair Identification

For each pool you classified as PPI-positive:

  • Which specific pair of proteins is interacting?
  • What are their protein_ids (e.g., "b1234 + b5678")?
  • What are the sc_ipTM and min_PAE values for this pair?

Deliverable: A table for each positive pool:

Pool | Interacting Pair | sc_ipTM | min_PAE
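To fill in a table like this, you need the best-scoring pair in each pool. Here is one possible sketch using plain dictionaries; the protein IDs and scores are made up, and the `sc_iptm` key is an assumed column name for your size-corrected score.

```python
# Toy rows shaped like the results table; IDs and scores are made up.
rows = [
    {"pool": "Pool_A", "protein_i": "b0001", "protein_j": "b0002", "sc_iptm": 0.02, "min_pae": 19.0},
    {"pool": "Pool_A", "protein_i": "b0001", "protein_j": "b0003", "sc_iptm": 0.48, "min_pae": 2.4},
    {"pool": "Pool_B", "protein_i": "b0004", "protein_j": "b0005", "sc_iptm": -0.01, "min_pae": 22.0},
    {"pool": "Pool_B", "protein_i": "b0004", "protein_j": "b0006", "sc_iptm": 0.03, "min_pae": 17.5},
]

# Keep the highest-scoring pair seen so far for each pool.
top_by_pool = {}
for row in rows:
    best = top_by_pool.get(row["pool"])
    if best is None or row["sc_iptm"] > best["sc_iptm"]:
        top_by_pool[row["pool"]] = row

for pool in sorted(top_by_pool):
    top = top_by_pool[pool]
    pair = f'{top["protein_i"]} + {top["protein_j"]}'
    print(pool, pair, top["sc_iptm"], top["min_pae"])
```

The top pair in a pool is only a candidate. Whether it counts as a real interaction still depends on how far it sits from the background.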

Decision guidance

Look at the distribution of your sc_ipTM values:

  • What stands out from the background?
  • How does min_PAE complement sc_ipTM?

True interacting pairs should stand out clearly from the background in at least one metric (ideally both).
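If you want a number to back up "stands out from the background," one option is a robust z-score based on the median and the median absolute deviation (MAD), since a few strong outliers barely shift either statistic. This is just one reasonable approach, not the required one; the scores below are made up and the cutoff is purely illustrative.

```python
from statistics import median


def robust_z_scores(values):
    """Distance of each value from the median, in units of scaled MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the background has no spread to scale by")
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation for normal data
    return [(v - center) / scale for v in values]


# Made-up scores: mostly background near zero plus one obvious outlier.
toy_sc_iptm = [-0.03, -0.01, 0.00, 0.01, 0.02, -0.02, 0.03, 0.01, -0.01, 0.52]
z = robust_z_scores(toy_sc_iptm)

CUTOFF = 5.0  # illustrative threshold only; choose yours from your own distribution
flagged = [v for v, score in zip(toy_sc_iptm, z) if score > CUTOFF]
print(flagged)  # [0.52]
```

A histogram of all 405 values is still worth plotting alongside this, because a visible gap between background and outliers is easier to justify than any single cutoff.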


Exercise 2: Size Correction

Bigger proteins tend to get higher ipTM scores just because they're bigger — not because they actually interact. This is a known bias in AlphaFold3 (see Todor et al., 2025, Figure 2B).

The fix is simple: subtract the expected ipTM based on protein size. What's left — the size-corrected ipTM — tells you how much higher a pair scores than you'd expect for proteins of that size.

The formula

Apply this correction from Todor et al. to all 405 pairwise observations:

import numpy as np

size_corrected_iptm = raw_iptm - (0.0044 * np.sqrt(len_a + len_b) - 0.036)

After correction, background (non-interacting) pairs should hover around zero. Real interactions will stand out as clear positive outliers.
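Here is a small worked example of the same formula, using only the standard library. The function names are illustrative and the numbers are made up, chosen so the combined length is 400 and the square root comes out to exactly 20.

```python
import math


def expected_iptm(len_a, len_b):
    """Size-dependent baseline from the formula quoted above."""
    return 0.0044 * math.sqrt(len_a + len_b) - 0.036


def size_corrected_iptm(raw_iptm, len_a, len_b):
    """Raw ipTM minus the baseline expected for a pair of this combined length."""
    return raw_iptm - expected_iptm(len_a, len_b)


# Worked example with made-up numbers: combined length 400, so sqrt(400) = 20.
baseline = expected_iptm(150, 250)               # 0.0044 * 20 - 0.036 = 0.052
corrected = size_corrected_iptm(0.30, 150, 250)  # 0.30 - 0.052 = 0.248
print(round(baseline, 3), round(corrected, 3))
```

A pair scoring right at the baseline would come out at zero after correction, which is why the background is expected to hover there.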

Reference: Todor et al. (2025), "Predicting the protein interactome of Mycoplasma genitalium with pooled AlphaFold3," Molecular Systems Biology.


Exercise 1: Parse AF3 Outputs

Extract pairwise confidence scores from all 9 pool JSON files.

Your data files

Step 1: Download from Google Drive

Open: F037 Student Materials

You'll see a folder containing:

student_materials/
├── protein_info.csv      ← protein metadata (23 KB)
├── README.md             ← data description
└── pools/                ← folder with 9 subfolders
    ├── Pool_A/
    │   └── summary_confidences_0.json
    ├── Pool_B/
    │   └── summary_confidences_0.json
    ... (through Pool_I)

To download: Click the folder name student_materials at the top, then click the ⋮ menu (three dots) → Download. Google Drive will zip everything into one file.

Step 2: Upload to Colab

  1. In Colab, click the folder icon (📁) in the left sidebar
  2. Click the upload icon (⬆️) and upload the zip file
  3. Unzip with: !unzip student_materials.zip

You should now have protein_info.csv and the pools/ folder in your Colab environment.

What's in these files:

File | Description
--- | ---
protein_info.csv | Pool membership, chain indices, protein IDs, lengths, sequences
pools/Pool_X/summary_confidences_0.json | AF3 confidence output for each pool (9 files total)
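A loader for this folder layout might look like the sketch below. The function name `load_pool_summaries` is hypothetical, and the demo builds a throwaway directory with made-up 2×2 matrices so the snippet runs on its own; in your notebook you would point it at your real pools/ folder instead.

```python
import json
import tempfile
from pathlib import Path


def load_pool_summaries(pools_dir):
    """Return {pool_name: parsed JSON} for every Pool_* subfolder."""
    summaries = {}
    for path in sorted(Path(pools_dir).glob("Pool_*/summary_confidences_0.json")):
        with path.open() as handle:
            summaries[path.parent.name] = json.load(handle)
    return summaries


# Demo on a temporary directory with made-up matrices (not real AF3 output).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Pool_A", "Pool_B"):
        pool_dir = Path(tmp) / "pools" / name
        pool_dir.mkdir(parents=True)
        toy = {"chain_pair_iptm": [[0.9, 0.1], [0.1, 0.9]],
               "chain_pair_pae_min": [[0.8, 20.0], [22.0, 0.8]]}
        (pool_dir / "summary_confidences_0.json").write_text(json.dumps(toy))
    demo = load_pool_summaries(Path(tmp) / "pools")

print(sorted(demo))  # ['Pool_A', 'Pool_B']
```

After loading your real data, checking that you got 9 pools and that each matrix is 10×10 is a cheap way to catch a bad path or an incomplete unzip.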

What to extract

From each pool's summary_confidences_0.json, you need:

  • chain_pair_iptm: a 10×10 matrix of ipTM values (one per chain pair)
  • chain_pair_pae_min: a 10×10 matrix of minimum PAE values

For each unique pair (i, j) where i < j, extract the ipTM and take the minimum of the two PAE values, entries [i][j] and [j][i] (the PAE matrix is not symmetric).
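The pair extraction can be sketched as a small generator. The helper name `extract_pairs` is hypothetical, and the 3-chain matrices are made up to keep the example short. The sketch reads ipTM from entry [i][j] only, which assumes the ipTM matrix is symmetric; check that on your own files.

```python
from itertools import combinations


def extract_pairs(summary):
    """Yield (i, j, iptm, min_pae) for each chain pair with i < j."""
    iptm = summary["chain_pair_iptm"]
    pae_min = summary["chain_pair_pae_min"]
    n_chains = len(iptm)
    for i, j in combinations(range(n_chains), 2):
        # PAE is directional, so keep the smaller of the two off-diagonal entries.
        yield i, j, iptm[i][j], min(pae_min[i][j], pae_min[j][i])


# Made-up 3-chain example (real pools have 10 chains, giving 45 pairs).
toy = {
    "chain_pair_iptm": [[0.80, 0.12, 0.55], [0.12, 0.75, 0.09], [0.55, 0.09, 0.82]],
    "chain_pair_pae_min": [[0.8, 18.0, 3.1], [21.5, 0.9, 25.0], [2.4, 24.0, 0.7]],
}
pairs = list(extract_pairs(toy))
print(pairs)
```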

Use protein_info.csv to look up protein_id and length for each chain index in each pool.

You did this in Route 37A — adapt your practice code.

Build your results table

Create a DataFrame with columns: pool, chain_i, chain_j, protein_i, protein_j, len_i, len_j, iptm, min_pae

This is your raw data for size correction and analysis.
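As a rough sketch of the join between pair scores and protein metadata, the snippet below builds one record per pair with the columns listed above. The CSV header names `pool` and `length` are guesses, the protein IDs and scores are made up, and the pool labels may be formatted differently in your file, so check the real header of protein_info.csv first.

```python
import csv
import io

# Toy stand-in for protein_info.csv; the header names here are assumptions.
TOY_CSV = """pool,chain_idx,protein_id,length
Pool_A,0,b0001,120
Pool_A,1,b0002,280
Pool_A,2,b0003,95
"""

# Toy pair scores as (chain_i, chain_j, iptm, min_pae); values are made up.
TOY_PAIRS = {"Pool_A": [(0, 1, 0.12, 18.0), (0, 2, 0.55, 2.4), (1, 2, 0.09, 24.0)]}

# Index the metadata by (pool, chain_idx) for direct lookups.
info = {}
for row in csv.DictReader(io.StringIO(TOY_CSV)):
    info[(row["pool"], int(row["chain_idx"]))] = (row["protein_id"], int(row["length"]))

records = []
for pool, pairs in TOY_PAIRS.items():
    for chain_i, chain_j, iptm, min_pae in pairs:
        protein_i, len_i = info[(pool, chain_i)]
        protein_j, len_j = info[(pool, chain_j)]
        records.append({
            "pool": pool, "chain_i": chain_i, "chain_j": chain_j,
            "protein_i": protein_i, "protein_j": protein_j,
            "len_i": len_i, "len_j": len_j, "iptm": iptm, "min_pae": min_pae,
        })

print(records[1])
```

If you prefer pandas, passing this list of dictionaries to `pandas.DataFrame(records)` gives a DataFrame with the same nine columns.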


Exercise 0: Setup and Context

The biological question

Which proteins in these pools physically interact?

You have AlphaFold3 predictions for 9 pools of E. coli proteins. Each pool contains 10 proteins co-folded together. Some pools contain a genuinely interacting protein pair; others do not.

Your task:

  1. Parse and organize the AF3 confidence scores
  2. Apply size correction to remove length bias
  3. Classify each pool as PPI-positive or PPI-negative
  4. For positive pools, identify the specific interacting pair
  5. Decode the hidden message

Dataset structure

  • 9 pools (Pool_A through Pool_I)
  • 10 proteins per pool → 10-choose-2 = 45 unique pairs per pool
  • 405 total pairwise observations across all 9 pools
  • Some pools contain a true interacting pair; others are noise
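The pair counts above are easy to verify, and the same numbers make a handy sanity check on your final table. A quick sketch:

```python
from itertools import combinations
from math import comb

n_pools = 9
proteins_per_pool = 10

# Unique unordered pairs within one pool: 10 choose 2.
pairs_per_pool = comb(proteins_per_pool, 2)

# The same count, enumerated explicitly as (i, j) with i < j.
index_pairs = list(combinations(range(proteins_per_pool), 2))

total_pairs = n_pools * pairs_per_pool
print(pairs_per_pool, len(index_pairs), total_pairs)  # 45 45 405
```

If your results table ends up with anything other than 405 rows, a pool was probably skipped or pairs were counted in both directions.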

Key metrics

Metric | What it measures
--- | ---
ipTM | Interface predicted TM-score; higher suggests interaction
min_PAE | Minimum predicted aligned error (Å); lower suggests confident geometry
sc_ipTM | Size-corrected ipTM; removes length bias

This route is less scaffolded

You know how to do this from practice routes 37A and 37B. Here you're given the goal and the key formula (Todor et al. size correction). You figure out the implementation.

Suggested chatbot prompt

If you get stuck at any point, try:

"I'm working on CHEM 169 Final Route F037. I have 9 pooled AF3 predictions with 10 E. coli proteins per pool. I need to parse confidence scores, apply size correction, and figure out which pools contain real protein-protein interactions. Help me [specific task]."


Intro

In Routes 37A and 37B, you learned:

  • Pairwise AF3 gives one score per protein pair
  • Pooled AF3 reduces false positives through competition
  • Size correction (Todor et al.) removes bias from protein length
  • ipTM and PAE are imperfect but informative metrics

Now you apply all of it to classify 9 pools of E. coli proteins — with less hand-holding.

Why precomputed data?

Running AlphaFold3 on the server can be slow (hours per pool), so we ran the predictions ahead of time. You're working with the output files from those runs — the same JSON files you'd get if you ran AF3 yourself.

What's different from practice

Practice | Final
--- | ---
Step-by-step starter code | Goal-oriented
Small control set (12 pairs) | Full analysis (405 pairs)
Interpretation questions | Escape room puzzle
Told which pairs are positive/negative | You discover the signal

The escape room element

The proteins themselves encode a hidden message in their amino acid sequences. Only correct identification of the interacting pairs AND correct ranking by sc_ipTM will let you extract the right amino acids. If your analysis has errors, you'll get gibberish instead of a recognizable phrase.


Final Exam Route 037: Pooled AF3 PPI Analysis

  • RouteID: F037
  • Wall: Protein-Protein Interactions (W10)
  • Grade: 5.11b
  • Routesetter: Course Staff + Sarah V.
  • Time: 1.5 hours in class + finish by end of day
  • You'll need: AF3 output files, Python, scipy, plotting libraries

🧗 Base Camp

Start here and climb your way up!