Submission

Submit your notebook here

Deliverable

Submit one notebook that includes:

Parsed AF3 output scores (ipTM, min_PAE) for all 9 pools (405 pairwise observations)
Size-corrected ipTM values computed with the Todor et al. formula
Pool classification table: which pools contain a PPI?
For positive pools: the specific interacting protein pair
The decoded secret message
A brief interpretation of your analysis approach

Mission checklist

Loaded protein_info.csv and all 9 pool JSON files
Extracted chain_pair_iptm and chain_pair_pae_min from each pool
Computed size-corrected ipTM using Todor et al. formula
Classified each pool as PPI-positive or PPI-negative with justification
Identified the interacting pair in each positive pool
Decoded the hidden message by extracting amino acids from protein sequences
Wrote brief interpretation

Exercise 4: Decode the Message

If your analysis is correct, the positive pools reveal a hidden message — encoded in the actual protein sequences.

The encoding

The secret message is hidden in the amino acid sequences of the interacting proteins. For each positive pool, you'll extract two letters (one from each protein in the interacting pair).

Step 1: Rank your positive pools from highest to lowest sc_ipTM, using the sc_ipTM of each pool's interacting pair

Step 2: For each positive pool, identify the interacting pair (protein_i and protein_j)

Step 3: Look up each protein's sequence in protein_info.csv

Step 4: Extract amino acids at these positions:

Rank	From protein_i	Position	From protein_j	Position
1 (highest sc_ipTM)	✓	35	✓	8
2	✓	31	✓	4
3	✓	55	✓	10
4	✓	6	✓	114
5 (lowest sc_ipTM)	✓	5	✓	16

Which protein is protein_i vs protein_j?

When you built your DataFrame, you extracted pairs with chain_i < chain_j. This means:

protein_i = the protein at the lower chain index (e.g., chain 0)
protein_j = the protein at the higher chain index (e.g., chain 1)

Sanity check: In protein_info.csv, look up both proteins in your interacting pair. The one with the lower chain_idx is protein_i; the one with the higher chain_idx is protein_j.

Verification example: If one of your positive pools has interacting pair b2283 + b2286, check protein_info.csv:

b2283 has chain_idx = 0 → this is protein_i → extract the amino acid at the position listed under "From protein_i"
b2286 has chain_idx = 1 → this is protein_j → extract the amino acid at the position listed under "From protein_j"

If you get the proteins swapped, your message will be gibberish — so double-check!
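The ordering rule can be checked in a few lines. This is a minimal sketch using the chain indices from the verification example above; the helper name `order_pair` and the `chain_idx` dictionary are illustrative, and in practice the indices would come from protein_info.csv.

```python
def order_pair(protein_a, protein_b, chain_idx):
    """Return (protein_i, protein_j), with protein_i at the lower chain index."""
    return tuple(sorted((protein_a, protein_b), key=lambda p: chain_idx[p]))

# Chain indices taken from the verification example (b2283 -> 0, b2286 -> 1)
chain_idx = {"b2283": 0, "b2286": 1}

# The input order does not matter; the lower chain index always comes first
ordered = order_pair("b2286", "b2283", chain_idx)
protein_i, protein_j = ordered
```

Ordering by chain index rather than by protein ID keeps the pair consistent with the chain_i < chain_j convention used when the DataFrame was built.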

Step 5: Concatenate all 10 amino acids (2 per pool × 5 pools) to reveal the message

Example

If your rank-1 pool's interacting pair is b1234 + b5678:

Look up b1234's sequence → extract the amino acid at position 35
Look up b5678's sequence → extract the amino acid at position 8
These two letters are the first two characters of your message

Repeat for all 5 ranked pools. The 10 amino acids spell a recognizable phrase.
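The extraction steps can be sketched as below. The sequences and positions here are made up for illustration and are not the real data. The sketch assumes the positions in the table are 1-based (residue 1 is the first amino acid), which the text does not state explicitly; if they turn out to be 0-based, drop the `- 1`. The function name `decode_message` is hypothetical.

```python
def decode_message(ranked_pairs, positions):
    """Concatenate one residue from each protein of each ranked pair.

    ranked_pairs: list of (sequence_i, sequence_j), ordered by rank
    positions:    list of (position_i, position_j), in the same order
    """
    letters = []
    for (seq_i, seq_j), (pos_i, pos_j) in zip(ranked_pairs, positions):
        letters.append(seq_i[pos_i - 1])  # ASSUMPTION: positions are 1-based
        letters.append(seq_j[pos_j - 1])
    return "".join(letters)

# Toy input: two "pools" with invented sequences and invented positions
toy_pairs = [("MAHK", "GEL"), ("LLK", "AAP")]
toy_positions = [(3, 2), (1, 3)]
toy_message = decode_message(toy_pairs, toy_positions)
```

With the real data, `ranked_pairs` would hold the five sequence pairs from protein_info.csv in rank order and `positions` the five position pairs from the table in Step 4.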

Your task

What is the secret message?

Write it in a markdown cell. If it doesn't spell anything recognizable, double-check:

Did you identify the correct interacting pairs?
Did you rank the pools correctly by sc_ipTM?
Did you extract from the right positions?

Exercise 3: Classify Pools and Identify Pairs

Now use your size-corrected scores to classify each pool.

Part 1 — Pool Classification

For each of the 9 pools (A through I), determine:

Does this pool contain a protein-protein interaction?
What quantitative evidence supports your classification?

Hint: Pool together all 405 sc_ipTM values. The vast majority are non-interacting pairs — this gives you your background distribution. Look for outliers that clearly separate from the background.
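One possible way to quantify "clearly separate from the background" is a robust z-score built from the median and the median absolute deviation (MAD), which the outliers themselves barely influence. The assignment does not prescribe this method; it is a sketch of one option, and the toy values and the cutoff below are hypothetical.

```python
import statistics

def robust_z_scores(values):
    """Distance of each value from the median, in units of scaled MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the background has no spread to scale by")
    scale = 1.4826 * mad  # rescales MAD to match a standard deviation for normal data
    return [(v - med) / scale for v in values]

# Toy data: ten background-like values plus one clear outlier (hypothetical numbers)
toy_sc_iptm = [-0.03, -0.01, 0.00, 0.01, 0.02, -0.02, 0.03, 0.00, 0.01, -0.01, 0.45]
z = robust_z_scores(toy_sc_iptm)

CUTOFF = 5.0  # arbitrary illustrative threshold; choose and justify your own
flagged = [i for i, score in enumerate(z) if score > CUTOFF]
```

A histogram of all 405 values is a useful companion to any numeric cutoff, since it shows whether the candidates really sit apart from the bulk of the distribution.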

Deliverable: A table with columns:

Pool	Contains PPI?	Top sc_ipTM	Top min_PAE	Justification

Part 2 — Pair Identification

For each pool you classified as PPI-positive:

Which specific pair of proteins is interacting?
What are their protein_ids (e.g., "b1234 + b5678")?
What are the sc_ipTM and min_PAE values for this pair?

Deliverable: A table for each positive pool:

Pool	Interacting Pair	sc_ipTM	min_PAE
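To fill in the tables, it helps to pull out the highest-scoring pair in each pool. The sketch below works on plain dictionaries; the column name `sc_iptm` and all values are assumptions made for illustration.

```python
def top_pair_per_pool(rows):
    """Return, for each pool, the row with the highest sc_iptm."""
    best = {}
    for row in rows:
        pool = row["pool"]
        if pool not in best or row["sc_iptm"] > best[pool]["sc_iptm"]:
            best[pool] = row
    return best

# Toy rows with hypothetical IDs and scores
toy_rows = [
    {"pool": "Pool_A", "protein_i": "b1234", "protein_j": "b5678", "sc_iptm": 0.41, "min_pae": 2.3},
    {"pool": "Pool_A", "protein_i": "b1234", "protein_j": "b9012", "sc_iptm": 0.02, "min_pae": 17.8},
    {"pool": "Pool_B", "protein_i": "b1111", "protein_j": "b2222", "sc_iptm": -0.01, "min_pae": 21.0},
    {"pool": "Pool_B", "protein_i": "b1111", "protein_j": "b3333", "sc_iptm": 0.03, "min_pae": 19.5},
]
best = top_pair_per_pool(toy_rows)
```

Every pool has a top pair, so this step only nominates candidates; whether a candidate counts as an interaction is still the classification decision from Part 1. With a pandas DataFrame `df`, the equivalent selection is `df.loc[df.groupby("pool")["sc_iptm"].idxmax()]`.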

Decision guidance

Look at the distribution of your sc_ipTM values:

What stands out from the background?
How does min_PAE complement sc_ipTM?

True interacting pairs should stand out clearly from the background in at least one metric (ideally both).

Exercise 2: Size Correction

Bigger proteins tend to get higher ipTM scores just because they're bigger — not because they actually interact. This is a known bias in AlphaFold3 (see Todor et al., 2025, Figure 2B).

The fix is simple: subtract the expected ipTM based on protein size. What's left — the size-corrected ipTM — tells you how much higher a pair scores than you'd expect for proteins of that size.

The formula

Apply this correction from Todor et al. to all 405 pairwise observations:

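import numpy as np

# len_a and len_b are the lengths of the two proteins in the pair
# (len_i and len_j in your results table)
size_corrected_iptm = raw_iptm - (0.0044 * np.sqrt(len_a + len_b) - 0.036)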

After correction, background (non-interacting) pairs should hover around zero. Real interactions will stand out as clear positive outliers.
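A worked example makes the arithmetic concrete. The sketch below wraps the formula above in a function and evaluates it for a hypothetical pair with a raw ipTM of 0.50 and a combined length of 400 residues. It uses `math.sqrt` to stay dependency-free; on whole DataFrame columns, `np.sqrt` does the same thing elementwise.

```python
import math

def size_corrected_iptm(raw_iptm, len_a, len_b):
    """Subtract the size-expected ipTM, using the coefficients from the formula above."""
    expected = 0.0044 * math.sqrt(len_a + len_b) - 0.036
    return raw_iptm - expected

# Hypothetical pair: lengths 100 and 300, so sqrt(400) = 20
expected_400 = 0.0044 * math.sqrt(400) - 0.036  # 0.088 - 0.036 = 0.052
sc = size_corrected_iptm(0.50, 100, 300)        # 0.50 - 0.052 = 0.448
```

The expected term grows with combined length, so the same raw ipTM yields a smaller corrected score for a larger pair.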

Reference: Todor et al. (2025), "Predicting the protein interactome of Mycoplasma genitalium with pooled AlphaFold3," Molecular Systems Biology.

Exercise 1: Parse AF3 Outputs

Extract pairwise confidence scores from all 9 pool JSON files.

Your data files

Step 1: Download from Google Drive

Open: F037 Student Materials

You'll see a folder containing:

student_materials/
├── protein_info.csv      ← protein metadata (23 KB)
├── README.md             ← data description
└── pools/                ← folder with 9 subfolders
    ├── Pool_A/
    │   └── summary_confidences_0.json
    ├── Pool_B/
    │   └── summary_confidences_0.json
    ... (through Pool_I)

To download: Click the folder name student_materials at the top, then click the ⋮ menu (three dots) → Download. Google Drive will zip everything into one file.

Step 2: Upload to Colab

In Colab, click the folder icon (📁) in the left sidebar
Click the upload icon (⬆️) and upload the zip file
Unzip with: !unzip student_materials.zip

You should now have protein_info.csv and the pools/ folder in your Colab environment.

What's in these files:

File	Description
`protein_info.csv`	Pool membership, chain indices, protein IDs, lengths, sequences
`pools/Pool_X/summary_confidences_0.json`	AF3 confidence output for each pool (9 files total)

What to extract

From each pool's summary_confidences_0.json, you need:

chain_pair_iptm — a 10×10 matrix of ipTM values (one per chain pair)
chain_pair_pae_min — a 10×10 matrix of minimum PAE values

For each unique pair (i, j) where i < j, extract the ipTM and take the minimum of the two PAE values, entry [i][j] and entry [j][i] (the PAE matrix is not symmetric).
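The pair extraction can be sketched as follows, assuming the JSON file has already been read into a dictionary with `json.load` and that both matrices are stored as lists of lists. The function name `extract_pairs` and the toy 3×3 matrices are illustrative and are not real AF3 output.

```python
from itertools import combinations

def extract_pairs(summary):
    """Return one record per unique chain pair (i < j) from one pool's summary dict."""
    iptm = summary["chain_pair_iptm"]
    pae = summary["chain_pair_pae_min"]
    n_chains = len(iptm)
    records = []
    for i, j in combinations(range(n_chains), 2):  # yields only pairs with i < j
        records.append({
            "chain_i": i,
            "chain_j": j,
            "iptm": iptm[i][j],
            # PAE is directional, so keep the smaller of the two off-diagonal entries
            "min_pae": min(pae[i][j], pae[j][i]),
        })
    return records

# Toy 3-chain example with hypothetical values
toy_summary = {
    "chain_pair_iptm": [[0.9, 0.2, 0.7], [0.2, 0.9, 0.1], [0.7, 0.1, 0.9]],
    "chain_pair_pae_min": [[0.8, 12.0, 3.5], [15.0, 0.8, 20.0], [2.1, 18.0, 0.8]],
}
toy_records = extract_pairs(toy_summary)
```

On a real pool with 10 chains, the same loop produces 45 records.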

Use protein_info.csv to look up protein_id and length for each chain index in each pool.

You did this in Route 37A — adapt your practice code.

Build your results table

Create a DataFrame with columns: pool, chain_i, chain_j, protein_i, protein_j, len_i, len_j, iptm, min_pae

This is your raw data for size correction and analysis.
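One way to attach protein IDs and lengths to each pair is a lookup keyed on (pool, chain index). The sketch below reads a toy CSV from memory. The column names `pool` and `length` are guesses and `chain_idx` and `protein_id` follow the names used elsewhere in this route; check README.md for the actual header before adapting it.

```python
import csv
import io

# Toy stand-in for protein_info.csv; column names and values are assumptions
toy_csv = """pool,chain_idx,protein_id,length
Pool_A,0,b1234,120
Pool_A,1,b5678,300
Pool_A,2,b9012,85
"""

def load_protein_lookup(handle):
    """Map (pool, chain_idx) -> (protein_id, length)."""
    lookup = {}
    for row in csv.DictReader(handle):
        key = (row["pool"], int(row["chain_idx"]))
        lookup[key] = (row["protein_id"], int(row["length"]))
    return lookup

def annotate_pairs(pool, pair_records, lookup):
    """Add protein IDs and lengths to the per-pair records of one pool."""
    rows = []
    for rec in pair_records:
        id_i, len_i = lookup[(pool, rec["chain_i"])]
        id_j, len_j = lookup[(pool, rec["chain_j"])]
        rows.append({
            "pool": pool,
            "chain_i": rec["chain_i"], "chain_j": rec["chain_j"],
            "protein_i": id_i, "protein_j": id_j,
            "len_i": len_i, "len_j": len_j,
            "iptm": rec["iptm"], "min_pae": rec["min_pae"],
        })
    return rows

lookup = load_protein_lookup(io.StringIO(toy_csv))
toy_pair_records = [{"chain_i": 0, "chain_j": 2, "iptm": 0.7, "min_pae": 2.1}]
table_rows = annotate_pairs("Pool_A", toy_pair_records, lookup)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(table_rows)` to get the columns listed above.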

Exercise 0: Setup and Context

The biological question

Which proteins in these pools physically interact?

You have AlphaFold3 predictions for 9 pools of E. coli proteins. Each pool contains 10 proteins co-folded together. Some pools contain a genuinely interacting protein pair; others do not.

Your task:

Parse and organize the AF3 confidence scores
Apply size correction to remove length bias
Classify each pool as PPI-positive or PPI-negative
For positive pools, identify the specific interacting pair
Decode the hidden message

Dataset structure

9 pools (Pool_A through Pool_I)
10 proteins per pool → 10-choose-2 = 45 unique pairs per pool
405 total pairwise observations across all 9 pools
Some pools contain a true interacting pair; others are noise
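The pair counts above are easy to verify, and the same enumeration gives the (i, j) index pairs you will loop over later. A quick check using only the standard library:

```python
from itertools import combinations
from math import comb

N_POOLS = 9
PROTEINS_PER_POOL = 10

pairs_per_pool = comb(PROTEINS_PER_POOL, 2)   # 10-choose-2
total_pairs = N_POOLS * pairs_per_pool

# Explicit enumeration of chain-index pairs with i < j
index_pairs = list(combinations(range(PROTEINS_PER_POOL), 2))
```

If your final DataFrame does not have `total_pairs` rows, a pool or a pair was dropped or duplicated somewhere during parsing.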

Key metrics

Metric	What it measures
`ipTM`	Interface predicted TM-score — higher suggests interaction
`min_PAE`	Minimum predicted aligned error (Å) — lower suggests confident geometry
`sc_ipTM`	Size-corrected ipTM — removes length bias

This route is less scaffolded

You know how to do this from practice routes 37A and 37B. Here you're given the goal and the key formula (Todor et al. size correction). You figure out the implementation.

Suggested chatbot prompt

If you get stuck at any point, try:

"I'm working on CHEM 169 Final Route F037. I have 9 pooled AF3 predictions with 10 E. coli proteins per pool. I need to parse confidence scores, apply size correction, and figure out which pools contain real protein-protein interactions. Help me [specific task]."

Intro

In Routes 37A and 37B, you learned:

Pairwise AF3 gives one score per protein pair
Pooled AF3 reduces false positives through competition
Size correction (Todor et al.) removes bias from protein length
ipTM and PAE are imperfect but informative metrics

Now you apply all of it to classify 9 pools of E. coli proteins — with less hand-holding.

Why precomputed data?

Running AlphaFold3 on the server can be slow (hours per pool), so we ran the predictions ahead of time. You're working with the output files from those runs — the same JSON files you'd get if you ran AF3 yourself.

What's different from practice

Practice	Final
Step-by-step starter code	Goal-oriented
Small control set (12 pairs)	Full analysis (405 pairs)
Interpretation questions	Escape room puzzle
Told which pairs are positive/negative	You discover the signal

The escape room element

The proteins themselves encode a hidden message in their amino acid sequences. Only correct identification of the interacting pairs AND correct ranking by sc_ipTM will let you extract the right amino acids. If your analysis has errors, you'll get gibberish instead of a recognizable phrase.

Final Exam Route 037: Pooled AF3 PPI Analysis

RouteID: F037
Wall: Protein-Protein Interactions (W10)
Grade: 5.11b
Routesetter: Course Staff + Sarah V.
Time: 1.5 hours in class + finish by end of day
You'll need: AF3 output files, Python, scipy, plotting libraries