🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Label leakage demo with accuracy (using 21 features)
- Similarity matrix and clustering code (using PLM embeddings)
- Both random and clustered splits with accuracies
- Reflection answers in markdown cells
Exercise 4: Reflection
Goal: Cement your understanding of data leakage.
Answer in your notebook (2-3 sentences each):
- In your own words, what is data leakage?
- Why is homology leakage particularly sneaky? (Hint: the model IS learning something real...)
- You're reviewing a paper that claims 95% accuracy on protein function prediction. What questions would you ask about their train/test split?
- When might a random split actually be okay?
Exercise 3: Clustered Splits (with PLM Embeddings)
Goal: Use PLM embeddings to cluster proteins and create proper train/test splits.
Now you'll fix the homology leakage problem - but with much better features.
Step 1: Load PLM embeddings
Download the Mtb proteome embeddings: mtb_full.h5.gz
Unzip with !gunzip mtb_full.h5.gz, then load using h5py. Each protein ID is a key, and the embedding (1024 features) is the value.
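A minimal sketch of the h5py access pattern, using a tiny demo file written on the spot (so it runs anywhere); reading `mtb_full.h5` works the same way, with real protein IDs as keys:

```python
import h5py
import numpy as np

# Demo: write a tiny h5 file in the same layout (protein ID -> 1024-dim vector),
# then read it back the way you would read mtb_full.h5.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("Rv0001", data=np.random.rand(1024))
    f.create_dataset("Rv0002", data=np.random.rand(1024))

with h5py.File("demo.h5", "r") as f:
    ids = list(f.keys())        # protein IDs are the keys
    emb = np.array(f[ids[0]])   # each value is a 1024-feature embedding
print(len(ids), emb.shape)
```

The demo protein IDs here are placeholders, not real entries from the file.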
Step 1.5: Align embeddings with your labeled proteins
The h5 file contains embeddings for the entire Mtb proteome (~4,000 proteins), but your localization dataset only has labels for a subset. You need to:
- Get the protein IDs from your localization dataframe (from Exercise 0)
- Filter the h5 embeddings to only those proteins
- Make sure the order matches between your embedding matrix X and your labels y
Hint: Loop through your dataframe's protein IDs and pull the corresponding embedding from the h5 file. If a protein ID isn't in the h5 file, you'll need to skip it (and remove it from y too).
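One way to sketch that loop, using a plain dict to stand in for the open h5py file (h5py `File` objects support the same `in` and `[...]` access), with made-up protein IDs and labels:

```python
import numpy as np

# Stand-in for the h5 file: protein ID -> 1024-dim embedding.
embeddings_h5 = {
    "Rv0001": np.random.rand(1024),
    "Rv0002": np.random.rand(1024),
    "Rv0004": np.random.rand(1024),
}

# Stand-in for the localization dataframe rows: (protein_id, label).
labeled = [("Rv0001", "cytoplasm"), ("Rv0002", "membrane"), ("Rv0003", "secreted")]

X_rows, y = [], []
for protein_id, label in labeled:
    if protein_id not in embeddings_h5:   # no embedding -> skip it and its label
        continue
    X_rows.append(np.asarray(embeddings_h5[protein_id]))
    y.append(label)

X = np.vstack(X_rows)
print(X.shape, len(y))   # rows of X and entries of y stay aligned
```

Because the embedding and the label are appended in the same loop iteration, row i of X always corresponds to entry i of y.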
Welcome to real-world ML. Notice how we can't just load two files and go? One dataset has embeddings for everything, another has labels for a subset. Aligning data from different sources - matching IDs, handling missing values, ensuring consistent ordering - is easily 50% of any ML project. Get comfortable with it now.
Step 2: Compute pairwise similarity
Use sklearn.metrics.pairwise.cosine_similarity to compute similarity between all pairs of protein embeddings. This gives you an n × n matrix.
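A quick sketch with random vectors standing in for real PLM embeddings:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1024))   # 5 toy "embeddings" in place of real PLM vectors

S = cosine_similarity(X)   # n x n matrix; S[i, j] is the similarity of proteins i and j
print(S.shape)             # (5, 5); each diagonal entry is 1.0 (a protein vs. itself)
```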
Step 3: Cluster by similarity
Pick a similarity threshold (start with 0.8). Two proteins are "connected" if their similarity exceeds the threshold. Use scipy.sparse.csgraph.connected_components to find clusters.
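The thresholding-plus-components step might look like this on a hand-made 4-protein similarity matrix (proteins 0/1 are one family, 2/3 another):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy similarity matrix: proteins 0 and 1 are similar, as are 2 and 3.
S = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.85],
    [0.2, 0.1, 0.85, 1.0],
])

threshold = 0.8
adjacency = csr_matrix(S > threshold)   # edge wherever similarity exceeds threshold
n_clusters, cluster_ids = connected_components(adjacency, directed=False)
print(n_clusters, cluster_ids)   # 2 clusters; cluster_ids gives one ID per protein
```

Note that connected components are transitive: if A~B and B~C, all three land in one cluster even if A and C are below the threshold. That's conservative, which is what you want here.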
Step 4: Clustered split
Instead of splitting proteins randomly, split the cluster IDs randomly. All proteins in a cluster go to the same set (train or test).
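A sketch of that split, assuming a `cluster_ids` array from the previous step (toy values here) and a 20% test fraction:

```python
import numpy as np

rng = np.random.default_rng(42)
cluster_ids = np.array([0, 0, 1, 1, 2, 2, 3])   # one cluster ID per protein (from step 3)

unique_clusters = np.unique(cluster_ids)
shuffled = rng.permutation(unique_clusters)
n_test = max(1, int(0.2 * len(unique_clusters)))
test_clusters = set(shuffled[:n_test])

# Boolean masks over proteins: each cluster lands entirely on one side.
test_mask = np.isin(cluster_ids, list(test_clusters))
train_mask = ~test_mask
print(train_mask.sum(), test_mask.sum())
```

Because clusters vary in size, the actual train/test protein ratio will wobble around the target fraction; that's normal.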
Step 5: Compare
- Train a classifier on the PLM embeddings using your clustered split. Record accuracy.
- Train again using a standard random split (ignoring clusters). Record accuracy.
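The two comparisons can be sketched on toy data (synthetic embeddings and random labels here, so these particular numbers are meaningless; with real homologous proteins, the random-split score is typically the inflated one):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-ins: 100 "embeddings", binary labels, 20 clusters of 5 proteins each.
X = rng.normal(size=(100, 16))
y = rng.integers(0, 2, size=100)
cluster_ids = np.repeat(np.arange(20), 5)

# Random split (ignores clusters).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
acc_random = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Clustered split: hold out whole clusters.
test_clusters = set(rng.permutation(20)[:4])
test_mask = np.isin(cluster_ids, list(test_clusters))
acc_clustered = (
    LogisticRegression()
    .fit(X[~test_mask], y[~test_mask])
    .score(X[test_mask], y[test_mask])
)
print(acc_random, acc_clustered)
```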
What you might see: The clustered split often gives lower accuracy. That's not bad - it's honest. The random split accuracy was inflated by homology leakage.
Success check:
- You loaded PLM embeddings and aligned them with your labeled proteins
- Your embedding matrix X and label vector y have the same number of rows
- You computed the similarity matrix and clustered the proteins
- You have accuracy for both split methods
Exercise 2: The Real-World Problem (Homology Leakage)
Goal: Understand why random splits fail for protein data.
Here's something that actually happens in protein ML:
Proteins in the same family are similar - they share evolutionary history, similar sequences, and often similar functions. If you do a random train/test split, related proteins end up on both sides. Your model learns to recognize protein families, not to generalize.
Example: Imagine training a model to predict if a protein is an enzyme. If kinase A is in training and its close homolog kinase B is in test, the model might get kinase B right just because it's similar to kinase A - not because it learned what makes something an enzyme.
This is why many protein ML papers have inflated accuracy that doesn't hold up on truly novel proteins.
Questions to consider:
- If two proteins share 90% sequence identity, should they be in different splits?
- What about 50% identity? 30%?
- How would you even know which proteins are similar?
Exercise 1: The Extreme Example (Label Leakage)
Goal: See what happens when you accidentally include the answer in your features.
This is a teaching example - you wouldn't actually do this in practice. But it makes the concept of leakage crystal clear.
Using your 21 hand-crafted features from Exercise 0:
- Encode your location labels as numbers (e.g., cytoplasm=0, membrane=1, secreted=2)
- Add this encoded column as a 22nd feature
- Split, train a model, and check accuracy
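The whole demo fits in a few lines. This sketch uses synthetic features in place of your 21 real ones (the point survives either way, since the leaked column does all the work):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 21))      # stand-in for the 21 hand-crafted features
y = rng.integers(0, 3, size=n)    # encoded labels: cytoplasm=0, membrane=1, secreted=2

X_leaky = np.hstack([X, y.reshape(-1, 1)])   # the label itself as a 22nd feature!

X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, test_size=0.25, random_state=0)
acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(acc)   # near-perfect: the tree just reads off the last column
```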
What you should see: Near-perfect accuracy (95-100%). The model just learned "if feature 22 equals 1, predict membrane."
Why this matters: This is an extreme case, but subtler versions happen. Any feature that's derived from or strongly correlated with the label can cause leakage.
Success check:
- You see suspiciously high accuracy
- You understand why this is useless for real predictions
Exercise 0: Setup
Goal: Prepare your data with hand-crafted features.
- Load your cleaned Mtb localization dataset (in case you need it: mtb_with_localization.xlsx)
- Build your 21 hand-crafted features from R017: sequence length + 20 amino acid counts
- Create your label vector y with cleaned localization labels
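The 21-feature construction (length + 20 amino acid counts) can be sketched like this; `featurize` and the toy sequences are illustrative, not from the dataset:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues

def featurize(sequence):
    """Sequence length plus 20 amino acid counts -> 21 features."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    return np.array([len(sequence)] + counts)

# Toy sequences standing in for the Mtb dataset.
seqs = ["MKTAYIAKQR", "MGGLLAAW"]
X = np.vstack([featurize(s) for s in seqs])
print(X.shape)   # (2, 21)
```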
We'll use these simple features first to demonstrate label leakage clearly, then switch to PLM embeddings in Exercise 3.
Success check:
- X has shape (n_proteins, 21)
- y has your cleaned location labels
Why this route exists
You've trained models. You've compared their accuracy. But here's a dirty secret: accuracy is easy to fake.
If information about your test set leaks into your training process, your model will look amazing on paper but fail on truly new data. This is called data leakage, and it's one of the most common reasons ML projects fail in the real world.
In this route, you'll see two types of leakage:
- An extreme teaching example that makes the concept crystal clear
- A real-world example from protein ML that actually bites researchers
And you'll learn to fix it by computing protein similarity and clustering with PLM embeddings.
What you'll be able to do after this route
By the end, you can:
- Explain what data leakage is and why it's dangerous
- Recognize common sources of leakage in ML pipelines
- Compute protein similarity from PLM embeddings
- Cluster proteins and create proper train/test splits
Key definitions
Data leakage: When information from outside the training set influences model training or evaluation. Makes the model appear better than it actually is.
Label leakage: When the label (or information derived from it) accidentally appears in the features. Extreme and obvious, but useful for understanding the concept.
Homology leakage: When similar proteins (homologs) appear in both train and test sets. The model memorizes protein families instead of learning generalizable patterns. This is a real problem in protein ML.
Route 019: The Leaky Pipeline
- RouteID: 019
- Wall: The Machine Learning Offwidth (W06)
- Grade: 5.8
- Routesetter: Adrian
- Time: ~35 minutes
- Dataset: Mtb localization data + PLM embeddings
🧗 Base Camp
Start here and climb your way up!