🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverable
Submit one notebook that includes:
- ESM-2 residue-level embeddings for all 18 proteins
- Pairwise inter-protein similarity matrix (18x18 heatmap)
- Identification of the protein cluster (which proteins group together?)
- Full inter-protein heatmaps for 2-3 top pairs from the cluster
- Identification of the shared domain
- Brief biological interpretation
Mission checklist
- Generated ESM-2 embeddings for all 18 proteins
- Computed pairwise inter-protein similarity
- Built and plotted similarity heatmap
- Identified the tight cluster of proteins
- Generated full inter-protein heatmaps for top cluster pairs
- Named the shared domain and answered verification questions
- Wrote biological interpretation
Exercise 4: Name the Domain
You've found a cluster. You've looked up what those proteins have in common. Now prove you understand what you found.
The verification
Answer these three questions in a markdown cell:
- What is the name of the shared structural domain that unites the clustered proteins?
- What is the approximate size (in residues) of this domain?
- Name one biological function this domain is known for.
Students who did the analysis correctly and looked up their clustered proteins will get this right. Students who didn't will guess wrong.
Exercise 3: Investigate the Cluster
This is where the science gets exciting — you've found a pattern in the data, now explain it.
Part 1 — Domain hypothesis
Look up each protein in your cluster:
- What domains do they contain?
- Is there a shared structural element?
- What biological function might drive the similarity?
Tools you can use:
- UniProt (uniprot.org) — search by ID, check "Family & Domains"
- InterPro (ebi.ac.uk/interpro) — domain annotations
- Your chatbot — ask it to help interpret domain annotations
Deliverable: A markdown cell stating:
- What domain/motif the clustered proteins share
- Where in the sequence this domain is located
- Why ESM-2 would detect this similarity
Part 2 — Full inter-protein heatmaps
For the top 2-3 pairs from your cluster (highest pairwise similarity):
- Generate the full window-vs-window cosine similarity matrix
- Plot as a heatmap
- Mark where the high-similarity hot spots occur
- Do the hot spot positions match the shared domain locations?
You built this in Route 36A/36B — adapt your practice code.
Deliverable: 2-3 inter-protein heatmaps with hot spots annotated.
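One way to locate hot spots before annotating them is to pull out the highest-scoring cells of the window-vs-window matrix. The sketch below is a minimal illustration, not the Route 36A/36B code: `find_hot_spots` and `n_top` are hypothetical names, and it assumes the matrix is a 2D array with rows indexing windows of protein A and columns indexing windows of protein B.

```python
import numpy as np

def find_hot_spots(sim_matrix, n_top=10):
    """Return (row, col, value) for the n_top highest cells, best first.

    Hypothetical helper: rows are windows of protein A, columns are
    windows of protein B.
    """
    sims = np.asarray(sim_matrix, dtype=float)
    n_top = min(n_top, sims.size)
    flat = sims.ravel()
    # argpartition places the indices of the n_top largest values last
    top_idx = np.argpartition(flat, flat.size - n_top)[-n_top:]
    # Sort those indices by similarity, highest first
    top_idx = top_idx[np.argsort(flat[top_idx])[::-1]]
    rows, cols = np.unravel_index(top_idx, sims.shape)
    return [(int(r), int(c), float(sims[r, c])) for r, c in zip(rows, cols)]

# Toy example: 2 windows in protein A, 3 windows in protein B
demo = np.array([[0.1, 0.9, 0.2],
                 [0.3, 0.4, 0.8]])
hot_spots = find_hot_spots(demo, n_top=2)
```

The returned indices are window indices, not residue positions. Converting them to residue coordinates depends on the window size and stride you used, so apply that conversion before comparing against the domain boundaries from UniProt or InterPro.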
Exercise 2: Build the Similarity Matrix
Compute pairwise inter-protein similarity for all protein pairs.
The metric
For each pair of proteins (A, B):
- Compute all window embeddings for both proteins (you know how to do this)
- Calculate cosine similarity between all window pairs
- Summarize with the top-10 mean — the average of the 10 highest similarities
This gives you one number per protein pair. You'll have N×(N-1)/2 pairs total (153 pairs for N = 18).
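The three steps above can be sketched for a single protein pair as follows. This is a minimal NumPy-only illustration, not the course's reference solution: `top_k_mean_similarity` is a hypothetical name, and it assumes each protein's window embeddings are already a 2D array of shape (n_windows, embedding_dim) with no all-zero rows.

```python
import numpy as np

def top_k_mean_similarity(windows_a, windows_b, k=10):
    """Mean of the k highest cosine similarities between two window sets.

    Hypothetical helper: inputs are (n_windows, dim) arrays.
    """
    a = np.asarray(windows_a, dtype=float)
    b = np.asarray(windows_b, dtype=float)
    # L2-normalize rows so a plain dot product equals cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = (a @ b.T).ravel()  # every window of A against every window of B
    k = min(k, sims.size)     # short proteins may have fewer than k pairs
    # np.partition moves the k largest values into the last k slots
    top = np.partition(sims, sims.size - k)[-k:]
    return float(top.mean())
```

Using the top-10 mean rather than the overall mean keeps a short region of strong similarity from being diluted by the many unrelated window pairs around it.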
Your task
Build an N×N similarity matrix and plot it as a heatmap.
You have the building blocks from practice:
- get_window_embeddings() from Route 36A
- cosine_similarity() from sklearn
Now combine them to build the full pairwise matrix. Use your chatbot if you get stuck, but try it yourself first.
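If the looping structure is the sticking point, here is one possible skeleton for filling the matrix. It is a sketch under assumptions: `build_similarity_matrix` and `score_fn` are hypothetical names, `window_embeddings` is assumed to be a dictionary mapping protein ID to that protein's window embeddings, and the diagonal is set to 1.0 by convention rather than computed.

```python
from itertools import combinations

import numpy as np

def build_similarity_matrix(window_embeddings, score_fn):
    """Fill a symmetric N x N matrix with one score per protein pair.

    Hypothetical helper: score_fn maps two window-embedding arrays to
    a single number (for example, your top-10 mean).
    """
    proteins = list(window_embeddings)
    n = len(proteins)
    matrix = np.eye(n)  # assumption: self-similarity is fixed at 1.0
    # combinations() visits each unordered pair once: N*(N-1)/2 calls
    for i, j in combinations(range(n), 2):
        score = score_fn(window_embeddings[proteins[i]],
                         window_embeddings[proteins[j]])
        matrix[i, j] = matrix[j, i] = score
    return proteins, matrix
```

Returning the ID list alongside the matrix keeps the row and column labels in the same order as the data, which matters once you start reordering.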
Pro tip: Reorder with hierarchical clustering
A raw heatmap shows proteins in arbitrary order — any cluster will be scattered and hard to see. Reorder the matrix using hierarchical clustering to group similar proteins together:
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform
# Convert similarity to distance
dist_matrix = 1 - similarity_matrix
np.fill_diagonal(dist_matrix, 0)
# Cluster and get new order
Z = linkage(squareform(dist_matrix), method='average')
order = leaves_list(Z)
# Reorder matrix and labels
reordered_matrix = similarity_matrix[np.ix_(order, order)]
reordered_proteins = [proteins[i] for i in order]
Now plot reordered_matrix with reordered_proteins as labels. Similar proteins will be adjacent, making any cluster obvious.
Checkpoint
Look at your reordered heatmap:
- Do you notice any group of proteins with elevated similarity to each other?
- How many proteins are in that group?
- Which ones?
List the protein IDs you identify as the cluster.
Exercise 1: Generate ESM-2 Embeddings
Generate residue-level ESM-2 embeddings for all 18 proteins.
Your data
Download from Google Drive: F036 Student Materials
The CSV contains protein IDs and sequences. You'll look up what they are as part of your investigation.
Model setup
!pip -q install fair-esm
import torch
import esm
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
Your task
Write a loop that generates embeddings for all 18 proteins and stores them in a dictionary.
You did this in Route 36A for 10-20 sampled proteins. Adapt your practice code for this dataset.
Checkpoint: Each embedding should have shape (sequence_length, 1280).
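A quick way to run this checkpoint over the whole dictionary is sketched below. The names `check_embeddings`, `embeddings`, and `sequences` are hypothetical; the sketch assumes both are dictionaries keyed by protein ID and that each embedding exposes a `.shape` attribute, as NumPy arrays and PyTorch tensors do.

```python
import numpy as np

def check_embeddings(embeddings, sequences, dim=1280):
    """Return a list of problems; an empty list means every shape is right.

    Hypothetical helper: both arguments are dicts keyed by protein ID.
    """
    problems = []
    for protein_id, seq in sequences.items():
        if protein_id not in embeddings:
            problems.append(f"{protein_id}: missing embedding")
            continue
        shape = tuple(embeddings[protein_id].shape)
        expected = (len(seq), dim)
        if shape != expected:
            problems.append(f"{protein_id}: got {shape}, expected {expected}")
    return problems

# Toy example with placeholder arrays standing in for real embeddings
demo_sequences = {"P1": "MKT", "P2": "ACDEF"}
demo_embeddings = {"P1": np.zeros((3, 1280)), "P2": np.zeros((7, 1280))}
demo_problems = check_embeddings(demo_embeddings, demo_sequences)
```

A first dimension that is two longer than the sequence, as for P2 in the toy example, often means the start and end tokens that ESM-2 adds were not sliced off the representation.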
Exercise 0: Setup
Runtime
- GPU runtime required (Colab T4/L4/A100)
- CPU will be too slow for 18 proteins
Install dependencies
!pip -q install fair-esm pandas matplotlib seaborn scikit-learn
The roadmap
18 protein sequences
↓
ESM-2 embeddings
↓
Pairwise similarity matrix
↓
Find the cluster
↓
Investigate WHY
↓
Name the domain
Don't worry about understanding every step yet — just get your runtime set up and keep climbing.
Suggested chatbot prompt
If you get stuck at any point, try:
"I'm working on CHEM 169 Final Route F036. I have 18 proteins and need to find which ones share a structural domain using ESM-2 embeddings and pairwise cosine similarity. Help me [specific task]."
Intro
You've already built the core tools — ESM embeddings, window cosine similarity, heatmaps. In Routes 36A and 36B, you learned to find internal repeats and compare protein regions.
Now you're applying those skills to a real discovery problem: can you find hidden structural relationships in a set of 18 mystery proteins?
The mystery
These 18 proteins hide a secret. Some of them share something in common — a structural feature that ESM-2 can detect. Your job is to find the hidden group and figure out what connects them.
Your task
- Find which proteins cluster together
- Figure out what they have in common
- Name the shared domain
Dataset
- 18 proteins — some share a domain, others don't
- You don't know which is which — that's what you're discovering
Final Exam Route 036: PLM Protein Clustering
- RouteID: F036
- Wall: Protein Representations (W05)
- Grade: 5.10b
- Routesetter: Course Staff
- Time: 1.5 hours in class + finish by end of day
- You'll need: GPU runtime, ESM-2, 18 protein sequences, plotting libraries
🧗 Base Camp
Start here and climb your way up!