Final Exam Route 36 - PLM Protein Clustering

Route ID: F036 • Wall: W05 • Released: Mar 16, 2026

Grade: 5.10b

🎉 Sent!

You made it to the top. Submit your work below!

Submission

Submit your notebook here


Deliverable

Submit one notebook that includes:

  1. ESM-2 residue-level embeddings for all 18 proteins
  2. Pairwise inter-protein similarity matrix (18x18 heatmap)
  3. Identification of the protein cluster (which proteins group together?)
  4. Full inter-protein heatmaps for 2-3 top pairs from the cluster
  5. Identification of the shared domain
  6. Brief biological interpretation

Mission checklist

  • Generated ESM-2 embeddings for all 18 proteins
  • Computed pairwise inter-protein similarity
  • Built and plotted similarity heatmap
  • Identified the tight cluster of proteins
  • Generated full inter-protein heatmaps for top cluster pairs
  • Named the shared domain and answered verification questions
  • Wrote biological interpretation

Exercise 4: Name the Domain

You've found a cluster. You've looked up what those proteins have in common. Now prove you understand what you found.

The verification

Answer these three questions in a markdown cell:

  1. What is the name of the shared structural domain that unites the clustered proteins?

  2. What is the approximate size (in residues) of this domain?

  3. Name one biological function this domain is known for.

Students who did the analysis correctly and looked up their clustered proteins will get this right. Students who didn't will guess wrong.


Exercise 3: Investigate the Cluster

This is where the science gets exciting: you've found a pattern in the data, and now you need to explain it.

Part 1 — Domain hypothesis

Look up each protein in your cluster:

  • What domains do they contain?
  • Is there a shared structural element?
  • What biological function might drive the similarity?

Tools you can use:

  • UniProt (uniprot.org) — search by ID, check "Family & Domains"
  • InterPro (ebi.ac.uk/interpro) — domain annotations
  • Your chatbot — ask it to help interpret domain annotations

Deliverable: A markdown cell stating:

  • What domain/motif the clustered proteins share
  • Where in the sequence this domain is located
  • Why ESM-2 would detect this similarity

Part 2 — Full inter-protein heatmaps

For the top 2-3 pairs from your cluster (highest pairwise similarity):

  1. Generate the full window-vs-window cosine similarity matrix
  2. Plot as a heatmap
  3. Mark where the high-similarity hot spots occur
  4. Check whether the hot spot positions match the shared domain locations

You built this in Route 36A/36B — adapt your practice code.

Deliverable: 2-3 inter-protein heatmaps with hot spots annotated.
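
To decide where to mark hot spots, it can help to list the highest cells of the window-vs-window matrix before annotating the plot. The following is a minimal sketch, not the course's reference solution: the function name top_hot_spots and the n_top parameter are hypothetical, and it assumes the matrix has one row per window of protein A and one column per window of protein B.

```python
import numpy as np

def top_hot_spots(sim, n_top=5):
    """Return (row, col, value) for the n_top highest cells of a similarity matrix.

    sim: 2D array, rows = windows of protein A, columns = windows of protein B.
    """
    sim = np.asarray(sim)
    n_top = min(n_top, sim.size)
    # Sort the flattened matrix, reverse to get highest first, keep n_top indices.
    flat_idx = np.argsort(sim, axis=None)[::-1][:n_top]
    rows, cols = np.unravel_index(flat_idx, sim.shape)
    return [(int(r), int(c), float(sim[r, c])) for r, c in zip(rows, cols)]

# Toy matrix with two planted hot spots (illustrative values only).
toy = np.zeros((6, 8))
toy[2, 5] = 0.9
toy[3, 6] = 0.8
spots = top_hot_spots(toy, n_top=2)
```

The row and column values are window indices, not residue positions. Converting them to residue ranges depends on the window size and stride you used in practice, so compare against domain annotations only after that conversion.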


Exercise 2: Build the Similarity Matrix

Compute pairwise inter-protein similarity for all protein pairs.

The metric

For each pair of proteins (A, B):

  1. Compute all window embeddings for both proteins (you know how to do this)
  2. Calculate the cosine similarity between every window of A and every window of B
  3. Summarize with the top-10 mean: the average of the 10 highest similarities

This gives you one number per protein pair. With N proteins you'll have N×(N-1)/2 unique pairs; for N = 18 that is 153.
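
The metric above can be sketched in a few lines of NumPy. This is one possible implementation under stated assumptions, not the required one: the function name top_k_mean_similarity is hypothetical, and the inputs are assumed to be window embeddings you have already computed, one row per window. It normalizes rows by hand; sklearn's cosine_similarity should give the same window-vs-window matrix.

```python
import numpy as np

def top_k_mean_similarity(windows_a, windows_b, k=10):
    """Mean of the k highest cosine similarities between windows of A and B.

    windows_a: array of shape (n_windows_a, dim)
    windows_b: array of shape (n_windows_b, dim)
    """
    a = np.asarray(windows_a, dtype=float)
    b = np.asarray(windows_b, dtype=float)
    # Scale each window embedding to unit length so dot products are cosines.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = (a @ b.T).ravel()  # every window of A against every window of B
    k = min(k, sims.size)     # short proteins may have fewer than k window pairs
    # np.partition puts the k largest values in the last k positions.
    top_k = np.partition(sims, sims.size - k)[-k:]
    return float(top_k.mean())

# Random stand-in data (illustrative only, not real embeddings).
rng = np.random.default_rng(0)
wa = rng.normal(size=(30, 1280))
wb = rng.normal(size=(25, 1280))
score = top_k_mean_similarity(wa, wb)
```

Because only the 10 best window pairs count, a short shared region can produce a high score even when the rest of the two proteins looks unrelated.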

Your task

Build an N×N similarity matrix and plot it as a heatmap.

You have the building blocks from practice:

  • get_window_embeddings() from Route 36A
  • cosine_similarity() from sklearn.metrics.pairwise

Now combine them to build the full pairwise matrix. Use your chatbot if you get stuck, but try it yourself first.
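
If you want a starting shape for the loop, here is a hedged sketch of filling a symmetric N×N matrix from per-protein window embeddings. The names build_similarity_matrix, score_fn, and top10_mean are hypothetical, the protein IDs and arrays are random stand-ins, and setting the diagonal to 1 is a plotting convention assumed here rather than something the route specifies.

```python
import numpy as np
from itertools import combinations

def top10_mean(a, b):
    """Stand-in pair score: mean of the 10 highest window-vs-window cosines."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = np.sort((a @ b.T).ravel())
    return float(sims[-10:].mean())

def build_similarity_matrix(window_embs, score_fn):
    """window_embs: dict of protein ID -> (n_windows, dim) array."""
    proteins = list(window_embs)
    n = len(proteins)
    similarity_matrix = np.eye(n)  # assumed convention: self-similarity is 1
    # Score each unordered pair once, then mirror it across the diagonal.
    for i, j in combinations(range(n), 2):
        s = score_fn(window_embs[proteins[i]], window_embs[proteins[j]])
        similarity_matrix[i, j] = s
        similarity_matrix[j, i] = s
    return proteins, similarity_matrix

# Four made-up proteins with random window embeddings (illustrative only).
rng = np.random.default_rng(1)
window_embs = {f"P{i}": rng.normal(size=(12 + i, 16)) for i in range(1, 5)}
proteins, similarity_matrix = build_similarity_matrix(window_embs, top10_mean)
```

Scoring each pair once and mirroring the value keeps the matrix exactly symmetric, which matters if you later convert it to distances for hierarchical clustering.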

Pro tip: Reorder with hierarchical clustering

A raw heatmap shows proteins in arbitrary order — any cluster will be scattered and hard to see. Reorder the matrix using hierarchical clustering to group similar proteins together:

import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# Convert similarity to distance (squareform expects a symmetric matrix with a zero diagonal)
dist_matrix = 1 - similarity_matrix
np.fill_diagonal(dist_matrix, 0)

# Cluster and get new order
Z = linkage(squareform(dist_matrix), method='average')
order = leaves_list(Z)

# Reorder matrix and labels
reordered_matrix = similarity_matrix[np.ix_(order, order)]
reordered_proteins = [proteins[i] for i in order]

Now plot reordered_matrix with reordered_proteins as labels. Similar proteins will be adjacent, making any cluster obvious.

Checkpoint

Look at your reordered heatmap:

  • Do you notice any group of proteins with elevated similarity to each other?
  • How many proteins are in that group?
  • Which ones?

List the protein IDs you identify as the cluster.


Exercise 1: Generate ESM-2 Embeddings

Generate residue-level ESM-2 embeddings for all 18 proteins.

Your data

Download from Google Drive: F036 Student Materials

The CSV contains protein IDs and sequences. You'll look up what they are as part of your investigation.

Model setup

!pip -q install fair-esm

import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

Your task

Write a loop that generates embeddings for all 18 proteins and stores them in a dictionary.

You did this in Route 36A for 10-20 sampled proteins. Adapt your practice code for this dataset.

Checkpoint: Each embedding should have shape (sequence_length, 1280).
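
A quick way to run this checkpoint over the whole dictionary is sketched below. It is only an illustration: check_embeddings and the embeddings and sequences dictionaries (protein ID to array, protein ID to sequence string) are hypothetical names, so adapt them to whatever your loop produces.

```python
import numpy as np

def check_embeddings(embeddings, sequences, dim=1280):
    """Return a list of problems; an empty list means every embedding passed."""
    problems = []
    for pid, seq in sequences.items():
        if pid not in embeddings:
            problems.append(f"{pid}: missing embedding")
            continue
        shape = tuple(np.shape(embeddings[pid]))
        expected = (len(seq), dim)
        if shape != expected:
            problems.append(f"{pid}: shape {shape}, expected {expected}")
    return problems

# Toy example: the second entry still carries two extra token positions.
sequences = {"P1": "MKT", "P2": "ACDEFG"}
embeddings = {"P1": np.zeros((3, 1280)), "P2": np.zeros((8, 1280))}
problems = check_embeddings(embeddings, sequences)
```

An embedding that is two rows longer than its sequence usually means the start and end tokens added by the batch converter were not sliced off.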


Exercise 0: Setup

Runtime

  • GPU runtime required (Colab T4/L4/A100)
  • CPU will be too slow for 18 proteins

Install dependencies

!pip -q install fair-esm pandas matplotlib seaborn scikit-learn

The roadmap

18 protein sequences
        ↓
    ESM-2 embeddings
        ↓
    Pairwise similarity matrix
        ↓
    Find the cluster
        ↓
    Investigate WHY
        ↓
    Name the domain

Don't worry about understanding every step yet — just get your runtime set up and keep climbing.

Suggested chatbot prompt

If you get stuck at any point, try:

"I'm working on CHEM 169 Final Route F036. I have 18 proteins and need to find which ones share a structural domain using ESM-2 embeddings and pairwise cosine similarity. Help me [specific task]."


Intro

You've already built the core tools — ESM embeddings, window cosine similarity, heatmaps. In Routes 36A and 36B, you learned to find internal repeats and compare protein regions.

Now you're applying those skills to a real discovery problem: can you find hidden structural relationships in a set of 18 mystery proteins?

The mystery

These 18 proteins hide a secret. Some of them share a structural feature that ESM-2 can detect. Your job is to find the hidden group and figure out what connects them.

Your task

  1. Find which proteins cluster together
  2. Figure out what they have in common
  3. Name the shared domain

Dataset

  • 18 proteins — some share a domain, others don't
  • You don't know which is which — that's what you're discovering

Final Exam Route 036: PLM Protein Clustering

  • RouteID: F036
  • Wall: Protein Representations (W05)
  • Grade: 5.10b
  • Routesetter: Course Staff
  • Time: 1.5 hours in class + finish by end of day
  • You'll need: GPU runtime, ESM-2, 18 protein sequences, plotting libraries

🧗 Base Camp

Start here and climb your way up!