Submission

Submit your notebook here

Deliverable

Submit one notebook that includes:

ESM-2 residue-level embeddings for all 18 proteins
Pairwise inter-protein similarity matrix (18x18 heatmap)
Identification of the protein cluster (which proteins group together?)
Full inter-protein heatmaps for 2-3 top pairs from the cluster
Identification of the shared domain
Brief biological interpretation

Mission checklist

Generated ESM-2 embeddings for all 18 proteins
Computed pairwise inter-protein similarity
Built and plotted similarity heatmap
Identified the tight cluster of proteins
Generated full inter-protein heatmaps for top cluster pairs
Named the shared domain and answered verification questions
Wrote biological interpretation

Exercise 4: Name the Domain

You've found a cluster. You've looked up what those proteins have in common. Now prove you understand what you found.

The verification

Answer these three questions in a markdown cell:

What is the name of the shared structural domain that unites the clustered proteins?
What is the approximate size (in residues) of this domain?
Name one biological function this domain is known for.

Students who did the analysis correctly and looked up their clustered proteins will get this right. Students who didn't will guess wrong.

Exercise 3: Investigate the Cluster

This is where the science gets exciting — you've found a pattern in the data, now explain it.

Part 1 — Domain hypothesis

Look up each protein in your cluster:

What domains do they contain?
Is there a shared structural element?
What biological function might drive the similarity?

Tools you can use:

UniProt (uniprot.org) — search by ID, check "Family & Domains"
InterPro (ebi.ac.uk/interpro) — domain annotations
Your chatbot — ask it to help interpret domain annotations

Deliverable: A markdown cell stating:

What domain/motif the clustered proteins share
Where in the sequence this domain is located
Why ESM-2 would detect this similarity

Part 2 — Full inter-protein heatmaps

For the top 2-3 pairs from your cluster (highest pairwise similarity):

Generate the full window-vs-window cosine similarity matrix
Plot as a heatmap
Mark where the high-similarity hot spots occur
Do the hot spot positions match the shared domain locations?

You built this in Route 36A/36B — adapt your practice code.

Deliverable: 2-3 inter-protein heatmaps with hot spots annotated.
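
One way to locate hot spots for annotation is to pull the highest-scoring cells out of the window-vs-window matrix and convert their window indices back to residue ranges. The sketch below is illustrative only: the function name top_hot_spots and the window and stride values are assumptions, and they must match whatever settings your own window code from Route 36A/36B uses.

```python
import numpy as np


def top_hot_spots(window_sim, k=5, window=30, stride=5):
    """List the k highest-scoring window pairs with residue ranges.

    window and stride are assumptions; they must match the values used
    to build the window embeddings. Residue ranges are 1-based, inclusive.
    """
    k = min(k, window_sim.size)
    # Sort all cells by similarity, highest first, and keep the top k.
    flat_order = np.argsort(window_sim, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, window_sim.shape)
    spots = []
    for i, j in zip(rows, cols):
        start_a, start_b = i * stride + 1, j * stride + 1
        spots.append({
            "protein_a_residues": (int(start_a), int(start_a + window - 1)),
            "protein_b_residues": (int(start_b), int(start_b + window - 1)),
            "similarity": float(window_sim[i, j]),
        })
    return spots


# Toy window-vs-window matrix with one planted hot spot.
toy = np.full((10, 12), 0.2)
toy[4, 7] = 0.95
best = top_hot_spots(toy, k=1)[0]
```

Comparing the residue ranges returned here against the domain boundaries listed in UniProt or InterPro is a direct way to answer whether the hot spots fall inside the shared domain.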

Exercise 2: Build the Similarity Matrix

Compute pairwise inter-protein similarity for all protein pairs.

The metric

For each pair of proteins (A, B):

Compute all window embeddings for both proteins (you know how to do this)
Calculate cosine similarity between all window pairs
Summarize with the top-10 mean — the average of the 10 highest similarities

This gives you one number per protein pair. With N proteins there are N×(N-1)/2 unique pairs; for N = 18 that is 153.
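
The three steps of the metric can be sketched in plain NumPy. This is a minimal illustration, not the course's reference code: window_embeddings is a hypothetical stand-in for get_window_embeddings() from Route 36A, the window and stride values are assumptions, and cosine_matrix is a NumPy substitute for sklearn's cosine_similarity().

```python
import numpy as np


def window_embeddings(residue_emb, window=30, stride=5):
    """Mean-pool residue embeddings over sliding windows.

    Hypothetical stand-in for get_window_embeddings() from Route 36A;
    the window and stride values are assumptions, not course settings.
    """
    n_res = residue_emb.shape[0]
    if n_res <= window:
        # Short protein: treat the whole sequence as one window.
        return residue_emb.mean(axis=0, keepdims=True)
    starts = range(0, n_res - window + 1, stride)
    return np.stack([residue_emb[s:s + window].mean(axis=0) for s in starts])


def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def top_k_mean(sim, k=10):
    """Average of the k largest entries of a similarity matrix."""
    flat = sim.ravel()
    k = min(k, flat.size)  # fewer than k window pairs: use them all
    return float(np.sort(flat)[-k:].mean())


# Toy residue embeddings (random, for shape checking only).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(120, 1280))
emb_b = rng.normal(size=(90, 1280))

win_a = window_embeddings(emb_a)
win_b = window_embeddings(emb_b)
score = top_k_mean(cosine_matrix(win_a, win_b))
```

Averaging only the 10 best window pairs means a short shared region can produce a high score even when the rest of the two proteins is unrelated.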

Your task

Build an N×N similarity matrix and plot it as a heatmap.

You have the building blocks from practice:

get_window_embeddings() from Route 36A
cosine_similarity() from sklearn.metrics.pairwise

Now combine them to build the full pairwise matrix. Use your chatbot if you get stuck, but try it yourself first.
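
If the bookkeeping is the sticking point, the loop structure might look like the sketch below. It assumes a hypothetical pair_score callable that returns one number per protein pair (such as the top-10 mean described above); build_similarity_matrix and the placeholder IDs are illustrative names, not part of the course materials.

```python
from itertools import combinations

import numpy as np


def build_similarity_matrix(ids, pair_score):
    """Fill a symmetric N x N matrix from a pairwise scoring function.

    pair_score(id_a, id_b) is a hypothetical callable that returns one
    number per pair, for example the top-10 mean described above.
    """
    n = len(ids)
    matrix = np.eye(n)  # diagonal (self-similarity) set to 1 by convention
    for i, j in combinations(range(n), 2):
        score = pair_score(ids[i], ids[j])
        matrix[i, j] = score
        matrix[j, i] = score  # cosine-based scores are symmetric
    return matrix


# Toy example with placeholder IDs and a dummy scoring function.
toy_ids = [f"P{i:02d}" for i in range(18)]
calls = []


def dummy_score(a, b):
    calls.append((a, b))
    return 0.5


toy_matrix = build_similarity_matrix(toy_ids, dummy_score)
```

Scoring each unordered pair once and mirroring the value halves the work compared with looping over the full N×N grid.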

Pro tip: Reorder with hierarchical clustering

A raw heatmap shows proteins in arbitrary order — any cluster will be scattered and hard to see. Reorder the matrix using hierarchical clustering to group similar proteins together:

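import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# Convert similarity to distance (squareform needs a symmetric matrix with a zero diagonal)
sym_matrix = (similarity_matrix + similarity_matrix.T) / 2
dist_matrix = np.clip(1 - sym_matrix, 0, None)  # clip guards against tiny negative values
np.fill_diagonal(dist_matrix, 0)

# Cluster and get new order
Z = linkage(squareform(dist_matrix), method='average')
order = leaves_list(Z)

# Reorder matrix and labels (similarity_matrix is a NumPy array, proteins is a list of IDs)
reordered_matrix = similarity_matrix[np.ix_(order, order)]
reordered_proteins = [proteins[i] for i in order]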

Now plot reordered_matrix with reordered_proteins as labels. Similar proteins will be adjacent, making any cluster obvious.
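
A plotting call along these lines would work; treat it as a sketch under assumptions rather than required code. The function name plot_similarity_heatmap, the colormap, and the figure size are arbitrary choices, and the toy matrix stands in for your reordered_matrix and reordered_proteins.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_similarity_heatmap(matrix, labels, title="Pairwise similarity (top-10 mean)"):
    """Draw a labeled square heatmap and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(8, 7))
    image = ax.imshow(matrix, cmap="viridis")
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticklabels(labels)
    ax.set_title(title)
    fig.colorbar(image, ax=ax, label="similarity")
    fig.tight_layout()
    return fig, ax


# Toy data standing in for reordered_matrix and reordered_proteins.
rng = np.random.default_rng(1)
toy = rng.uniform(0.2, 0.9, size=(5, 5))
toy = (toy + toy.T) / 2
toy_labels = ["A", "B", "C", "D", "E"]
fig, ax = plot_similarity_heatmap(toy, toy_labels)
```

seaborn's heatmap() is an equally reasonable choice if you used it in practice.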

Checkpoint

Look at your reordered heatmap:

Do you notice any group of proteins with elevated similarity to each other?
How many proteins are in that group?
Which ones?

List the protein IDs you identify as the cluster.
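
To cross-check what you see by eye, the same average-linkage tree can be cut at a distance threshold to list groups programmatically. This is a sketch under stated assumptions: tight_groups is a hypothetical helper, and the max_distance and min_size values are illustrative, so pick them from your own heatmap rather than copying these numbers.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def tight_groups(similarity, labels, max_distance=0.3, min_size=3):
    """Return groups of labels whose average-linkage distance is small.

    max_distance and min_size are illustrative assumptions; choose them
    by looking at your own heatmap or dendrogram.
    """
    dist = 1.0 - (similarity + similarity.T) / 2
    np.fill_diagonal(dist, 0.0)
    dist = np.clip(dist, 0.0, None)
    Z = linkage(squareform(dist, checks=False), method="average")
    # Cut the tree: proteins joined below max_distance share a cluster ID.
    assignment = fcluster(Z, t=max_distance, criterion="distance")
    groups = {}
    for label, cluster_id in zip(labels, assignment):
        groups.setdefault(cluster_id, []).append(label)
    return [members for members in groups.values() if len(members) >= min_size]


# Toy matrix: proteins 0-3 are mutually similar, the rest are not.
toy_sim = np.full((8, 8), 0.3)
toy_sim[:4, :4] = 0.9
np.fill_diagonal(toy_sim, 1.0)
toy_labels = [f"prot_{i}" for i in range(8)]
found = tight_groups(toy_sim, toy_labels)
```

The threshold is a judgment call, so report the group you can also defend visually from the heatmap.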

Exercise 1: Generate ESM-2 Embeddings

Generate residue-level ESM-2 embeddings for all 18 proteins.

Your data

Download from Google Drive: F036 Student Materials

The CSV contains protein IDs and sequences. You'll look up what they are as part of your investigation.

Model setup

!pip -q install fair-esm

import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

Your task

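Write a loop that generates embeddings for all 18 proteins and stores them in a dictionary keyed by protein ID. Run the forward pass inside torch.no_grad() and move each result to the CPU so GPU memory does not fill up.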

You did this in Route 36A for 10-20 sampled proteins. Adapt your practice code for this dataset.

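Checkpoint: Each embedding should have shape (sequence_length, 1280). ESM-2's batch converter adds a start token and an end token around each sequence, so keep only positions 1 through sequence_length of the model output before storing it.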
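
A quick way to run this checkpoint over the whole dictionary is sketched below. The helper name check_embeddings and the toy IDs and sequences are made up for illustration. The check for two extra rows assumes the usual cause of a wrong shape, which is leaving in the start and end tokens that ESM-2's batch converter adds.

```python
import numpy as np


def check_embeddings(embeddings, sequences, dim=1280):
    """Return a list of problems; an empty list means every shape is right.

    embeddings: dict mapping protein ID to an array of shape (L, dim)
    sequences:  dict mapping protein ID to its amino acid string
    """
    problems = []
    for pid, seq in sequences.items():
        if pid not in embeddings:
            problems.append(f"{pid}: no embedding stored")
            continue
        shape = tuple(embeddings[pid].shape)
        expected = (len(seq), dim)
        if shape == (len(seq) + 2, dim):
            problems.append(f"{pid}: shape {shape}, special tokens not removed")
        elif shape != expected:
            problems.append(f"{pid}: shape {shape}, expected {expected}")
    return problems


# Toy example with made-up IDs and sequences.
toy_sequences = {"prot_a": "MKTAYIAK", "prot_b": "GSHMLE"}
toy_embeddings = {
    "prot_a": np.zeros((8, 1280)),   # correct: one row per residue
    "prot_b": np.zeros((8, 1280)),   # 6 residues + 2 special tokens
}
issues = check_embeddings(toy_embeddings, toy_sequences)
```

Because the function only reads .shape, it works the same way on NumPy arrays and on PyTorch tensors.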

Exercise 0: Setup

Runtime

GPU runtime required (Colab T4/L4/A100)
CPU will be too slow for 18 proteins

Install dependencies

!pip -q install fair-esm pandas matplotlib seaborn scikit-learn

The roadmap

18 protein sequences
        ↓
ESM-2 embeddings
        ↓
Pairwise similarity matrix
        ↓
Find the cluster
        ↓
Investigate WHY
        ↓
Name the domain

Don't worry about understanding every step yet — just get your runtime set up and keep climbing.

Suggested chatbot prompt

If you get stuck at any point, try:

"I'm working on CHEM 169 Final Route F036. I have 18 proteins and need to find which ones share a structural domain using ESM-2 embeddings and pairwise cosine similarity. Help me [specific task]."

Intro

You've already built the core tools — ESM embeddings, window cosine similarity, heatmaps. In Routes 36A and 36B, you learned to find internal repeats and compare protein regions.

Now you're applying those skills to a real discovery problem: can you find hidden structural relationships in a set of 18 mystery proteins?

The mystery

These 18 proteins hide a secret. Some of them share something in common — a structural feature that ESM-2 can detect. Your job is to find the hidden group and figure out what connects them.

Your task

Find which proteins cluster together
Figure out what they have in common
Name the shared domain

Dataset

18 proteins — some share a domain, others don't
You don't know which is which — that's what you're discovering

Final Exam Route 036: PLM Protein Clustering

RouteID: F036
Wall: Protein Representations (W05)
Grade: 5.10b
Routesetter: Course Staff
Time: 1.5 hours in class + finish by end of day
You'll need: GPU runtime, ESM-2, 18 protein sequences, plotting libraries