Submission

Submit your notebook here

Deliverable

Submit one notebook that includes:

Loaded protein_embeddings.pkl created in Route 36A
Implemented get_window_embedding(...) and tested it
Built reference-vs-sliding cosine similarity profile
Built full window-vs-window cosine heatmap
Wrote a short interpretation tied to UniProt annotations

Mission checklist

Loaded generated embedding dictionary
Extracted a window embedding
Computed reference-vs-sliding cosine profile
Plotted similarity profile with reference marker
Computed full similarity matrix
Plotted heatmap and interpreted biological signal

Exercise 4 (Extra Credit): Scale to More Proteins

Repeat Exercises 2-3 for at least 2 additional proteins
Compare qualitative heatmap patterns across proteins
Add one hypothesis for why patterns differ biologically

Exercise 3: Full Cosine Heatmap (Window vs Window)

Pick the same protein you used in Exercise 2
Use window size 10
Precompute all window embeddings, then compute all pairwise cosine similarities
Plot heatmap:
- X-axis: window start position
- Y-axis: window start position
- Color: cosine similarity
Interpret:
- diagonal behavior
- off-diagonal blocks/bands
- one plausible link to UniProt domain/function notes
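The precompute-then-compare step can be sketched as below. This is a minimal illustration, not the required solution: it assumes the protein's embedding is a NumPy array of shape (L, 1280), the helper names window_embeddings and cosine_matrix are hypothetical, and the random array stands in for a real protein.

```python
import numpy as np

def window_embeddings(emb, window_size):
    """Mean-pool every valid window; returns shape (L - window_size + 1, D)."""
    n_windows = emb.shape[0] - window_size + 1
    return np.stack([emb[s:s + window_size].mean(axis=0) for s in range(n_windows)])

def cosine_matrix(windows):
    """Pairwise cosine similarity between all rows of `windows`."""
    norms = np.linalg.norm(windows, axis=1, keepdims=True)
    unit = windows / np.clip(norms, 1e-12, None)  # guard against zero-length vectors
    return unit @ unit.T

# Synthetic stand-in for emb_dict[uniprot_id] (random values, not a real protein).
rng = np.random.default_rng(2)
emb = rng.normal(size=(130, 1280))

windows = window_embeddings(emb, window_size=10)
sim = cosine_matrix(windows)
print(windows.shape, sim.shape)
```

To draw the heatmap, something like matplotlib's plt.imshow(sim, origin="lower") followed by plt.colorbar() should work. Windows whose start positions are fewer than window_size apart share residues, so part of the bright band along the diagonal comes from overlap rather than biology. Keep that in mind when interpreting diagonal behavior.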

Exercise 2: Reference vs Sliding Cosine Profile

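Choose one protein from your dictionary (prefer sequence length L >= 120)
Set window_size = 5
Set ref_pos = L // 2
Compute cosine similarity between the reference window (starting at ref_pos) and every valid sliding window (start positions 0 through L - window_size)
Plot the profile against window start position, with a vertical line at ref_pos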
Write 2-4 sentences interpreting the shape of the profile
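One way the profile computation could look is sketched here. It assumes the embedding is a NumPy array of shape (L, 1280); the function names cosine_similarity and sliding_cosine_profile are hypothetical, and the random array is a synthetic stand-in for a real protein.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def sliding_cosine_profile(emb, ref_pos, window_size):
    """Similarity between the window at ref_pos and every valid sliding window."""
    n_windows = emb.shape[0] - window_size + 1
    ref = emb[ref_pos:ref_pos + window_size].mean(axis=0)
    profile = np.empty(n_windows)
    for start in range(n_windows):
        win = emb[start:start + window_size].mean(axis=0)
        profile[start] = cosine_similarity(ref, win)
    return profile

# Synthetic stand-in for emb_dict[uniprot_id] (random values, not a real protein).
rng = np.random.default_rng(1)
emb = rng.normal(size=(130, 1280))

window_size = 5
ref_pos = emb.shape[0] // 2
profile = sliding_cosine_profile(emb, ref_pos, window_size)
print(profile.shape, round(profile[ref_pos], 3))
```

For the plot, matplotlib's plt.plot(profile) together with plt.axvline(ref_pos) gives the curve and the reference marker. The value at ref_pos is the window compared with itself, so a peak there is expected; the interesting part is how quickly similarity falls off and whether it rises again elsewhere.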

Exercise 1: Build Window Embedding Function

Implement:

get_window_embedding(emb_dict, uniprot_id, start_pos, window_size)

Requirements:

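Slice the protein's embedding along the residue axis: emb_dict[uniprot_id][start_pos : start_pos + window_size]
Mean-pool the residues in that slice (average over axis 0)
Return a vector of shape (1280,)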

Required test:

Pick one ID from your dictionary
Use window_size = 5
Use start_pos = min(50, L - window_size)
Print shape and short preview values
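The requirements and test above could be met along these lines. Treat this as a sketch under assumptions rather than the reference solution: it assumes each dictionary value is a NumPy array of shape (L, 1280), the bounds check is an added safeguard not required by the exercise, and the IDs DEMO_LONG and DEMO_SHORT are made-up placeholders with random values.

```python
import numpy as np

def get_window_embedding(emb_dict, uniprot_id, start_pos, window_size):
    """Mean-pool residues [start_pos, start_pos + window_size) into one vector."""
    emb = np.asarray(emb_dict[uniprot_id])  # expected shape (L, 1280)
    seq_len = emb.shape[0]
    # Added safeguard: reject windows that would run off either end.
    if window_size <= 0 or start_pos < 0 or start_pos + window_size > seq_len:
        raise ValueError(
            f"window [{start_pos}, {start_pos + window_size}) is outside 0..{seq_len}"
        )
    return emb[start_pos:start_pos + window_size].mean(axis=0)

# Synthetic stand-in for the real dictionary (placeholder IDs, random values).
rng = np.random.default_rng(0)
demo_dict = {
    "DEMO_LONG": rng.normal(size=(60, 1280)),
    "DEMO_SHORT": rng.normal(size=(30, 1280)),
}

window_size = 5
for uid, emb in demo_dict.items():
    seq_len = emb.shape[0]
    start_pos = min(50, seq_len - window_size)  # clamps to the last valid window
    vec = get_window_embedding(demo_dict, uid, start_pos, window_size)
    print(uid, start_pos, vec.shape, np.round(vec[:5], 3))
```

The min(50, L - window_size) expression matters for short proteins: with L = 30 it picks start_pos = 25, the last position where a full 5-residue window still fits.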

Exercise 0: Load Embeddings from Route 36A

You must complete Route 36A first.

Load your generated file:

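import pickle

# Load the embedding dictionary saved at the end of Route 36A.
with open("protein_embeddings.pkl", "rb") as f:
    emb_dict = pickle.load(f)

# Confirm the container type and the number of proteins.
print(type(emb_dict), len(emb_dict))
# Inspect the first three entries; each value should have shape (L, 1280).
for k in list(emb_dict.keys())[:3]:
    print(k, emb_dict[k].shape)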

Checks:

Object is a dictionary
Multiple proteins are present (target 10-20)
Each value has shape (L, 1280)
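These checks can also be run programmatically. The sketch below is one possible approach; the helper name check_embedding_dict is hypothetical, the "at least 2 proteins" threshold is an assumed reading of "multiple", and the demo dictionaries are synthetic stand-ins for the real pickle.

```python
import numpy as np

EMBED_DIM = 1280  # embedding width stated in the checks above

def check_embedding_dict(emb_dict, embed_dim=EMBED_DIM):
    """Return a list of problems found; an empty list means every check passed."""
    if not isinstance(emb_dict, dict):
        return [f"expected dict, got {type(emb_dict).__name__}"]
    problems = []
    if len(emb_dict) < 2:  # assumed minimum for "multiple proteins"
        problems.append(f"expected multiple proteins, found {len(emb_dict)}")
    for uid, emb in emb_dict.items():
        shape = getattr(emb, "shape", None)
        if shape is None or len(shape) != 2 or shape[1] != embed_dim:
            problems.append(f"{uid}: expected shape (L, {embed_dim}), got {shape}")
    return problems

# Synthetic stand-ins (placeholder IDs, random values).
rng = np.random.default_rng(0)
demo = {
    "DEMO_1": rng.normal(size=(150, EMBED_DIM)),
    "DEMO_2": rng.normal(size=(90, EMBED_DIM)),
}
bad = {"DEMO_1": rng.normal(size=(150, 64))}

print(check_embedding_dict(demo))
print(check_embedding_dict(bad))
```

Calling check_embedding_dict(emb_dict) on your loaded dictionary and getting an empty list back means the file is ready for the exercises.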

If you do not have this file yet, complete Route 36A first.

Intro

This route is the second half of the protein language model (PLM) practice final prep.

You already generated residue-level embeddings in 36A. Now you will analyze local regions using sliding windows and cosine similarity.

The core move is the accordion idea:

residue embeddings are the expanded representation
mean pooling contracts local windows
cosine similarity compares local regions across the protein

Figure: Accordion analogy for PLM representations
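Written out, the contract-and-compare steps of the accordion idea look like this. The notation is an assumption introduced here for illustration: $e_i \in \mathbb{R}^{1280}$ is the embedding of residue $i$, $w$ is the window size, and $s$, $t$ are window start positions.

```latex
% Mean pooling contracts a window of w residue embeddings into one vector
\bar{e}_s = \frac{1}{w} \sum_{i=s}^{s+w-1} e_i

% Cosine similarity compares two pooled windows
\cos(\bar{e}_s, \bar{e}_t) = \frac{\bar{e}_s \cdot \bar{e}_t}{\lVert \bar{e}_s \rVert \, \lVert \bar{e}_t \rVert}
```

Cosine similarity depends only on the angle between the two pooled vectors, not their lengths, so it ranges from -1 to 1 and equals 1 when a window is compared with itself.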

Route 036B: Residue-Level PLM Cosine Analysis

RouteID: 036B
Wall: Protein Representations (W05)
Grade: 5.10a
Routesetter: Course Staff
Time: ~45-60 minutes
You'll need: protein_embeddings.pkl from 36A, notebook runtime, plotting, and UniProt lookup.