Midterm Route M1: The Embedding Shortcut

RouteID: M001
Wall: Protein Representations (W05)
Grade: 5.10c (Midterm)
Routesetter: Adrian
Date: 02/04/2026

The Setup

You're a computational biologist who just joined a research lab studying protein evolution. Your PI hands you four human proteins and says:

"Find me the most similar proteins in E. coli. But here's the catch — BLAST is down for maintenance. Figure out another way."

You remember from class that protein language models encode sequences as vectors, and similar proteins should have similar vectors. Time to put that knowledge to work.

Your Mission

Use embedding cosine similarity to find E. coli proteins similar to your human query proteins
Retrieve the sequences of your top hits
Decode a hidden message using coordinates we provide
The message reveals a key insight about what you just did

Prerequisites

R013 (The UniProt Topo Guide) — you need the E. coli embeddings file (per-protein.h5)
R015 (Vector Spaces & Projections) — you should be comfortable with cosine similarity

Data Files

Download these before starting:

File	Description	Link
`query_proteins.h5`	ProtT5 embeddings for 4 human proteins	Download
`per-protein.h5`	E. coli proteome embeddings (from R013)	You should already have this or know how to get it
`ecoli.fasta`	E. coli proteome sequences (from R013)	You should already have this or know how to get it
`secret_coordinates.json`	Coordinates for decoding the message	Download

Your Query Proteins

UniProt ID	Protein Name	Description
P07437	TUBB	Beta-tubulin, microtubule component
P0DMV8	HSPA1A	Heat shock protein 70 (Hsp70)
P28340	POLD1	DNA polymerase delta catalytic subunit
P60709	ACTB	Beta-actin, cytoskeletal protein

Exercise 1: Load Your Data

Goal: Load the query protein embeddings and E. coli proteome embeddings.

Load both .h5 files using h5py
Verify that the query file has 4 proteins and the E. coli file has ~4,400
Check that embedding dimensions match (should be 1024)

Hints:

You've done this before in R015
HDF5 files work like dictionaries

Success check:

You can list the UniProt IDs in each file
Both have embeddings of shape (1024,)
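
Since an open HDF5 file behaves like a dictionary keyed by protein ID, the loading step can be sketched with a helper that accepts any mapping. This is only a sketch: the helper name `stack_embeddings`, the flat one-dataset-per-ID file layout, and the random toy data are assumptions for illustration, so check them against your actual files.

```python
import numpy as np


def stack_embeddings(store, ids=None):
    """Return (ids, matrix) from a dict-like store of ID -> 1-D embedding.

    `store` can be an open h5py.File (ASSUMED layout: one dataset per
    UniProt ID at the top level) or a plain dict, as in the demo below.
    """
    if ids is None:
        ids = list(store.keys())
    # np.asarray reads an h5py dataset into memory; for arrays it just converts dtype.
    matrix = np.stack([np.asarray(store[i], dtype=np.float32) for i in ids])
    return list(ids), matrix


# Toy stand-in for a file. With real data you would do something like:
#   with h5py.File("query_proteins.h5", "r") as f:
#       query_ids, Q = stack_embeddings(f)
rng = np.random.default_rng(0)
toy_store = {pid: rng.normal(size=1024) for pid in ["P07437", "P0DMV8"]}

toy_ids, toy_matrix = stack_embeddings(toy_store)
print(toy_ids, toy_matrix.shape)
```

Passing an explicit `ids` list fixes the row order of the matrix, which matters later when the query order has to be exact.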

Exercise 2: Compute Similarity

Goal: Compute cosine similarity between each query protein and ALL E. coli proteins.

You need to compare 4 query embeddings against ~4,400 E. coli embeddings, which is ~17,600 comparisons. Looping over every pair works, but it is slow. The faster approach is to vectorize: stack the embeddings into two matrices (one for the queries, one for E. coli) and compute every similarity in a single function call. NumPy and sklearn are optimized for this. If this is new to you, ask your favorite chatbot to explain the vectorized approach.

Critical: When you extract query IDs, use this exact order:

["P07437", "P0DMV8", "P28340", "P60709"]

The secret message coordinates depend on this order. If you use a different order, the message won't decode correctly.

Hints:

Stack embeddings into matrices first
sklearn.metrics.pairwise.cosine_similarity can do all comparisons at once
Your result should be shape (4, ~4400)

Success check:

Similarity matrix has shape (4, 4403) or similar
Cosine similarity ranges from -1 to 1 in theory; your top hits should be in the ~0.7 range
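
To make the vectorized idea concrete, here is a minimal sketch in plain NumPy: normalize every row to unit length, then a single matrix product yields all pairwise cosine similarities. It is a stand-in for sklearn's `cosine_similarity`, not a replacement for it, and the function name, random toy matrices, and reduced sizes are assumptions for the demo.

```python
import numpy as np


def cosine_similarity_matrix(A, B):
    """Cosine similarity between every row of A (m, d) and every row of B (n, d)."""
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    # Scale each row to unit length; one matrix product then gives all dot products.
    # Note: an all-zero row would cause a division by zero here.
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T


# Toy data standing in for the real embeddings (reference set shrunk for the demo).
rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 1024))    # 4 "query" vectors
E = rng.normal(size=(50, 1024))   # 50 "reference" vectors

sim = cosine_similarity_matrix(Q, E)

# Spot-check one entry against the single-pair formula.
pair = Q[2] @ E[7] / (np.linalg.norm(Q[2]) * np.linalg.norm(E[7]))
print(sim.shape, np.isclose(sim[2, 7], pair))
```

Row i of the result corresponds to row i of the query matrix, so the order in which the queries are stacked determines the row order of the similarity matrix.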

Exercise 3: Find Top Hits

Goal: For each query protein, find the 10 most similar E. coli proteins with their sequences.

Find the indices of the 10 highest similarity values per query
Map indices back to E. coli UniProt IDs
Load sequences from ecoli.fasta
Build a DataFrame with columns: query, rank, ref_id, similarity, sequence

Hints:

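np.argsort() returns indices in ascending order, so take the last 10 and reverse them to get the highest similarities first
BioPython's SeqIO.parse(path, "fasta") reads FASTA files
Watch out: FASTA headers look like sp|P0A9B2|GAPDH_ECOLI, and the UniProt ID is the middle field between the pipes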

Success check:

DataFrame has 40 rows (4 queries × 10 hits)
Each row includes the full amino acid sequence
Top similarities around 0.70-0.78
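
The two fiddly parts of this exercise, ranking with `np.argsort` and pulling the ID out of a FASTA header, can be sketched on toy data as follows. The helper names, the single-letter reference IDs, the similarity values, and the choice to start ranks at 1 are all assumptions for illustration.

```python
import numpy as np


def top_k_indices(sim_row, k=10):
    """Indices of the k largest values in sim_row, best first."""
    # argsort is ascending, so take the last k and reverse them.
    return np.argsort(sim_row)[-k:][::-1]


def uniprot_id_from_header(header):
    """Pull the accession out of a header such as 'sp|P0A9B2|GAPDH_ECOLI ...'."""
    text = header.lstrip(">")
    parts = text.split("|")
    # Fall back to the first whitespace-delimited token if there are no pipes.
    return parts[1] if len(parts) >= 3 else text.split()[0]


# Toy similarity row and IDs (hypothetical values, for illustration only).
ref_ids = ["A", "B", "C", "D", "E"]
sim_row = np.array([0.10, 0.75, 0.40, 0.72, 0.05])

rows = [
    {"rank": rank, "ref_id": ref_ids[j], "similarity": float(sim_row[j])}
    for rank, j in enumerate(top_k_indices(sim_row, k=3), start=1)
]
print(rows)
print(uniprot_id_from_header(">sp|P0A9B2|GAPDH_ECOLI"))
```

A list of dictionaries like `rows` can be passed straight to `pandas.DataFrame`; the real table would also carry the query and sequence columns.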

Exercise 4: Explore Your Hits

Goal: Understand what proteins you found.

Before decoding the message, explore your results:

For each query, what's the #1 hit? Look it up on UniProt.
Learn something about both the query protein and its top hit — what do they do? Are they related in some way (function, structure, evolutionary history)?
Do any queries give similar hit lists? Why might that be?

Write 2-3 sentences about patterns you observe.

Hint: UniProt has a REST API if you want to fetch protein names programmatically.

Exercise 5: Decode the Secret Message

Goal: Use the provided coordinates to reveal a hidden message.

Download secret_coordinates.json. It contains a list of (row_index, position) pairs. JSON has no tuple type, so each pair loads in Python as a two-element list. Each pair tells you:

Which row in your DataFrame (df_top_hits.iloc[row_index])
Which position in that protein's sequence

Extract the amino acid at each position. Concatenate them. Spaces are marked as ("SPACE", -1).

Success check:

You decode a readable English message (with some quirky spelling — amino acids don't cover the whole alphabet!)
The message is the take-home message of this midterm. If you remember one thing from this route, it should be this.
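
The decoding logic described in this exercise can be sketched on made-up data, without touching the real message. Everything here is an assumption for illustration: the function name, the toy sequences and coordinates, the JSON being an array of two-element arrays, and above all the 0-based positions, which the instructions do not specify.

```python
import json


def decode_message(coordinates, sequences):
    """Turn (row_index, position) pairs into text.

    `sequences[row_index]` plays the role of
    df_top_hits.iloc[row_index]["sequence"].
    ASSUMPTION: positions are 0-based. If the output looks like noise,
    try position - 1 (1-based) instead.
    """
    chars = []
    for row_index, position in coordinates:
        if row_index == "SPACE":
            chars.append(" ")
        else:
            chars.append(sequences[row_index][position])
    return "".join(chars)


# Toy stand-ins: two made-up "sequences" and a made-up coordinate list.
toy_sequences = ["MKHIT", "AEWLD"]
toy_json = '[[0, 2], [0, 3], ["SPACE", -1], [1, 3], [1, 1], [1, 4]]'

# json.loads returns lists, not tuples; they unpack the same way.
toy_coordinates = json.loads(toy_json)
print(decode_message(toy_coordinates, toy_sequences))
```

With the real data, `sequences` would be the sequence column of your top-hits DataFrame in its original row order, for example `df_top_hits["sequence"].tolist()`.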

Exercise 6: Reflection

Goal: Connect the message to what you did.

Answer in your notebook (2-3 sentences each):

What does the decoded message mean? Explain it in your own words.
You found similar proteins without BLAST. What did you use instead, and why does it work?

Deliverables

Submit your completed notebook (.ipynb) with:

All code cells executed
The decoded secret message clearly displayed
Your reflection answers in markdown cells
A Logbook section at the end with [LOGBOOK] entries — short notes about your thinking process

Reminder: We've switched to including [LOGBOOK] entries directly in each route's notebook rather than in a separate file. Add your logbook entries as markdown cells at the end of this notebook.

Submission

Submit your notebook here