Midterm Route M1: The Embedding Shortcut
- RouteID: M001
- Wall: Protein Representations (W05)
- Grade: 5.10c (Midterm)
- Routesetter: Adrian
- Date: 02/04/2026
The Setup
You're a computational biologist who just joined a research lab studying protein evolution. Your PI hands you four human proteins and says:
"Find me the most similar proteins in E. coli. But here's the catch — BLAST is down for maintenance. Figure out another way."
You remember from class that protein language models encode sequences as vectors, and similar proteins should have similar vectors. Time to put that knowledge to work.
Your Mission
- Use embedding cosine similarity to find E. coli proteins similar to your human query proteins
- Retrieve the sequences of your top hits
- Decode a hidden message using coordinates we provide
- The message reveals a key insight about what you just did
Prerequisites
- R013 (The UniProt Topo Guide): you need the E. coli embeddings file (`per-protein.h5`)
- R015 (Vector Spaces & Projections): you should be comfortable with cosine similarity
Data Files
Download these before starting:
| File | Description | Link |
|---|---|---|
| query_proteins.h5 | ProtT5 embeddings for 4 human proteins | Download |
| per-protein.h5 | E. coli proteome embeddings (from R013) | You should already have this or know how to get it |
| ecoli.fasta | E. coli proteome sequences (from R013) | You should already have this or know how to get it |
| secret_coordinates.json | Coordinates for decoding the message | Download |
Your Query Proteins
| UniProt ID | Protein Name | Description |
|---|---|---|
| P07437 | TUBB | Beta-tubulin, microtubule component |
| P0DMV8 | HSPA1A | Heat shock protein 70 (Hsp70) |
| P28340 | POLD1 | DNA polymerase delta catalytic subunit |
| P60709 | ACTB | Beta-actin, cytoskeletal protein |
Exercise 1: Load Your Data
Goal: Load the query protein embeddings and E. coli proteome embeddings.
- Load both `.h5` files using `h5py`
- Verify the query file has 4 proteins and E. coli has ~4,400
- Check that embedding dimensions match (should be 1024)
Hints:
- You've done this before in R015
- HDF5 files work like dictionaries
Success check:
- You can list the UniProt IDs in each file
- Both have embeddings of shape `(1024,)`
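The dictionary analogy in the hints can be sketched as follows. This is a minimal illustration, not the route solution: it writes a tiny stand-in file (`toy.h5`, with made-up IDs and random vectors) so it runs without the course data. It assumes each top-level key is a protein ID that maps to one 1024-dimensional dataset, which is the layout the success check implies. The helper name `load_embeddings` is hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny stand-in file so the sketch runs without the course data.
# Assumption: one dataset per protein, keyed by ID (the IDs here are made up).
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "toy.h5")
with h5py.File(path, "w") as f:
    for pid in ["TOY001", "TOY002", "TOY003"]:
        f.create_dataset(pid, data=rng.normal(size=1024).astype(np.float32))


def load_embeddings(h5_path):
    """Return {protein_id: 1-D NumPy array} for every dataset in the file."""
    with h5py.File(h5_path, "r") as f:
        # f.keys() behaves like dict keys; f[key][()] reads the whole dataset.
        return {key: f[key][()] for key in f.keys()}


embeddings = load_embeddings(path)
print(len(embeddings), "proteins")
print(sorted(embeddings))
print({emb.shape for emb in embeddings.values()})
```

One thing to watch with real files: h5py typically lists keys in name order rather than the order they were written, so it is safer to index by an explicit ID list whenever order matters.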
Exercise 2: Compute Similarity
Goal: Compute cosine similarity between each query protein and ALL E. coli proteins.
You need to compare 4 query embeddings against ~4,400 E. coli embeddings. That's ~17,600 comparisons. You can loop through each pair, but it'll be slow. Faster approach: vectorize — stack all your embeddings into matrices and compute all similarities in one function call. NumPy and sklearn are optimized for this. If this is new to you, ask your favorite chatbot to explain the vectorized approach.
Critical: When you extract query IDs, use this exact order:
["P07437", "P0DMV8", "P28340", "P60709"]
The secret message coordinates depend on this order. If you use a different order, the message won't decode correctly.
Hints:
- Stack embeddings into matrices first
- `sklearn.metrics.pairwise.cosine_similarity` can do all comparisons at once
- Your result should be shape `(4, ~4400)`
Success check:
- Similarity matrix has shape `(4, 4403)` or similar
- Cosine similarity ranges from -1 to 1 in theory; your top hits should be in the ~0.7 range
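To make the loop-versus-vectorized contrast concrete, here is a small sketch on random stand-in matrices (not the real embeddings). It uses plain NumPy rather than `sklearn.metrics.pairwise.cosine_similarity`: each row is scaled to unit length, after which a single matrix product yields every pairwise cosine similarity. The names `cosine_similarity_matrix`, `Q`, and `E` are illustrative only.

```python
import numpy as np


def cosine_similarity_matrix(A, B):
    """Cosine similarity between every row of A and every row of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T


rng = np.random.default_rng(42)
Q = rng.normal(size=(4, 1024))   # stand-in for 4 stacked query embeddings
E = rng.normal(size=(50, 1024))  # stand-in for the stacked E. coli embeddings

sim = cosine_similarity_matrix(Q, E)
print(sim.shape)  # (4, 50)

# Cross-check one entry against the pairwise formula a loop would use.
i, j = 2, 17
pairwise = Q[i] @ E[j] / (np.linalg.norm(Q[i]) * np.linalg.norm(E[j]))
assert np.isclose(sim[i, j], pairwise)
```

With the real data, the rows of the query matrix would be stacked in the required ID order, for example with `np.stack` over an explicit list of IDs, so that row `i` of the result always refers to the same query.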
Exercise 3: Find Top Hits
Goal: For each query protein, find the 10 most similar E. coli proteins with their sequences.
- Find the indices of the 10 highest similarity values per query
- Map indices back to E. coli UniProt IDs
- Load sequences from `ecoli.fasta`
- Build a DataFrame with columns: `query`, `rank`, `ref_id`, `similarity`, `sequence`
Hints:
- `np.argsort()` gives you sorted indices (in ascending order, so the highest values come last)
- Biopython's `SeqIO.parse()` reads FASTA files
- Watch out: FASTA headers look like `sp|P0A9B2|GAPDH_ECOLI`, so you need to extract the UniProt ID
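Extracting the ID from a header of that form amounts to splitting on the pipe character. A minimal sketch, using the example header from the hint; the helper name `uniprot_id_from_header` is hypothetical. With Biopython, the same string would typically come from each record's `id` attribute.

```python
def uniprot_id_from_header(header):
    """Extract the accession from a header such as 'sp|P0A9B2|GAPDH_ECOLI'."""
    parts = header.lstrip(">").split("|")
    if len(parts) < 3:
        raise ValueError(f"unexpected FASTA header: {header!r}")
    return parts[1]


print(uniprot_id_from_header("sp|P0A9B2|GAPDH_ECOLI"))  # P0A9B2
```

Raising on an unexpected header, rather than returning the raw string, makes a mismatch between FASTA IDs and embedding keys show up immediately instead of as missing sequences later.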
Success check:
- DataFrame has 40 rows (4 queries × 10 hits)
- Each row includes the full amino acid sequence
- Top similarities around 0.70-0.78
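The ranking step can be sketched on a toy matrix. The values and reference IDs below are made up, `k` is 3 instead of 10 to keep the output short, and starting `rank` at 1 is an assumption rather than a stated requirement.

```python
import numpy as np

# Toy similarity matrix: 2 queries x 6 reference proteins (values made up).
sim = np.array([
    [0.10, 0.72, 0.35, 0.78, 0.05, 0.70],
    [0.64, 0.12, 0.71, 0.20, 0.69, 0.33],
])
ref_ids = ["R0", "R1", "R2", "R3", "R4", "R5"]  # hypothetical IDs, in column order
k = 3

# argsort is ascending, so sort the negated scores to get best-first indices.
top_idx = np.argsort(-sim, axis=1)[:, :k]

rows = []
for q, idx_row in enumerate(top_idx):
    for rank, j in enumerate(idx_row, start=1):
        rows.append({
            "query": q,
            "rank": rank,
            "ref_id": ref_ids[j],
            "similarity": float(sim[q, j]),
        })

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pandas.DataFrame`, and a `sequence` column can then be filled by looking up each `ref_id` in a dictionary built from the FASTA file.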
Exercise 4: Explore Your Hits
Goal: Understand what proteins you found.
Before decoding the message, explore your results:
- For each query, what's the #1 hit? Look it up on UniProt.
- Learn something about both the query protein and its top hit — what do they do? Are they related in some way (function, structure, evolutionary history)?
- Do any queries give similar hit lists? Why might that be?
Write 2-3 sentences about patterns you observe.
Hint: UniProt has a REST API if you want to fetch protein names programmatically.
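One possible way to use the REST API is sketched below with only the standard library. The URL pattern and the JSON field names reflect the UniProtKB entry format as commonly documented, but treat them as assumptions and check the current UniProt API documentation before relying on them. The helper names `extract_protein_name` and `fetch_protein_name` are hypothetical.

```python
import json
import urllib.request

# Assumed URL pattern for a single UniProtKB entry in JSON form.
UNIPROT_ENTRY_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def extract_protein_name(entry):
    """Pull a display name out of a parsed UniProtKB entry (a dict)."""
    description = entry.get("proteinDescription", {})
    recommended = description.get("recommendedName")
    if recommended:
        return recommended["fullName"]["value"]
    # Unreviewed entries may carry submitted names instead.
    submitted = description.get("submissionNames", [])
    return submitted[0]["fullName"]["value"] if submitted else None


def fetch_protein_name(accession, timeout=10):
    """Fetch one entry over HTTPS and return its protein name."""
    url = UNIPROT_ENTRY_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_protein_name(json.load(response))
```

Calling `fetch_protein_name("P07437")` needs network access. For 40 hits, a short pause between requests is a courteous default.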
Exercise 5: Decode the Secret Message
Goal: Use the provided coordinates to reveal a hidden message.
Download secret_coordinates.json. It contains a list of (row_index, position) tuples. Each tuple tells you:
- Which row in your DataFrame (`df_top_hits.iloc[row_index]`)
- Which position in that protein's sequence
Extract the amino acid at each position. Concatenate them. Spaces are marked as ("SPACE", -1).
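The decoding logic can be sketched on toy data. The sequences and coordinates below are invented for illustration. Two assumptions are worth checking against the real file: that each pair is stored as a two-element JSON array, and that positions are 0-based.

```python
import json

# Toy stand-ins: in the route, sequences come from the `sequence` column of
# df_top_hits and coordinates from secret_coordinates.json.
sequences = ["MKHAL", "GIVLA", "SLLDE"]
coordinates = json.loads(
    '[[0, 2], [1, 1], ["SPACE", -1], [0, 3], [2, 1], [2, 2]]'
)


def decode(coords, seqs):
    """Concatenate the residue at each (row_index, position) pair."""
    letters = []
    for row_index, position in coords:
        if row_index == "SPACE":
            letters.append(" ")
        else:
            # Assumption: positions are 0-based string indices.
            letters.append(seqs[row_index][position])
    return "".join(letters)


print(decode(coordinates, sequences))  # HI ALL
```

If the real message comes out as near-gibberish, an off-by-one in the position convention or a different query order in the DataFrame are the first things to check.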
Success check:
- You decode a readable English message (with some quirky spelling — amino acids don't cover the whole alphabet!)
- The message is the take-home message of this midterm. If you remember one thing from this route, it should be this.
Exercise 6: Reflection
Goal: Connect the message to what you did.
Answer in your notebook (2-3 sentences each):
- What does the decoded message mean? Explain it in your own words.
- You found similar proteins without BLAST. What did you use instead, and why does it work?
Deliverables
Submit your completed notebook (.ipynb) with:
- All code cells executed
- The decoded secret message clearly displayed
- Your reflection answers in markdown cells
- A Logbook section at the end with `[LOGBOOK]` entries: short notes about your thinking process
Reminder: We've switched to including [LOGBOOK] entries directly in each route's notebook rather than in a separate file. Add your logbook entries as markdown cells at the end of this notebook.
Submission