
The Embedding Shortcut

Route ID: M001 • Wall: W05 • Released: Feb 4, 2026

5.10c

Midterm Route M1: The Embedding Shortcut

  • RouteID: M001
  • Wall: Protein Representations (W05)
  • Grade: 5.10c (Midterm)
  • Routesetter: Adrian
  • Date: 02/04/2026

The Setup

You're a computational biologist who just joined a research lab studying protein evolution. Your PI hands you four human proteins and says:

"Find me the most similar proteins in E. coli. But here's the catch — BLAST is down for maintenance. Figure out another way."

You remember from class that protein language models encode sequences as vectors, and similar proteins should have similar vectors. Time to put that knowledge to work.

Your Mission

  1. Use embedding cosine similarity to find E. coli proteins similar to your human query proteins
  2. Retrieve the sequences of your top hits
  3. Decode a hidden message using coordinates we provide
  4. The message reveals a key insight about what you just did

Prerequisites

  • R013 (The UniProt Topo Guide) — you need the E. coli embeddings file (per-protein.h5)
  • R015 (Vector Spaces & Projections) — you should be comfortable with cosine similarity

Data Files

Download these before starting:

File | Description | Link
query_proteins.h5 | ProtT5 embeddings for 4 human proteins | Download
per-protein.h5 | E. coli proteome embeddings (from R013) | You should already have this or know how to get it
ecoli.fasta | E. coli proteome sequences (from R013) | You should already have this or know how to get it
secret_coordinates.json | Coordinates for decoding the message | Download

Your Query Proteins

UniProt ID | Protein Name | Description
P07437 | TUBB | Beta-tubulin, microtubule component
P0DMV8 | HSPA1A | Heat shock protein 70 (Hsp70)
P28340 | POLD1 | DNA polymerase delta catalytic subunit
P60709 | ACTB | Beta-actin, cytoskeletal protein

Exercise 1: Load Your Data

Goal: Load the query protein embeddings and E. coli proteome embeddings.

  1. Load both .h5 files using h5py
  2. Verify that the query file has 4 proteins and the E. coli file has ~4,400
  3. Check that embedding dimensions match (should be 1024)

Hints:

  • You've done this before in R015
  • HDF5 files work like dictionaries

Success check:

  • You can list the UniProt IDs in each file
  • Both have embeddings of shape (1024,)
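
To make the "HDF5 files work like dictionaries" hint concrete, here is a minimal sketch. It assumes each file stores one top-level dataset per UniProt ID; check that against your own files. The helper name load_embeddings and the demo file are made up for illustration, and the vectors are random stand-ins rather than real ProtT5 embeddings.

```python
import os
import tempfile

import h5py
import numpy as np


def load_embeddings(path):
    """Read every dataset in an HDF5 file into a {protein_id: vector} dict."""
    with h5py.File(path, "r") as f:
        # Assumption: one top-level dataset per UniProt ID.
        return {pid: np.array(f[pid][:], dtype=np.float32) for pid in f.keys()}


# Build a tiny stand-in file so the sketch runs without the course data.
rng = np.random.default_rng(0)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    for pid in ["P07437", "P0DMV8"]:
        f.create_dataset(pid, data=rng.normal(size=1024).astype(np.float32))

demo = load_embeddings(demo_path)
print(len(demo), "proteins")
print({pid: vec.shape for pid, vec in demo.items()})
```

Pointing the same helper at query_proteins.h5 and per-protein.h5 should let you run the three checks above, provided the files follow the assumed layout.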

Exercise 2: Compute Similarity

Goal: Compute cosine similarity between each query protein and ALL E. coli proteins.

You need to compare 4 query embeddings against ~4,400 E. coli embeddings, which is ~17,600 comparisons. Looping through each pair works, but it will be slow. A faster approach is to vectorize: stack all your embeddings into matrices and compute every similarity in one function call. NumPy and sklearn are optimized for this. If this is new to you, ask your favorite chatbot to explain the vectorized approach.

Critical: When you extract query IDs, use this exact order:

["P07437", "P0DMV8", "P28340", "P60709"]

The secret message coordinates depend on this order. If you use a different order, the message won't decode correctly.

Hints:

  • Stack embeddings into matrices first
  • sklearn.metrics.pairwise.cosine_similarity can do all comparisons at once
  • Your result should be shape (4, ~4400)

Success check:

  • Similarity matrix has shape (4, 4403) or similar
  • Cosine similarity ranges from -1 to 1 in theory; your top hits should be in the ~0.7 range
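
If the vectorized idea feels abstract, the sketch below shows one way to do it with plain NumPy: normalize every row to unit length, then a single matrix product gives all pairwise cosine similarities. The reference IDs and vectors are random toy data (an assumption for illustration), and cosine_similarity_matrix is a hypothetical helper name. Note how the query matrix is stacked in the required ID order, whatever order the dictionary happens to use.

```python
import numpy as np

QUERY_ORDER = ["P07437", "P0DMV8", "P28340", "P60709"]


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a (n, d) and every row of b (m, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T  # shape (n, m)


# Toy stand-ins for the two embedding dicts (random vectors, made-up ref IDs).
rng = np.random.default_rng(0)
query_emb = {pid: rng.normal(size=1024) for pid in reversed(QUERY_ORDER)}
ref_emb = {f"REF{i:04d}": rng.normal(size=1024) for i in range(50)}

# Stack in a fixed order so row i always corresponds to QUERY_ORDER[i].
ref_ids = list(ref_emb)
Q = np.stack([query_emb[pid] for pid in QUERY_ORDER])
R = np.stack([ref_emb[pid] for pid in ref_ids])

sim = cosine_similarity_matrix(Q, R)
print(sim.shape)  # (4, 50)
```

Calling sklearn.metrics.pairwise.cosine_similarity(Q, R) should return the same matrix, so use whichever you find easier to read.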

Exercise 3: Find Top Hits

Goal: For each query protein, find the 10 most similar E. coli proteins with their sequences.

  1. Find the indices of the 10 highest similarity values per query
  2. Map indices back to E. coli UniProt IDs
  3. Load sequences from ecoli.fasta
  4. Build a DataFrame with columns: query, rank, ref_id, similarity, sequence

Hints:

  • np.argsort() gives you sorted indices
  • BioPython's SeqIO.parse() reads FASTA files
  • Watch out: FASTA headers look like sp|P0A9B2|G3P1_ECOLI, so extract the UniProt ID (the field between the two pipes)
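
The header hint can be sketched as follows. This uses a small hand-rolled FASTA reader as a dependency-free stand-in for SeqIO.parse(); the records, accessions, and function names are made up for illustration, and the code assumes UniProt-style db|ACCESSION|ENTRY_NAME headers.

```python
def uniprot_id_from_header(header):
    """Return the accession from a UniProt-style header such as 'sp|X00001|NAME'."""
    header = header.lstrip(">")
    parts = header.split("|")
    # UniProt headers are db|ACCESSION|ENTRY_NAME; otherwise use the first token.
    if len(parts) >= 3:
        return parts[1]
    tokens = header.split()
    return tokens[0] if tokens else ""


def read_fasta(lines):
    """Minimal FASTA reader: yields (uniprot_id, sequence) pairs."""
    current_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                yield current_id, "".join(chunks)
            current_id, chunks = uniprot_id_from_header(line), []
        else:
            chunks.append(line)
    if current_id is not None:
        yield current_id, "".join(chunks)


# Made-up records, only to show the parsing.
demo_fasta = """>sp|X00001|DEMO1_ECOLI Made-up protein 1
MKTAYIAK
QRQISFVK
>sp|X00002|DEMO2_ECOLI Made-up protein 2
MSHHWGYG
""".splitlines()

sequences = dict(read_fasta(demo_fasta))
print(sequences)  # {'X00001': 'MKTAYIAKQRQISFVK', 'X00002': 'MSHHWGYG'}
```

With BioPython, record.id holds the first whitespace-delimited token of the header, so record.id.split("|")[1] should give the same accession.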

Success check:

  • DataFrame has 40 rows (4 queries × 10 hits)
  • Each row includes the full amino acid sequence
  • Top similarities around 0.70-0.78
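
One possible shape for the ranking step is sketched below on hand-written toy numbers. The helper name top_k_hits, the IDs, and the sequences are all hypothetical; the function returns a list of dicts so that the sketch needs only NumPy.

```python
import numpy as np


def top_k_hits(sim, query_ids, ref_ids, sequences, k=10):
    """Return one dict per (query, hit) pair, ranked 1..k by similarity."""
    rows = []
    for qi, query in enumerate(query_ids):
        # argsort is ascending, so reverse it and keep the first k indices.
        best = np.argsort(sim[qi])[::-1][:k]
        for rank, ri in enumerate(best, start=1):
            ref_id = ref_ids[ri]
            rows.append({
                "query": query,
                "rank": rank,
                "ref_id": ref_id,
                "similarity": float(sim[qi, ri]),
                "sequence": sequences.get(ref_id, ""),
            })
    return rows


# Toy inputs: made-up IDs and sequences, hand-written similarities.
sim = np.array([[0.10, 0.75, 0.40],
                [0.72, 0.20, 0.71]])
rows = top_k_hits(sim, ["Q1", "Q2"], ["A", "B", "C"],
                  {"A": "MKV", "B": "MSH", "C": "MTT"}, k=2)
for row in rows:
    print(row)
```

Passing the list to pandas.DataFrame(rows) yields the columns query, rank, ref_id, similarity, sequence in that order. Rows come out grouped by query and sorted by rank, which matters later if anything refers to rows by position.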

Exercise 4: Explore Your Hits

Goal: Understand what proteins you found.

Before decoding the message, explore your results:

  1. For each query, what's the #1 hit? Look it up on UniProt.
  2. Learn something about both the query protein and its top hit — what do they do? Are they related in some way (function, structure, evolutionary history)?
  3. Do any queries give similar hit lists? Why might that be?

Write 2-3 sentences about patterns you observe.

Hint: UniProt has a REST API if you want to fetch protein names programmatically.
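
As a starting point for that hint, here is a hedged sketch using only the standard library. It assumes the entry endpoint https://rest.uniprot.org/uniprotkb/<accession>.json and that the recommended name sits under proteinDescription → recommendedName → fullName → value; verify both against a live response before relying on them. The function names are made up for illustration.

```python
import json
import urllib.request

UNIPROT_ENTRY_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def entry_url(accession):
    """Build the URL for one UniProtKB entry in JSON format."""
    return UNIPROT_ENTRY_URL.format(accession=accession)


def fetch_entry(accession, timeout=30):
    """Download one UniProtKB entry as a dict (requires network access)."""
    with urllib.request.urlopen(entry_url(accession), timeout=timeout) as resp:
        return json.load(resp)


def protein_name(entry):
    """Return the recommended full name from an entry dict, or None if absent."""
    # Assumed layout: proteinDescription -> recommendedName -> fullName -> value
    desc = entry.get("proteinDescription", {})
    return desc.get("recommendedName", {}).get("fullName", {}).get("value")
```

In a notebook you could then call protein_name(fetch_entry(ref_id)) for each #1 hit. Entries without a recommended name return None under this sketch, so handle that case before printing.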


Exercise 5: Decode the Secret Message

Goal: Use the provided coordinates to reveal a hidden message.

Download secret_coordinates.json. It contains a list of (row_index, position) pairs. JSON has no tuple type, so each pair loads in Python as a two-element list. Each pair tells you:

  • Which row in your DataFrame (df_top_hits.iloc[row_index])
  • Which position in that protein's sequence

Extract the amino acid at each position. Concatenate them. Spaces are marked as ("SPACE", -1).

Success check:

  • You decode a readable English message (with some quirky spelling — amino acids don't cover the whole alphabet!)
  • The message is the take-home message of this midterm. If you remember one thing from this route, it should be this.
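
The decoding logic can be sketched on toy data like this. Everything here is made up (the sequences, the coordinates, and the name decode_message), and the sketch assumes that position is a 0-based index into the sequence. If your decoded text looks like noise, that assumption and your row order are the first things to check.

```python
import json


def decode_message(coordinates, sequences):
    """Turn [row_index, position] pairs into text; ["SPACE", -1] becomes a space."""
    chars = []
    for row_index, position in coordinates:
        if row_index == "SPACE":
            chars.append(" ")
        else:
            # Assumption: position is a 0-based index into the sequence.
            chars.append(sequences[row_index][position])
    return "".join(chars)


# Toy data: made-up sequences and coordinates, not the real assignment files.
toy_sequences = ["MHKL", "GIVE", "ATRP"]
toy_coordinates = json.loads(
    '[[0, 1], [1, 1], ["SPACE", -1], [2, 0], [0, 3], [0, 3]]'
)
print(decode_message(toy_coordinates, toy_sequences))  # HI ALL
```

With your real results, the list of sequences would come from the sequence column of your DataFrame, taken in row order.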

Exercise 6: Reflection

Goal: Connect the message to what you did.

Answer in your notebook (2-3 sentences each):

  1. What does the decoded message mean? Explain it in your own words.

  2. You found similar proteins without BLAST. What did you use instead, and why does it work?


Deliverables

Submit your completed notebook (.ipynb) with:

  1. All code cells executed
  2. The decoded secret message clearly displayed
  3. Your reflection answers in markdown cells
  4. A Logbook section at the end with [LOGBOOK] entries — short notes about your thinking process

Reminder: We've switched to including [LOGBOOK] entries directly in each route's notebook rather than in a separate file. Add your logbook entries as markdown cells at the end of this notebook.

Submission

Submit your notebook here
