
Vector Spaces & Projections

Route ID: R015 • Wall: W05 • Released: Feb 2, 2026


Route 015: Vector Spaces & Projections

  • RouteID: 015
  • Wall: Protein Representations (W05)
  • Grade: 5.9
  • Routesetter: Adrian
  • Date: 01/29/2026
  • Dataset: E. coli protein embeddings (per-protein.h5 from UniProt, downloaded in R013)
  • Prerequisite: R013 — The UniProt Topo Guide (you need the E. coli embeddings file)

Why this route exists

In R013, you explored UniProt, navigated reference proteomes, and at the end downloaded protein embeddings for a real organism. Each protein in that file is represented as a high-dimensional vector — a list of numbers that captures learned patterns from millions of protein sequences.

But what can you actually do with a list of numbers? Before you can use embeddings for classification, clustering, or prediction, you need to be comfortable with the basic operations: measuring how similar two vectors are, finding neighbors, and visualizing high-dimensional data in 2D.

In this route, you will take those real protein embeddings and practice the exact operations that underlie modern computational biology workflows: computing distances, searching for similar proteins, projecting into 2D, and building interactive visualizations.

What you'll be able to do after this route

By the end of this route, you will be able to:

  • Load protein embeddings from an HDF5 file into Python
  • Write a function that computes distance or similarity between two protein vectors
  • Use that function to search for the most similar protein in a dataset
  • Project high-dimensional protein embeddings into 2D using PCA, t-SNE, or UMAP
  • Create interactive scatter plots with hover labels using Plotly
  • Reason about whether neighbors in 2D match neighbors in high dimensions

Key definitions

Vector A list of numbers representing an object's features. In this route, each protein is a vector of ~1024 numbers (its ProtT5 embedding from UniProt) that encodes sequence patterns learned by the model.

Cosine similarity A measure of how similar two vectors are based on the angle between them. Ranges from -1 (opposite) to 1 (identical direction). Ignores magnitude. Widely used for comparing embeddings.

Dimensionality reduction The process of projecting high-dimensional vectors (e.g., 1024 dimensions) into a lower-dimensional space (e.g., 2D) while trying to preserve structure. Common methods: PCA, t-SNE, UMAP.

Interactive plot A visualization where you can hover, zoom, or click to explore data points. Libraries like Plotly and Bokeh let you build these in Python.

Exercise 0: Load the E. coli Protein Embeddings

Goal: Load the protein embeddings you downloaded in R013 and understand their structure.

  1. Load the E. coli per-protein embeddings file (per-protein.h5) using the h5py library.
  2. Inspect the file: how many proteins are there? What are the keys?
  3. Extract the embedding for one protein and print its shape. How many dimensions does each embedding have?
  4. Collect all protein embeddings into a NumPy array or DataFrame, keeping track of which row corresponds to which protein.

Hint: HDF5 files work like dictionaries. Use h5py.File('per-protein.h5', 'r') to open, then iterate over keys.
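Putting the hint together, here is a minimal loading sketch. The function name `load_embeddings` is illustrative; its `store` argument can be any dict-like mapping of protein IDs to vectors, including an open `h5py.File`:

```python
import numpy as np

def load_embeddings(store):
    """Collect embeddings from a dict-like store (e.g. an open h5py.File)
    into an (n_proteins, n_dims) matrix plus a parallel list of IDs."""
    ids = sorted(store.keys())
    matrix = np.vstack([np.asarray(store[pid]) for pid in ids])
    return ids, matrix

# Typical usage (assumes per-protein.h5 is in the working directory):
# import h5py
# with h5py.File('per-protein.h5', 'r') as f:
#     ids, X = load_embeddings(f)
# print(len(ids), X.shape)  # number of proteins, then (n_proteins, n_dims)
```

Keeping `ids` and the matrix rows in the same order is what lets you map a row index back to a UniProt accession later.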

Note: If you didn't complete R013, you can download the E. coli embeddings directly from UniProt: go to the E. coli proteome, then download the embeddings file (UP000000625_83333/per-protein.h5).

Success check:

  • The file loads without errors.
  • You know how many proteins and how many dimensions each embedding has.
  • You have a matrix where each row is one protein's embedding.

Exercise 1: Write a Distance Function

If you completed R014, you already used cosine similarity to compare toy sequence embeddings. Here you'll do the same thing, but this time you'll write a reusable function and apply it to thousands of real proteins instead of a handful of toy sequences.

Goal: Write a function that computes the similarity or distance between two protein embeddings.

  1. Write a Python function called compute_distance(vec_a, vec_b, metric='cosine') that:
    • Takes two vectors (NumPy arrays) as input
    • Takes a metric parameter specifying which distance to use
    • Supports at least cosine distance and one other metric of your choice (e.g., Euclidean)
    • Returns the computed distance as a float
  2. Pick two proteins from your dataset and compute the distance between their embeddings.
  3. Print the result and verify it makes sense.

Tools:

  • scipy.spatial.distance has implementations of many metrics
  • Or implement cosine similarity from scratch: dot(a, b) / (norm(a) * norm(b))
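One possible shape for this function, implementing both metrics from scratch with NumPy (you could equally delegate to `scipy.spatial.distance`):

```python
import numpy as np

def compute_distance(vec_a, vec_b, metric='cosine'):
    """Distance between two 1-D vectors; 0.0 means identical for both metrics."""
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    if metric == 'cosine':
        # cosine distance = 1 - cosine similarity
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return float(1.0 - sim)
    elif metric == 'euclidean':
        return float(np.linalg.norm(a - b))
    raise ValueError(f"unknown metric: {metric}")
```

Note that it returns cosine *distance*, not similarity, so 0.0 means identical for every supported metric (this is exactly the confusion flagged under "Common fall" below).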

Success check:

  • Your function works for at least two different metrics.
  • You can explain what a cosine distance of 0.0 vs 1.0 means in terms of protein similarity.

Common fall:

  • Confusing cosine similarity (1 = identical) with cosine distance (0 = identical). Make sure you know which one your function returns.

Exercise 2: Find Your Nearest Neighbor Protein

Goal: Use your distance function to find the protein most similar to a chosen protein.

  1. Pick one protein from your dataset (e.g., pick a well-known E. coli protein like dnaK or rpoB, or just pick the first one).
  2. Compute the distance between that protein's embedding and every other protein in the dataset.
  3. Store the results (e.g., in a dictionary or a new DataFrame column).
  4. Find and display the protein with the smallest distance (most similar).
  5. Also display the protein with the largest distance (least similar).

Bonus: Look up both proteins on UniProt. Does the nearest neighbor make biological sense? Do they share a function, family, or domain?
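The steps above can be sketched as one vectorized search. This assumes an `X` matrix and parallel `ids` list as built in Exercise 0; the function name is illustrative:

```python
import numpy as np

def nearest_and_farthest(X, ids, query_idx):
    """Return (most_similar_id, least_similar_id) for the protein at
    query_idx, using cosine distance against every row of X."""
    X = np.asarray(X, dtype=float)
    q = X[query_idx]
    # Vectorized cosine distance to all rows at once
    sims = X @ q / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
    dists = 1.0 - sims
    dists[query_idx] = np.inf    # exclude the query itself from "nearest"
    nearest = int(np.argmin(dists))
    dists[query_idx] = -np.inf   # ...and from "farthest"
    farthest = int(np.argmax(dists))
    return ids[nearest], ids[farthest]
```

Computing all distances in one matrix-vector product avoids a Python loop over thousands of proteins; just remember to mask out the query protein, whose distance to itself is always 0.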

Success check:

  • You computed distances to all other proteins.
  • You identified the most and least similar protein with their distances.
  • You have a hypothesis about whether the nearest neighbor makes biological sense.

Exercise 3: 2D Projection (Static Plot)

If you completed R014, you made a PCA scatter plot of a few toy sequences. Now you'll do the same thing at proteome scale (~4,400 proteins) and try methods beyond PCA.

Goal: Project the high-dimensional protein embeddings into 2D and make a scatter plot.

  1. Use at least one of the following dimensionality reduction methods to project all protein embeddings into 2D:
    • PCA (sklearn.decomposition.PCA)
    • t-SNE (sklearn.manifold.TSNE)
    • UMAP (umap.UMAP) — you may need to pip install umap-learn
  2. Store the 2D coordinates.
  3. Make a static scatter plot (matplotlib or seaborn) of all proteins in 2D space.
  4. Label or highlight at least a few proteins so you can orient yourself.

Note: The E. coli proteome has ~4,400 proteins. t-SNE and UMAP may take a minute to run — that's normal.

Bonus: Try two different methods and compare. Do they give similar arrangements?
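A minimal PCA sketch for step 1, using `sklearn.decomposition.PCA` as listed above (the plotting commands are one plausible way to do steps 3-4, with a hypothetical accession as the highlight):

```python
import numpy as np
from sklearn.decomposition import PCA

def project_2d(X):
    """Project an (n, d) embedding matrix to (n, 2) with PCA."""
    return PCA(n_components=2).fit_transform(np.asarray(X, dtype=float))

# Static plot (assumes ids and X from Exercise 0):
# import matplotlib.pyplot as plt
# coords = project_2d(X)
# plt.scatter(coords[:, 0], coords[:, 1], s=4, alpha=0.5)
# for name in ['P0A6Y8']:          # highlight proteins you care about
#     i = ids.index(name)
#     plt.annotate(name, coords[i])
# plt.show()
```

Swapping in t-SNE or UMAP only changes the first two lines of the function; the plotting code stays the same, which makes the bonus comparison cheap to run.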

Success check:

  • You have a 2D scatter plot with visible clusters or structure.
  • You can point to specific proteins on the plot.

Common fall:

  • Forgetting to install UMAP (pip install umap-learn, not pip install umap).

Exercise 4: Interactive 2D Projection

Goal: Make an interactive version of your scatter plot using Plotly.

You'll need: A table with protein metadata (names, gene names, etc.) for your E. coli proteins. If you don't have one, figure out how to download it from UniProt now — you want the TSV with protein names and annotations for your proteome.

  1. Using the 2D coordinates from Exercise 3, create an interactive scatter plot with Plotly Express (or Bokeh).
  2. Merge your 2D coordinates with the proteome table so each point has metadata attached.
  3. Configure hover text so that when you mouse over a point, it shows the protein name (not just the UniProt ID).
  4. Optionally, color the points by some property (e.g., sequence length, annotation score, or subcellular location).

Tools:

  • plotly.express.scatter with the hover_name or hover_data parameter.

Success check:

  • Hovering over any point reveals the protein name.
  • The plot is zoomable and pannable.
  • You can explore clusters and identify which proteins group together.

Exercise 5: High-D Neighbors vs. 2D Neighbors

Goal: Investigate whether nearest neighbors in high dimensions match nearest neighbors in the 2D projection.

  1. Take the protein you chose in Exercise 2.
  2. Find its nearest neighbor in the 2D projection (smallest Euclidean distance in the projected space).
  3. Compare this to the nearest neighbor you found in Exercise 2 (high-dimensional embedding space).
  4. Are they the same protein? If not, why might they differ?
  5. Write a brief reflection (2-3 sentences) on what this tells you about dimensionality reduction — does it perfectly preserve distances?
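One way to run this comparison: a small helper that finds the nearest row in either space, applied once to the high-D matrix and once to the 2D coordinates (names are illustrative; `X` and `coords` come from Exercises 0 and 3):

```python
import numpy as np

def nearest_index(M, query_idx, metric='euclidean'):
    """Index of the nearest row to M[query_idx], excluding the query itself."""
    M = np.asarray(M, dtype=float)
    q = M[query_idx]
    if metric == 'cosine':
        sims = M @ q / (np.linalg.norm(M, axis=1) * np.linalg.norm(q))
        d = 1.0 - sims
    else:
        d = np.linalg.norm(M - q, axis=1)
    d[query_idx] = np.inf  # never return the query as its own neighbor
    return int(np.argmin(d))

# Assuming X (high-D) and coords (2D projection) use the same row order:
# i = ids.index('your_protein')
# same = nearest_index(X, i, metric='cosine') == nearest_index(coords, i)
```

When the two indices disagree, that is the projection distorting distances, which is precisely what the reflection in step 5 should discuss.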

Success check:

  • You identified the nearest neighbor in both spaces.
  • You have a clear answer: same or different?
  • Your reflection shows understanding that projections can distort distances.

Common fall:

  • Assuming the 2D projection perfectly preserves all relationships. It doesn't — that's the whole point of this exercise.

Deliverables

Submit your Colab/Jupyter notebook (.ipynb) with all exercises completed.

Include a Logbook section at the end of your notebook with [LOGBOOK] entries — short reflections in markdown cells about what you're thinking, what confused you, or what you learned.

Submission

Submit your notebook here

🎉 Route Complete!

Great work!