Midterm Route M2: Proteome Cartography
- RouteID: M002
- Wall: Protein Representations (W05)
- Grade: 5.10c (Midterm)
- Routesetter: Adrian
- Date: 02/04/2026
The Setup
Your PI is interested in conserved protein functions across distant bacterial species. She gives you two organisms:
- E. coli — the workhorse of molecular biology, gram-negative
- Bacillus subtilis — a gram-positive model organism, diverged from E. coli ~2-3 billion years ago
"I want to see which proteins are conserved across these two species. Don't just give me a list — show me a map. I want to see where their proteomes overlap. Oh, and make it interactive if you can — I'm a visual learner, I want to hover over points and explore."
You remember that proteins with similar functions cluster together in embedding space, even across species. Time to make a map.
Your Mission
- Combine protein embeddings from both proteomes
- Project everything into 2D with UMAP
- Build an interactive visualization colored by organism
- Find clusters where both species have proteins — these are conserved functions
- Pick a cluster and investigate what you found
Prerequisites
- R013 (The UniProt Topo Guide) — you need the E. coli embeddings
- R015 (Vector Spaces & Projections) — you should be comfortable with UMAP and interactive plots
Data Files
Download these before starting:
| File | Description | How to get it |
|---|---|---|
| E. coli proteome embeddings | .h5 file | You should already have this or know how to get it |
| E. coli proteome table | Protein names, genes, functions | You should already have this or know how to get it |
| B. subtilis proteome embeddings | .h5 file | Download from UniProt |
| B. subtilis proteome table | Protein names, genes, functions | Download from UniProt (same proteome) |
Note: Both embedding files from UniProt are named per-protein.h5. Rename them after downloading so you don't get confused (e.g., ecoli_embeddings.h5 and bacillus_embeddings.h5).
Note: When downloading the proteome tables, customize the columns to include at least: Entry, Protein names, Gene Names, Organism, Function [CC]. You'll need these for your hover labels and cluster analysis.
Exercise 1: Load Both Proteomes
Goal: Load embeddings for both E. coli and Bacillus subtilis.
- Load both `.h5` embedding files
- Count proteins in each: E. coli should have ~4,400, Bacillus ~4,200
- Verify embedding dimensions match (both should be 1024)
- Keep track of which protein comes from which organism — you'll need this for coloring
Hints:
- Create a list or array that labels each protein as "ecoli" or "bacillus"
- You'll need these labels when you make your plot
Success check:
- You have ~8,600 total protein embeddings
- You know which organism each embedding belongs to
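A minimal sketch of the loading step. It assumes each `.h5` file stores one dataset per protein, keyed by its UniProt ID; that layout is an assumption, so check it with `list(f.keys())[:5]` before relying on it. The helper names `stack_embeddings` and `load_h5` are hypothetical.

```python
import numpy as np

def stack_embeddings(store, organism):
    """Turn a mapping of protein ID -> vector into (ids, matrix, labels).

    `store` can be an open h5py.File or a plain dict; both map keys to arrays.
    """
    ids = list(store.keys())
    # [()] reads an h5py dataset into memory; on a NumPy array it returns the same data
    matrix = np.stack([np.asarray(store[i][()], dtype=np.float32) for i in ids])
    labels = [organism] * len(ids)  # one organism label per row of the matrix
    return ids, matrix, labels

def load_h5(path, organism):
    """Read one embedding file (assumed layout: one dataset per protein)."""
    import h5py  # imported lazily so stack_embeddings also works on plain dicts
    with h5py.File(path, "r") as f:
        return stack_embeddings(f, organism)

# Hypothetical usage with the renamed files:
# ecoli_ids, ecoli_X, ecoli_labels = load_h5("ecoli_embeddings.h5", "ecoli")
# bac_ids, bac_X, bac_labels = load_h5("bacillus_embeddings.h5", "bacillus")
# print(ecoli_X.shape, bac_X.shape)          # protein counts and dimensions
# assert ecoli_X.shape[1] == bac_X.shape[1]  # dimensions must match
```

Keeping the IDs, the matrix, and the labels in the same row order is what lets you color points and look up names later.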
Exercise 2: Combine and Project
Goal: Stack all embeddings and run UMAP to get 2D coordinates.
- Combine E. coli and Bacillus embeddings into one big matrix
- Run UMAP to project from 1024 dimensions to 2D
- Store the 2D coordinates alongside organism labels and protein metadata (from the proteome tables)
⚠️ WARNING — READ THIS ⚠️
UMAP coordinates only make sense relative to what was projected together.
- ✅ CORRECT: Combine all ~8,600 proteins into ONE matrix, run UMAP once
- ❌ WRONG: Run UMAP on E. coli, then run UMAP on Bacillus separately, then try to plot them together
If you do it the wrong way, the coordinates are meaningless — a point at (5, 3) in one UMAP has nothing to do with (5, 3) in the other. You won't see real overlaps, just noise.
Hints:
- Remember to `!pip install umap-learn` at the start of your Colab session (it's not pre-installed)
- Use `np.vstack()` or `np.concatenate()` to combine matrices
- UMAP on ~8,600 proteins takes a minute or two, so take a deep breath, stretch your legs, you've earned it
- If your favorite chatbot has tips for UMAP parameters, ask it
Success check:
- You have 2D coordinates for all ~8,600 proteins
- Shape should be `(~8600, 2)`
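One way to sketch the combine-then-project step is shown here. The function name `project_together` and the `reducer` argument are hypothetical; by default the sketch builds a `umap.UMAP` with `n_components=2` and a fixed `random_state`, and every other UMAP parameter is left at its default rather than tuned.

```python
import numpy as np

def project_together(ecoli_X, bacillus_X, reducer=None):
    """Stack both proteomes and fit ONE 2D projection on the combined matrix."""
    combined = np.vstack([ecoli_X, bacillus_X])  # shape: (n_ecoli + n_bacillus, dim)
    if reducer is None:
        import umap  # provided by the umap-learn package
        # random_state makes the layout reproducible between runs
        reducer = umap.UMAP(n_components=2, random_state=42)
    coords = reducer.fit_transform(combined)  # a single fit over all proteins
    # Row i of coords belongs to row i of combined, so labels follow the same order
    organism = np.array(["ecoli"] * len(ecoli_X) + ["bacillus"] * len(bacillus_X))
    return coords, organism

# Hypothetical usage:
# coords, organism = project_together(ecoli_X, bac_X)
# print(coords.shape)
```

Passing both matrices into one call makes it hard to project the two organisms separately by accident.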
Exercise 3: Interactive Map
Goal: Create an interactive scatter plot colored by organism.
Build a visualization where:
- Each point is a protein
- Color indicates organism (e.g., blue = E. coli, orange = Bacillus)
- Hovering shows the protein's name and function (not just the UniProt ID — that's useless for exploration!)
Hints:
- You'll need to merge your 2D coordinates with the proteome tables to get protein names
- Plotly Express makes the plotting straightforward
- The `color` parameter controls coloring
- `hover_name` or `hover_data` controls what appears on hover
Success check:
- You can see two colors intermingled in some regions, separated in others
- Hovering reveals protein identities
- You can zoom into dense regions
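A possible sketch of the merge and the plot, assuming the proteome tables were exported with the columns named in the Data Files note (`Entry`, `Protein names`, `Function [CC]`). The helper names, the `x`/`y`/`organism` column names, and the table file names in the usage comment are hypothetical.

```python
import numpy as np
import pandas as pd

def build_plot_table(ids, coords, organism, proteome_table):
    """Attach 2D coordinates and organism labels to protein metadata."""
    coords = np.asarray(coords)
    points = pd.DataFrame({
        "Entry": ids,
        "x": coords[:, 0],
        "y": coords[:, 1],
        "organism": organism,
    })
    # Left merge keeps every embedded protein, even if the table has no row for it
    return points.merge(proteome_table, on="Entry", how="left")

def plot_map(table):
    """Interactive scatter colored by organism, with protein names on hover."""
    import plotly.express as px  # lazy import; only needed for plotting
    return px.scatter(
        table, x="x", y="y", color="organism",
        hover_name="Protein names",
        hover_data=["Entry", "Function [CC]"],
        opacity=0.6,
    )

# Hypothetical usage:
# meta = pd.concat([pd.read_csv("ecoli_table.tsv", sep="\t"),
#                   pd.read_csv("bacillus_table.tsv", sep="\t")])
# table = build_plot_table(ecoli_ids + bac_ids, coords, organism, meta)
# plot_map(table).show()
```

If many hover labels come up empty, compare the IDs in the embedding files with the `Entry` column, since a left merge leaves unmatched rows blank instead of dropping them.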
Exercise 4: Find the Overlaps
Goal: Identify regions where both organisms have proteins.
Explore your map and look for clusters where blue and orange points mix together. These are the interesting regions — proteins from distant species that ended up in the same place in embedding space.
- Visually identify 2-3 mixed clusters
- For each cluster, estimate the rough (x, y) boundaries
- Extract the proteins that fall within those boundaries
Hints:
- You can filter by coordinate ranges: `(x > x_min) & (x < x_max) & (y > y_min) & (y < y_max)`
- Aim for clusters with at least 20-30 proteins from each organism
- It's okay if boundaries are approximate
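The coordinate filter from the hints can be sketched as below, assuming a pandas table with hypothetical columns `x`, `y`, `organism`, and `Entry`. The boundary values in the usage comment are placeholders, not real cluster coordinates.

```python
import pandas as pd

def proteins_in_box(table, x_min, x_max, y_min, y_max):
    """Return the rows whose (x, y) fall strictly inside a rectangle."""
    inside = (
        (table["x"] > x_min) & (table["x"] < x_max)
        & (table["y"] > y_min) & (table["y"] < y_max)
    )
    return table[inside]

# Hypothetical boundaries read off the plot:
# cluster = proteins_in_box(table, 4.0, 6.5, -2.0, 0.5)
# print(cluster["organism"].value_counts())  # proteins per organism
# print(cluster["Entry"].tolist())           # UniProt IDs in the cluster
```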
Optional (extra credit): Instead of eyeballing boundaries, you can use a clustering algorithm like DBSCAN or HDBSCAN on your 2D UMAP coordinates to objectively identify clusters. Each protein gets a cluster label, and you just filter by label. Ask your favorite chatbot how to set this up.
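For the optional clustering approach, here is a rough sketch using scikit-learn's `DBSCAN`. The `eps` and `min_samples` values are placeholder assumptions that will need tuning for your own UMAP layout, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def label_clusters(coords, eps=0.5, min_samples=10):
    """Assign a cluster label to every 2D point; -1 marks noise."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.asarray(coords))

def mixed_clusters(labels, organism, min_per_organism=20):
    """Return labels of clusters with enough proteins from every organism."""
    labels = np.asarray(labels)
    organism = np.asarray(organism)
    species = np.unique(organism)
    keep = []
    for lab in np.unique(labels):
        if lab == -1:
            continue  # skip DBSCAN noise points
        members = organism[labels == lab]
        if all((members == s).sum() >= min_per_organism for s in species):
            keep.append(int(lab))
    return keep

# Hypothetical usage:
# table["cluster"] = label_clusters(table[["x", "y"]].to_numpy())
# print(mixed_clusters(table["cluster"], table["organism"]))
```

Clustering on 2D UMAP coordinates depends on the layout UMAP happened to produce, so treat the labels as a convenience for selecting regions, not as a definitive grouping.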
Success check:
- You identified at least one cluster with proteins from both organisms
- You can list the UniProt IDs of proteins in that cluster
Exercise 5: What Did You Find?
Goal: Investigate the proteins in your chosen cluster.
Pick one mixed cluster and dig in:
- How many proteins from each organism?
- Filter your proteome tables to just the proteins in that cluster
- What functions do they have? Do they share a common theme?
- Why might these proteins cluster together despite coming from organisms that diverged billions of years ago?
Hints:
- You already have the protein names and functions in your proteome tables — use them!
- Look at the "Protein names" or "Function" columns for patterns
- Conserved functions often relate to core cellular processes: metabolism, DNA replication, protein synthesis, etc.
Write 3-4 sentences about what you found and why it makes sense evolutionarily.
Success check:
- You identified a shared function or theme in your cluster
- You can explain why these proteins cluster together
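To look for a common theme, one simple option is to tally the words that recur in the protein names of a cluster. This is only a sketch: the function name `common_terms` and its stopword list are hypothetical, and a word count is a starting point for reading the table, not a functional annotation.

```python
import re
from collections import Counter

def common_terms(protein_names, top=10,
                 stopwords=("protein", "and", "the", "putative", "uncharacterized")):
    """Count the most frequent words across a list of protein names."""
    counts = Counter()
    for name in protein_names:
        # Lowercase alphabetic tokens of 3+ letters; each name counts a word once
        words = set(re.findall(r"[a-z]{3,}", str(name).lower()))
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(top)

# Hypothetical usage on a filtered cluster table:
# ecoli_names = cluster.loc[cluster["organism"] == "ecoli", "Protein names"].dropna()
# print(common_terms(ecoli_names))
```

Running it once per organism and comparing the two lists shows quickly whether both species contribute the same kind of protein to the cluster.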
Exercise 6: Reflection
Goal: Connect the map to biological insight.
Answer in your notebook (2-3 sentences each):
- If two proteins from different species end up in the same cluster, what does that suggest about them?
- What did you learn from this exercise about how embeddings capture biological information?
Deliverables
Submit your completed notebook (.ipynb) with:
- All code cells executed
- Your interactive UMAP visualization (or a screenshot if it doesn't render)
- Analysis of at least one mixed cluster
- Reflection answers
- A Logbook section at the end with `[LOGBOOK]` entries
Reminder: We've switched to including [LOGBOOK] entries directly in each route's notebook rather than in a separate file. Add your logbook entries as markdown cells at the end of this notebook.
Submission