Proteome Cartography

Route ID: M002 • Wall: W05 • Released: Feb 4, 2026

5.10c

Midterm Route M2: Proteome Cartography

  • RouteID: M002
  • Wall: Protein Representations (W05)
  • Grade: 5.10c (Midterm)
  • Routesetter: Adrian
  • Date: 02/04/2026

The Setup

Your PI is interested in conserved protein functions across distant bacterial species. She gives you two organisms:

  • E. coli — the workhorse of molecular biology, gram-negative
  • Bacillus subtilis — a gram-positive model organism, diverged from E. coli ~2-3 billion years ago

"I want to see which proteins are conserved across these two species. Don't just give me a list — show me a map. I want to see where their proteomes overlap. Oh, and make it interactive if you can — I'm a visual learner, I want to hover over points and explore."

You remember that proteins with similar functions cluster together in embedding space, even across species. Time to make a map.

Your Mission

  1. Combine protein embeddings from both proteomes
  2. Project everything into 2D with UMAP
  3. Build an interactive visualization colored by organism
  4. Find clusters where both species have proteins — these are conserved functions
  5. Pick a cluster and investigate what you found

Prerequisites

  • R013 (The UniProt Topo Guide) — you need the E. coli embeddings
  • R015 (Vector Spaces & Projections) — you should be comfortable with UMAP and interactive plots

Data Files

Download these before starting:

File | Description | How to get it
E. coli proteome embeddings | .h5 file | You should already have this or know how to get it
E. coli proteome table | Protein names, genes, functions | You should already have this or know how to get it
B. subtilis proteome embeddings | .h5 file | Download
B. subtilis proteome table | Protein names, genes, functions | Download from UniProt (same proteome)

Note: Both embedding files from UniProt are named per-protein.h5. Rename them after downloading so you don't get confused (e.g., ecoli_embeddings.h5 and bacillus_embeddings.h5).

Note: When downloading the proteome tables, customize the columns to include at least: Entry, Protein names, Gene Names, Organism, Function [CC]. You'll need these for your hover labels and cluster analysis.
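
If you want a quick sanity check that a downloaded table has the columns listed above, something like this sketch may help. It assumes you exported the table as tab-separated text; the helper name load_proteome_table and the demo rows are made up for illustration.

```python
import io
import pandas as pd

# Columns the route asks for; names follow the list in the note above.
REQUIRED_COLUMNS = ["Entry", "Protein names", "Gene Names", "Organism", "Function [CC]"]

def load_proteome_table(source):
    """Read a tab-separated proteome table and check the needed columns exist.

    `source` may be a file path or an open file-like object.
    """
    table = pd.read_csv(source, sep="\t")
    missing = [col for col in REQUIRED_COLUMNS if col not in table.columns]
    if missing:
        raise ValueError(f"Table is missing columns: {missing}")
    return table

# Tiny in-memory stand-in for a downloaded file (made-up row, illustration only).
demo_tsv = io.StringIO(
    "Entry\tProtein names\tGene Names\tOrganism\tFunction [CC]\n"
    "X00001\tExample protein A\tgeneA\tExample organism\tFUNCTION: placeholder text.\n"
)
demo_table = load_proteome_table(demo_tsv)
print(demo_table.shape)
```

With real files you would pass the path of each downloaded table instead of the in-memory demo.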


Exercise 1: Load Both Proteomes

Goal: Load embeddings for both E. coli and Bacillus subtilis.

  1. Load both .h5 embedding files
  2. Count proteins in each — E. coli should have ~4,400, Bacillus ~4,200
  3. Verify embedding dimensions match (both should be 1024)
  4. Keep track of which protein comes from which organism — you'll need this for coloring

Hints:

  • Create a list or array that labels each protein as "ecoli" or "bacillus"
  • You'll need these labels when you make your plot

Success check:

  • You have ~8,600 total protein embeddings
  • You know which organism each embedding belongs to
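
One possible way to set this up, as a rough sketch rather than the required solution: it assumes each .h5 file stores one dataset per protein, keyed by UniProt ID, and that you renamed the files as suggested above. The helper names load_embeddings and make_organism_labels are hypothetical.

```python
import numpy as np

def load_embeddings(h5_path):
    """Return (ids, matrix) from an .h5 file holding one dataset per protein.

    Assumes each top-level key is a protein ID whose dataset is a 1D vector.
    """
    import h5py  # imported here so the label helper below works without h5py
    with h5py.File(h5_path, "r") as handle:
        ids = list(handle.keys())
        matrix = np.stack([handle[pid][()] for pid in ids]).astype(np.float32)
    return ids, matrix

def make_organism_labels(n_ecoli, n_bacillus):
    """One label per embedding row, with E. coli rows first."""
    return np.array(["ecoli"] * n_ecoli + ["bacillus"] * n_bacillus)

# Usage with the renamed files (file names are assumptions):
# ecoli_ids, ecoli_X = load_embeddings("ecoli_embeddings.h5")
# bacillus_ids, bacillus_X = load_embeddings("bacillus_embeddings.h5")
# assert ecoli_X.shape[1] == bacillus_X.shape[1] == 1024
# organisms = make_organism_labels(len(ecoli_ids), len(bacillus_ids))
```

Keeping the IDs in the same order as the matrix rows matters later, because that order is the only link between an embedding and its metadata.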

Exercise 2: Combine and Project

Goal: Stack all embeddings and run UMAP to get 2D coordinates.

  1. Combine E. coli and Bacillus embeddings into one big matrix
  2. Run UMAP to project from 1024 dimensions to 2D
  3. Store the 2D coordinates alongside organism labels and protein metadata (from the proteome tables)

⚠️ WARNING — READ THIS ⚠️

UMAP coordinates only make sense relative to what was projected together.

  • CORRECT: Combine all ~8,600 proteins into ONE matrix, run UMAP once
  • WRONG: Run UMAP on E. coli, then run UMAP on Bacillus separately, then try to plot them together

If you do it the wrong way, the coordinates are meaningless — a point at (5, 3) in one UMAP has nothing to do with (5, 3) in the other. You won't see real overlaps, just noise.

Hints:

  • Remember to !pip install umap-learn at the start of your Colab session (it's not pre-installed)
  • np.vstack() or np.concatenate() to combine matrices
  • UMAP on ~8,600 proteins takes a minute or two — take a deep breath, stretch your legs, you've earned it
  • If your favorite chatbot has tips for UMAP parameters, ask it

Success check:

  • You have 2D coordinates for all ~8,600 proteins
  • Shape should be (~8600, 2)
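
A minimal sketch of the "one matrix, one UMAP" rule from the warning above. The helper names combine and project_2d are made up, the UMAP settings are just defaults plus a fixed seed, and you may well want to tune them.

```python
import numpy as np

def combine(ecoli_matrix, bacillus_matrix):
    """Stack both proteomes row-wise, E. coli first, after checking dimensions."""
    if ecoli_matrix.shape[1] != bacillus_matrix.shape[1]:
        raise ValueError("Embedding dimensions differ between the two proteomes")
    return np.vstack([ecoli_matrix, bacillus_matrix])

def project_2d(combined, random_state=42):
    """Run UMAP once on the combined matrix and return an (n, 2) array."""
    import umap  # provided by the umap-learn package
    reducer = umap.UMAP(n_components=2, random_state=random_state)
    return reducer.fit_transform(combined)

# Usage, assuming ecoli_X and bacillus_X are the two embedding matrices:
# combined = combine(ecoli_X, bacillus_X)
# coords = project_2d(combined)   # the slow step
# print(coords.shape)
```

Because the rows are stacked E. coli first, the organism labels from Exercise 1 line up with the rows of the output as long as they were built in the same order.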

Exercise 3: Interactive Map

Goal: Create an interactive scatter plot colored by organism.

Build a visualization where:

  • Each point is a protein
  • Color indicates organism (e.g., blue = E. coli, orange = Bacillus)
  • Hovering shows the protein's name and function (not just the UniProt ID — that's useless for exploration!)

Hints:

  • You'll need to merge your 2D coordinates with the proteome tables to get protein names
  • Plotly Express makes the plotting straightforward
  • The color parameter controls coloring
  • hover_name or hover_data controls what appears on hover

Success check:

  • You can see two colors intermingled in some regions, separated in others
  • Hovering reveals protein identities
  • You can zoom into dense regions
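
Here is one way the merge and the plot could fit together, offered as a sketch. It assumes your proteome tables use the column names from the Data Files note and that you concatenated both tables into one; build_plot_frame and make_map are hypothetical helper names.

```python
import numpy as np
import pandas as pd

def build_plot_frame(ids, organisms, coords, proteome_table):
    """Join 2D coordinates with protein metadata on the UniProt ID ("Entry")."""
    points = pd.DataFrame({
        "Entry": ids,
        "organism": organisms,
        "x": coords[:, 0],
        "y": coords[:, 1],
    })
    # A left merge keeps every embedded protein, even without a table row.
    return points.merge(proteome_table, on="Entry", how="left")

def make_map(frame):
    """Interactive scatter colored by organism, with names shown on hover."""
    import plotly.express as px
    return px.scatter(
        frame, x="x", y="y", color="organism",
        color_discrete_map={"ecoli": "blue", "bacillus": "orange"},
        hover_name="Protein names",
        hover_data=["Entry", "Gene Names", "Function [CC]"],
        opacity=0.6,
    )

# Usage, assuming ecoli_table and bacillus_table are the two proteome tables:
# table = pd.concat([ecoli_table, bacillus_table], ignore_index=True)
# frame = build_plot_frame(ecoli_ids + bacillus_ids, organisms, coords, table)
# make_map(frame).show()
```

Proteins missing from the tables will show up with empty hover text, which is a handy hint that an ID did not match.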

Exercise 4: Find the Overlaps

Goal: Identify regions where both organisms have proteins.

Explore your map and look for clusters where blue and orange points mix together. These are the interesting regions — proteins from distant species that ended up in the same place in embedding space.

  1. Visually identify 2-3 mixed clusters
  2. For each cluster, estimate the rough (x, y) boundaries
  3. Extract the proteins that fall within those boundaries

Hints:

  • You can filter by coordinate ranges: (x > x_min) & (x < x_max) & (y > y_min) & (y < y_max)
  • Aim for clusters with at least 20-30 proteins from each organism
  • It's okay if boundaries are approximate
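
The coordinate filter from the hints could look roughly like this. It assumes a DataFrame with x, y, and organism columns; the helper name proteins_in_box and the demo points are made up for illustration.

```python
import pandas as pd

def proteins_in_box(frame, x_min, x_max, y_min, y_max):
    """Return the rows whose (x, y) fall strictly inside the rectangle."""
    inside = (
        (frame["x"] > x_min) & (frame["x"] < x_max)
        & (frame["y"] > y_min) & (frame["y"] < y_max)
    )
    return frame[inside]

# Made-up points standing in for real UMAP coordinates.
demo = pd.DataFrame({
    "Entry": ["P1", "P2", "P3", "P4"],
    "organism": ["ecoli", "bacillus", "ecoli", "bacillus"],
    "x": [1.0, 1.5, 8.0, 1.2],
    "y": [1.0, 0.5, 8.0, 9.0],
})
cluster = proteins_in_box(demo, x_min=0, x_max=2, y_min=0, y_max=2)
print(cluster["organism"].value_counts())
```

Checking the per-organism counts right away tells you whether the box you picked is really a mixed region.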

Optional (extra credit): Instead of eyeballing boundaries, you can use a clustering algorithm like DBSCAN or HDBSCAN on your 2D UMAP coordinates to objectively identify clusters. Each protein gets a cluster label, and you just filter by label. Ask your favorite chatbot how to set this up.
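
If you try the extra credit, a sketch along these lines might be a starting point. It uses scikit-learn's DBSCAN; the eps and min_samples values are guesses you would need to tune, and add_cluster_labels and mixed_clusters are hypothetical helper names.

```python
import pandas as pd

def add_cluster_labels(frame, eps=0.5, min_samples=10):
    """Label each point with a DBSCAN cluster id computed on the 2D coordinates."""
    from sklearn.cluster import DBSCAN
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        frame[["x", "y"]].to_numpy()
    )
    return frame.assign(cluster=labels)

def mixed_clusters(frame, min_per_organism=20):
    """Per-cluster organism counts, keeping clusters well populated by every organism."""
    counts = pd.crosstab(frame["cluster"], frame["organism"])
    counts = counts.drop(index=-1, errors="ignore")  # -1 is DBSCAN's noise label
    if counts.shape[1] < 2:
        return counts.iloc[0:0]  # only one organism present, so nothing is mixed
    keep = (counts >= min_per_organism).all(axis=1)
    return counts[keep]

# Usage, assuming `frame` has x, y, and organism columns:
# labeled = add_cluster_labels(frame)
# print(mixed_clusters(labeled))
```

The default threshold of 20 mirrors the "at least 20-30 proteins from each organism" hint above.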

Success check:

  • You identified at least one cluster with proteins from both organisms
  • You can list the UniProt IDs of proteins in that cluster

Exercise 5: What Did You Find?

Goal: Investigate the proteins in your chosen cluster.

Pick one mixed cluster and dig in:

  1. How many proteins from each organism?
  2. Filter your proteome tables to just the proteins in that cluster
  3. What functions do they have? Do they share a common theme?
  4. Why might these proteins cluster together despite coming from organisms that diverged billions of years ago?

Hints:

  • You already have the protein names and functions in your proteome tables — use them!
  • Look at the "Protein names" or "Function [CC]" columns for patterns
  • Conserved functions often relate to core cellular processes: metabolism, DNA replication, protein synthesis, etc.

Write 3-4 sentences about what you found and why it makes sense evolutionarily.

Success check:

  • You identified a shared function or theme in your cluster
  • You can explain why these proteins cluster together
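
To get a first feel for a cluster's theme, you could count proteins per organism and tally the most common words in the protein names. This is only a rough sketch: summarize_cluster is a made-up helper, and the demo rows are placeholders rather than real results.

```python
from collections import Counter
import re
import pandas as pd

def summarize_cluster(cluster, top_n=10):
    """Return (proteins per organism, most common words in the protein names)."""
    per_organism = cluster["organism"].value_counts().to_dict()
    words = Counter()
    for name in cluster["Protein names"].dropna():
        # Keep lowercase words longer than three letters to skip short filler.
        words.update(w for w in re.findall(r"[a-z]+", name.lower()) if len(w) > 3)
    return per_organism, words.most_common(top_n)

# Made-up rows for illustration only.
demo_cluster = pd.DataFrame({
    "organism": ["ecoli", "bacillus", "ecoli"],
    "Protein names": ["Ribosomal protein A", "Ribosomal protein B", None],
})
per_organism, top_words = summarize_cluster(demo_cluster)
print(per_organism)
print(top_words)
```

Word counts are a crude signal, so read the actual function descriptions before you settle on a theme.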

Exercise 6: Reflection

Goal: Connect the map to biological insight.

Answer in your notebook (2-3 sentences each):

  1. If two proteins from different species end up in the same cluster, what does that suggest about them?

  2. What did you learn from this exercise about how embeddings capture biological information?


Deliverables

Submit your completed notebook (.ipynb) with:

  1. All code cells executed
  2. Your interactive UMAP visualization (or a screenshot if it doesn't render)
  3. Analysis of at least one mixed cluster
  4. Reflection answers
  5. A Logbook section at the end with [LOGBOOK] entries

Reminder: We've switched to including [LOGBOOK] entries directly in each route's notebook rather than in a separate file. Add your logbook entries as markdown cells at the end of this notebook.

Submission

Submit your notebook here
