Route 016: BLAST vs Embeddings — Finding Protein Homologs
- RouteID: 016
- Wall: Protein Representations (W05)
- Grade: 5.10a
- Routesetter: Adrian
- Date: 02/02/2026
- Dataset: Human query proteins + E. coli proteome embeddings
Why this route exists
BLAST finds similar proteins by aligning sequences. Embedding similarity finds similar proteins by comparing learned vector representations. Do they find the same hits?
In this route, you'll take a few human proteins — some with well-known E. coli homologs, some without — and search for matches in the E. coli proteome using both methods. You'll compare the results and see where they agree and disagree.
Query Proteins
| Protein Name | Description | Expected E. coli Homolog? | Human UniProt ID | E. coli UniProt ID |
|---|---|---|---|---|
| GAPDH | Glycolytic enzyme, highly conserved across all life | Yes (gapA) | P04406 | |
| Thioredoxin | Small redox protein, ancient and universal | Yes (trxA) | P10599 | |
| Hemoglobin β | Oxygen transport in blood, eukaryote-specific | No | P68871 | — |
| p53 | Tumor suppressor, eukaryote-specific | No | P04637 | — |
Prerequisites
- R013 (The UniProt Topo Guide) — you should have the E. coli proteome data and embeddings file
- R015 (Vector Spaces & Projections) — you should be comfortable with cosine similarity and working with embeddings
What you'll be able to do after this route
By the end, you can:
- Run a BLAST search against a proteome (or interpret pre-computed results)
- Search for similar proteins using embedding cosine similarity
- Compare ranked hit lists from two different methods
- Identify where sequence similarity and embedding similarity disagree
- Reason about what each method captures and when to use which
Key definitions
BLAST (Basic Local Alignment Search Tool) An algorithm that finds regions of local similarity between sequences. It compares your query sequence to a database and returns hits ranked by alignment score. Sequence-based: it only scores how well residues align (identical or chemically similar amino acids), with no knowledge of structure or function.
Homolog A protein related by evolutionary descent. Orthologs are homologs in different species that evolved from a common ancestor (often similar function). Paralogs are homologs within the same species from gene duplication.
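E-value BLAST's measure of statistical significance: the number of hits with a score at least this good that you would expect to find by chance in a database of this size. Lower = more significant. An E-value of 1e-50 means you would expect about 10^-50 such chance hits per search, i.e. effectively none.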
Cosine similarity (embedding-based) A measure of how similar two embedding vectors are, based on the angle between them. Ranges from -1 to 1. Captures patterns learned by the protein language model (PLM), which may include structure, function, or evolutionary relationships beyond raw sequence.
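The two numeric definitions above can be written out explicitly. This is a sketch in standard notation, none of which appears in the route itself: m and n are the (effective) query and database lengths, S is the raw alignment score, S' is the bit score, K and λ are constants of the scoring system, and u and v are two embedding vectors.

```latex
% E-value (Karlin-Altschul statistics): expected number of chance hits scoring >= S
E = K\,m\,n\,e^{-\lambda S} = m\,n\,2^{-S'},
\qquad S' = \frac{\lambda S - \ln K}{\ln 2}

% Cosine similarity between two embedding vectors
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert\,\lVert v \rVert} \in [-1, 1]
```

Note that the E-value grows with database size (n), so the same alignment gets a different E-value against the E. coli proteome than against all of UniProt. Cosine similarity ignores vector length and depends only on direction.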
Exercise 0: Fetch Your Query Proteins
Goal: Fetch the sequences for the four human query proteins using their UniProt IDs.
- Write a function that takes a UniProt ID and returns the amino acid sequence.
- Use it to fetch all 4 query proteins from the table above.
- Store them in a dictionary: {uniprot_id: sequence}
- Print the length of each sequence to verify.
Hints:
- UniProt has a REST API. The FASTA for any protein is at: https://rest.uniprot.org/uniprotkb/{ID}.fasta
- Python's urllib.request.urlopen() can fetch URLs
- BioPython's SeqIO.read() can parse FASTA format
- Or just do it manually: click the links, copy the sequences. No shame in that.
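One possible way to follow the hints above, using only the standard library. This is a minimal sketch, not the reference solution: the function names `parse_fasta_sequence`, `fetch_sequence`, and `fetch_all` are made up for illustration, and the parser assumes each response holds exactly one FASTA record.

```python
from urllib.request import urlopen

# URL pattern taken from the hint above
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{ID}.fasta"


def parse_fasta_sequence(fasta_text):
    """Return the sequence from a single-record FASTA string (header dropped)."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def fetch_sequence(uniprot_id, timeout=30):
    """Download one UniProt entry as FASTA and return its amino acid sequence."""
    url = UNIPROT_FASTA_URL.format(ID=uniprot_id)
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta_sequence(response.read().decode("utf-8"))


def fetch_all(uniprot_ids):
    """Build the {uniprot_id: sequence} dictionary the exercise asks for."""
    return {uid: fetch_sequence(uid) for uid in uniprot_ids}


# Usage (needs network access):
# sequences = fetch_all(["P04406", "P10599", "P68871", "P04637"])
# for uid, seq in sequences.items():
#     print(uid, len(seq))
```

Keeping the parsing separate from the download makes it easy to check the parser on a hand-written FASTA string before hitting the network.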
Success check:
- You have all 4 sequences loaded
- GAPDH should be ~335 aa, Thioredoxin ~105 aa, Hemoglobin β ~147 aa, p53 ~393 aa
Exercise 1: BLAST Search
Goal: Find E. coli proteins similar to your query proteins using local BLAST.
We'll run BLAST locally instead of using the slow web API. This takes seconds instead of minutes.
Step 1: Install BLAST+ command-line tools
!apt-get install -y ncbi-blast+
Step 2: Prepare your files
You need:
- query_proteins.fasta: your 4 human proteins (write them to a FASTA file)
- ecoli.fasta: the E. coli proteome (you downloaded this in R013)
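Writing the query file can be sketched as follows, assuming your sequences are in the {uniprot_id: sequence} dictionary from Exercise 0. The helper name `write_fasta` and the 60-character line width are illustrative choices, not requirements of BLAST.

```python
def write_fasta(sequences, path, width=60):
    """Write {id: sequence} to a FASTA file, wrapping sequence lines at `width`."""
    with open(path, "w") as handle:
        for seq_id, seq in sequences.items():
            handle.write(f">{seq_id}\n")
            # Slice the sequence into fixed-width lines
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")


# Usage:
# write_fasta(sequences, "query_proteins.fasta")
```

Using the bare UniProt ID as the FASTA header means `record.query` in the parsed BLAST output will be that same ID, which makes it simple to match results back to your dictionary.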
Step 3: Build a local BLAST database
!makeblastdb -in ecoli.fasta -dbtype prot -out ecoli_db
Step 4: Run BLAST
!blastp -query query_proteins.fasta -db ecoli_db -out blast_results.xml -outfmt 5 -evalue 1e-5
Step 5: Parse the results
from Bio.Blast import NCBIXML

with open("blast_results.xml") as f:
    blast_records = list(NCBIXML.parse(f))  # one record per query protein

for record in blast_records:
    print(f"\n{record.query}")
    for alignment in record.alignments[:10]:  # top 10 hits
        hsp = alignment.hsps[0]
        print(f"  {alignment.hit_def[:50]}")
        print(f"    E-value: {hsp.expect}, Identity: {hsp.identities}/{hsp.align_length}")
- For each query, extract the top 10 hits.
- For each hit, record:
  - UniProt ID (see warning below)
  - E-value (from hsp.expect)
  - Percent identity (compute from hsp.identities / hsp.align_length)
- Store results in a DataFrame or dictionary.
Important — extracting UniProt IDs correctly:
alignment.accession returns a BLAST internal ID, not the UniProt accession. The UniProt ID is buried in alignment.hit_def, which looks like: sp|P0A9B2|GAPDH_ECOLI ...
Extract it like this:
uniprot_id = alignment.hit_def.split("|")[1] # → "P0A9B2"
Make sure your BLAST results use these UniProt IDs — you'll need them to match against the embedding file later.
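The extraction steps above can be collected into one pass over the parsed records. This is a sketch under assumptions: the helper names (`uniprot_id_from_hit_def`, `hit_to_row`, `collect_blast_rows`) and the row layout are made up for illustration, and it only reads the attributes already used in Step 5.

```python
def uniprot_id_from_hit_def(hit_def):
    """Extract the accession from a defline like 'sp|P0A9B2|GAPDH_ECOLI ...'."""
    parts = hit_def.split("|")
    if len(parts) < 3:
        raise ValueError(f"Unexpected hit_def format: {hit_def!r}")
    return parts[1]


def hit_to_row(query, hit_def, evalue, identities, align_length, rank):
    """Turn one BLAST hit into a flat dictionary (one future DataFrame row)."""
    return {
        "query": query,
        "rank": rank,
        "uniprot_id": uniprot_id_from_hit_def(hit_def),
        "evalue": evalue,
        # identities / align_length is a fraction; multiply by 100 for percent
        "pct_identity": 100.0 * identities / align_length,
    }


def collect_blast_rows(blast_records, top_n=10):
    """Collect the top hits of every query, using the first HSP of each hit."""
    rows = []
    for record in blast_records:
        for rank, alignment in enumerate(record.alignments[:top_n], start=1):
            hsp = alignment.hsps[0]
            rows.append(hit_to_row(record.query, alignment.hit_def, hsp.expect,
                                   hsp.identities, hsp.align_length, rank))
    return rows


# Usage: rows = collect_blast_rows(blast_records); pandas.DataFrame(rows)
```

The explicit format check is there so that a database built from non-UniProt headers fails loudly instead of silently producing wrong IDs.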
Success check:
- You have up to 10 BLAST hits for each query protein (with -evalue 1e-5, queries without an E. coli homolog may return fewer hits, or none)
- You can explain what E-value and percent identity mean
Common fall:
- Forgetting to write your query proteins to a FASTA file before running BLAST.
- Using the wrong path to ecoli.fasta.
Exercise 2: Embedding Similarity Search
Goal: Find E. coli proteins similar to your query proteins using embeddings.
- Download the query protein embeddings file.
  - We provide query_proteins.h5 containing ProtT5 embeddings for the 4 human proteins
  - Same model as your E. coli embeddings, so the comparison is apples to apples
  - Download: query_proteins.h5
- Load both embedding files:
  - Query proteins: query_proteins.h5 (4 proteins)
  - E. coli proteome: per-protein.h5 from R013 (~4,400 proteins)
- Compute cosine similarity between each human query protein and ALL E. coli proteins.
- Rank E. coli proteins by similarity and get the top 10 hits for each query.
- Store results in a DataFrame or dictionary, in the same format as your BLAST results.
Tools:
- h5py to load both files
- Your compute_distance function from R015 (or scipy.spatial.distance.cosine, which returns a distance: similarity = 1 - distance)
- Tip: Vectorize! Don't loop over 4,400 proteins one at a time.
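The vectorization tip can be sketched with NumPy: normalize every row once, then a single matrix product gives all query-vs-E. coli similarities. The function names are illustrative, and `load_embeddings` assumes each HDF5 file stores one dataset per protein ID holding a single vector. Check your files' layout before relying on it.

```python
import numpy as np


def load_embeddings(path):
    """Load (ids, matrix) from an HDF5 file; ASSUMES one dataset per protein ID."""
    import h5py  # imported lazily so the rest of the sketch runs without it
    with h5py.File(path, "r") as handle:
        ids = list(handle.keys())
        matrix = np.stack([np.asarray(handle[i][()], dtype=np.float32) for i in ids])
    return ids, matrix


def cosine_similarity_matrix(queries, targets):
    """Cosine similarity between every query row and every target row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return q @ t.T  # shape: (n_queries, n_targets)


def top_k_hits(similarities, target_ids, k=10):
    """For each query row, return [(target_id, similarity), ...] best first."""
    order = np.argsort(-similarities, axis=1)[:, :k]
    return [[(target_ids[j], float(similarities[i, j])) for j in row]
            for i, row in enumerate(order)]


# Usage:
# query_ids, Q = load_embeddings("query_proteins.h5")
# ecoli_ids, E = load_embeddings("per-protein.h5")
# hits = top_k_hits(cosine_similarity_matrix(Q, E), ecoli_ids, k=10)
```

Casting to float32 avoids precision surprises if the stored vectors are half precision.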
Success check:
- You have top 10 embedding hits for each query protein
- You can look up the names of your top hits
Exercise 3: Compare the Hit Lists
Goal: Quantify how much BLAST and embeddings agree.
For each query protein, you now have two ranked lists of E. coli hits:
- Top 10 from BLAST
- Top 10 from embeddings
- Compute the overlap: how many proteins appear in both top-10 lists?
- Compute the Jaccard similarity: |intersection| / |union|
- If hits appear in both lists, compare their ranks. Are the rankings similar or different?
- Create a summary table showing overlap for each query protein.
Bonus: Compute rank correlation (Spearman) for proteins that appear in both lists.
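The comparison metrics above can be sketched in plain Python, assuming each hit list is a ranked list of UniProt IDs (best first, no duplicates). The function names are illustrative. The Spearman helper re-ranks the shared proteins within each list and uses the no-ties formula 1 - 6Σd²/(n(n²-1)); `scipy.stats.spearmanr` on the two rank vectors would be an alternative.

```python
def overlap_and_jaccard(blast_hits, embedding_hits):
    """Return (number of shared proteins, Jaccard similarity) for two hit lists."""
    a, b = set(blast_hits), set(embedding_hits)
    union = a | b
    shared = a & b
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


def spearman_on_shared(blast_hits, embedding_hits):
    """Spearman rho over proteins in both ranked lists (None if fewer than 2)."""
    emb_set = set(embedding_hits)
    shared = [p for p in blast_hits if p in emb_set]  # shared IDs, BLAST order
    n = len(shared)
    if n < 2:
        return None
    # Re-rank inside the shared subset so both rankings run from 1 to n
    blast_rank = {p: r for r, p in enumerate(shared, start=1)}
    emb_shared = [p for p in embedding_hits if p in blast_rank]
    emb_rank = {p: r for r, p in enumerate(emb_shared, start=1)}
    d2 = sum((blast_rank[p] - emb_rank[p]) ** 2 for p in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Usage:
# n_shared, jaccard = overlap_and_jaccard(blast_top10, embedding_top10)
# rho = spearman_on_shared(blast_top10, embedding_top10)
```

With only a handful of shared proteins, rho is very noisy, so treat it as a rough indicator rather than a statistic to test.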
Success check:
- You know the overlap for each query protein (e.g., "7 out of 10 hits are the same")
- You have a sense of whether the methods mostly agree or mostly disagree
Exercise 4: Investigate the Disagreements
Goal: Understand why BLAST and embeddings give different answers.
This is the most interesting part!
- For each query protein, identify:
- BLAST-only hits: proteins in BLAST top 10 but NOT in embedding top 10
- Embedding-only hits: proteins in embedding top 10 but NOT in BLAST top 10
- Look up these disagreement proteins on UniProt. For each one, note:
- Protein name and function
- Any structural or functional annotations
- Answer in your notebook:
- Why might BLAST rank a protein highly that embeddings miss? (Hint: think about sequence conservation)
- Why might embeddings rank a protein highly that BLAST misses? (Hint: think about what PLMs learn)
- Pick one specific disagreement and write 2-3 sentences explaining what you think is happening.
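Splitting the hits into the groups described above is a pair of set differences. A minimal sketch, assuming both inputs are ranked lists of UniProt IDs; the name `split_hits` and the dictionary keys are illustrative.

```python
def split_hits(blast_hits, embedding_hits):
    """Partition two ranked hit lists into shared and method-specific hits."""
    blast_set, emb_set = set(blast_hits), set(embedding_hits)
    return {
        # List comprehensions keep the original rank order within each group
        "both": [p for p in blast_hits if p in emb_set],
        "blast_only": [p for p in blast_hits if p not in emb_set],
        "embedding_only": [p for p in embedding_hits if p not in blast_set],
    }


# Usage:
# groups = split_hits(blast_top10, embedding_top10)
# print(groups["blast_only"], groups["embedding_only"])
```

Keeping rank order is useful here: a BLAST-only hit at rank 2 is a more interesting disagreement than one at rank 10.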
Success check:
- You identified at least 2-3 disagreement proteins
- You have a hypothesis for why each method ranked them differently
Common fall:
- Assuming one method is "right" and the other is "wrong." Both methods are capturing real signal — just different kinds of signal.
Exercise 5: Visualize in Embedding Space
Goal: See where your query proteins and hits land in 2D.
Important: UMAP coordinates are based on relationships between all points. You can't run UMAP on E. coli separately and then "add" the human proteins later — you need to project everything together.
- Combine all embeddings into one matrix:
  - Stack your E. coli embeddings (~4,400 proteins)
  - Add your 4 human query protein embeddings
  - You should have a matrix of shape [~4404, 1024]
- Run UMAP on the combined matrix to get 2D coordinates for everything at once.
- Create a scatter plot showing:
  - All E. coli proteins (gray, small)
  - Your 4 query proteins (large, distinct color, labeled)
  - BLAST top hits (highlighted in one color)
  - Embedding top hits (highlighted in another color)
  - Hits that appear in both (special marker)
- Interpret: Do the BLAST hits and embedding hits cluster near the query protein? Are they in the same region or scattered?
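The stacking and labeling steps above can be sketched as follows. The function names, the category labels, and the UMAP settings (`metric="cosine"`, `random_state=42`) are illustrative assumptions, not the route's required parameters. `project_together` needs the umap-learn package.

```python
import numpy as np


def combine_embeddings(ecoli_matrix, query_matrix):
    """Stack E. coli rows first, then query rows; also return a query-row mask."""
    combined = np.vstack([ecoli_matrix, query_matrix])
    is_query = np.zeros(len(combined), dtype=bool)
    is_query[len(ecoli_matrix):] = True
    return combined, is_query


def hit_categories(ecoli_ids, blast_hits, embedding_hits):
    """Label each E. coli protein: 'both', 'blast', 'embedding', or 'background'."""
    blast_set, emb_set = set(blast_hits), set(embedding_hits)
    labels = []
    for pid in ecoli_ids:
        if pid in blast_set and pid in emb_set:
            labels.append("both")
        elif pid in blast_set:
            labels.append("blast")
        elif pid in emb_set:
            labels.append("embedding")
        else:
            labels.append("background")
    return np.array(labels)


def project_together(combined, random_state=42):
    """Fit UMAP once on ALL points so every coordinate is comparable."""
    import umap  # from the umap-learn package; imported lazily
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=random_state)
    return reducer.fit_transform(combined)


# Usage:
# combined, is_query = combine_embeddings(E, Q)
# coords = project_together(combined)
# labels = hit_categories(ecoli_ids, blast_top10, embedding_top10)
# Plot coords[:len(E)][labels == "background"] in gray first, then each hit
# category, then coords[is_query] on top with large labeled markers.
```

Drawing the background first and the queries last keeps the four query points from being buried under thousands of gray dots.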
Success check:
- You have a 2D visualization with query proteins and both types of hits labeled
- You can see whether the hit sets overlap spatially
Common fall:
- Running UMAP separately on E. coli and human proteins. UMAP captures relationships — if you project them separately, the coordinates are meaningless relative to each other.
Exercise 6: Reflection
Goal: Synthesize what you learned about sequence vs. embedding similarity.
Answer these questions in your notebook (2-3 sentences each):
- When would you trust BLAST over embeddings? Give a specific scenario.
- When might embeddings find something BLAST misses? Give a specific scenario.
- If you were building a tool to find drug targets in a new bacterial genome, would you use BLAST, embeddings, or both? Why?
- What surprised you most about comparing these two methods?
Success check:
- Your answers show understanding of what each method captures
- You're not just saying "embeddings are better" or "BLAST is better" — you understand the tradeoffs
Deliverables
Submit your Colab/Jupyter notebook (.ipynb) with all exercises completed.
Include a Logbook section at the end of your notebook with [LOGBOOK] entries — short reflections in markdown cells about what you're thinking, what confused you, or what you learned.
Submission