Route 016: BLAST vs Embeddings — Finding Protein Homologs

RouteID: 016
Wall: Protein Representations (W05)
Grade: 5.10a
Routesetter: Adrian
Date: 02/02/2026
Dataset: Human query proteins + E. coli proteome embeddings

Why this route exists

BLAST finds similar proteins by aligning sequences. Embedding similarity finds similar proteins by comparing learned vector representations. Do they find the same hits?

In this route, you'll take a few human proteins — some with well-known E. coli homologs, some without — and search for matches in the E. coli proteome using both methods. You'll compare the results and see where they agree and disagree.

Query Proteins

Protein Name	Description	Expected E. coli Homolog?	Human UniProt ID	E. coli UniProt ID
GAPDH	Glycolytic enzyme, highly conserved across all life	Yes (gapA)	P04406
Thioredoxin	Small redox protein, ancient and universal	Yes (trxA)	P10599
Hemoglobin β	Oxygen transport in blood, eukaryote-specific	No	P68871	—
p53	Tumor suppressor, eukaryote-specific	No	P04637	—

Prerequisites

R013 (The UniProt Topo Guide) — you should have the E. coli proteome data and embeddings file
R015 (Vector Spaces & Projections) — you should be comfortable with cosine similarity and working with embeddings

What you'll be able to do after this route

By the end, you can:

Run a BLAST search against a proteome (or interpret pre-computed results)
Search for similar proteins using embedding cosine similarity
Compare ranked hit lists from two different methods
Identify where sequence similarity and embedding similarity disagree
Reason about what each method captures and when to use which

Key definitions

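BLAST (Basic Local Alignment Search Tool) An algorithm that finds regions of local similarity between sequences. It compares your query sequence to a database and returns hits ranked by alignment score. It is purely sequence-based: alignments are scored residue by residue with a substitution matrix (BLOSUM62 by default for blastp), with no structural or functional information.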

Homolog A protein related to another by descent from a common ancestor. Orthologs are homologs that diverged through speciation, so they are found in different species and often keep a similar function. Paralogs are homologs that arose through gene duplication, typically within the same genome.

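E-value BLAST's measure of statistical significance: the number of hits with a score at least this good that you would expect to see by chance when searching a database of this size. Lower = more significant. An E-value of 1e-50 means a chance match this good is expected about once in 10^50 searches of the same database.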
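To get a feel for the scale of E-values, here is a rough sketch based on the standard relation between bit score and E-value, E = m * n * 2^(-S'), where m is the query length, n is the total database length in residues, and S' is the bit score. The function name and the example numbers are illustrative assumptions, and real BLAST uses edge-corrected effective lengths, so its reported values will differ somewhat.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: a 335-residue query against a 1.3-million-residue database.
m, n = 335, 1_300_000
for bits in (30, 50, 100, 200):
    print(f"bit score {bits:>3}: E ~ {expected_chance_hits(bits, m, n):.2e}")
```

Each extra bit of score halves the E-value, which is why strong hits reach values like 1e-50 so quickly.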

Cosine similarity (embedding-based) A measure of how similar two embedding vectors are, based on the angle between them. Ranges from -1 to 1. Captures patterns learned by the PLM — which may include structure, function, or evolutionary relationships beyond raw sequence.

Exercise 0: Fetch Your Query Proteins

Goal: Fetch the sequences for the four human query proteins using their UniProt IDs.

Write a function that takes a UniProt ID and returns the amino acid sequence.
Use it to fetch all 4 query proteins from the table above.
Store them in a dictionary: {uniprot_id: sequence}
Print the length of each sequence to verify.

Hints:

UniProt has a REST API. The FASTA for any protein is at: https://rest.uniprot.org/uniprotkb/{ID}.fasta
Python's urllib.request.urlopen() can fetch URLs
BioPython's SeqIO.read() can parse FASTA format
Or just do it manually — click the links, copy the sequences. No shame in that.
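If you take the programmatic route, one possible shape for it is sketched below. It uses only the standard library and the FASTA URL pattern from the hints; the function names (parse_fasta_sequence, fetch_sequence, fetch_all) are placeholders of ours, not part of any library.

```python
import urllib.request

UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{}.fasta"


def parse_fasta_sequence(fasta_text: str) -> str:
    """Return the sequence from a single-record FASTA string (header dropped)."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def fetch_sequence(uniprot_id: str) -> str:
    """Download one UniProt entry as FASTA and return its amino acid sequence."""
    url = UNIPROT_FASTA_URL.format(uniprot_id)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_fasta_sequence(response.read().decode("utf-8"))


def fetch_all(uniprot_ids):
    """Build the {uniprot_id: sequence} dictionary the exercise asks for."""
    return {uid: fetch_sequence(uid) for uid in uniprot_ids}
```

Calling fetch_all(["P04406", "P10599", "P68871", "P04637"]) and printing len() of each value gives you the numbers to compare against the success check.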

Success check:

You have all 4 sequences loaded
GAPDH should be ~335 aa, Thioredoxin ~105 aa, Hemoglobin β ~147 aa, p53 ~393 aa

Exercise 1: BLAST Search

Goal: Find E. coli proteins similar to your query proteins using local BLAST.

We'll run BLAST locally instead of using the slow web API. This takes seconds instead of minutes.

Step 1: Install BLAST+ command-line tools

!apt-get install -y ncbi-blast+

Step 2: Prepare your files

You need:

query_proteins.fasta — your 4 human proteins (write them to a FASTA file)
ecoli.fasta — the E. coli proteome (you downloaded this in R013)
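Writing the query file can be as simple as the sketch below, which assumes you have the {uniprot_id: sequence} dictionary from Exercise 0. The helper names are ours. Using the bare UniProt ID as the FASTA header is a choice, not a requirement; it means record.query will later report that ID directly.

```python
def format_fasta(sequences: dict, width: int = 60) -> str:
    """Render {id: sequence} pairs as FASTA text, wrapping sequence lines at `width`."""
    lines = []
    for seq_id, seq in sequences.items():
        lines.append(f">{seq_id}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def write_fasta(sequences: dict, path: str) -> None:
    """Write the records to `path` (for this exercise, query_proteins.fasta)."""
    with open(path, "w") as handle:
        handle.write(format_fasta(sequences))
```

For example, write_fasta(sequences, "query_proteins.fasta") produces the file that the blastp command expects.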

Step 3: Build a local BLAST database

!makeblastdb -in ecoli.fasta -dbtype prot -out ecoli_db

Step 4: Run BLAST

!blastp -query query_proteins.fasta -db ecoli_db -out blast_results.xml -outfmt 5 -evalue 1e-5

Step 5: Parse the results

from Bio.Blast import NCBIXML

with open("blast_results.xml") as f:
    blast_records = list(NCBIXML.parse(f))  # one record per query protein

for record in blast_records:
    print(f"\n{record.query}")
    for alignment in record.alignments[:10]:  # top 10 hits
        hsp = alignment.hsps[0]
        print(f"  {alignment.hit_def[:50]}")
        print(f"    E-value: {hsp.expect}, Identity: {hsp.identities}/{hsp.align_length}")

For each query, extract the top 10 hits.
For each hit, record:
- UniProt ID (see warning below)
- E-value (from hsp.expect)
- Percent identity (compute from hsp.identities / hsp.align_length)
Store results in a DataFrame or dictionary.

Important — extracting UniProt IDs correctly:

alignment.accession returns a BLAST internal ID, not the UniProt accession. The UniProt ID is buried in alignment.hit_def, which holds the original FASTA header and looks like: sp|P0A9B2|G3P1_ECOLI ...

Extract it like this:

uniprot_id = alignment.hit_def.split("|")[1]  # → "P0A9B2"

Make sure your BLAST results use these UniProt IDs — you'll need them to match against the embedding file later.
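One way to turn the parsed records into a table is sketched below. It relies only on the attributes already used in the parsing loop (record.query, record.alignments, alignment.hit_def, hsp.expect, hsp.identities, hsp.align_length); the function names and dictionary keys are our own choices, and only the first HSP of each hit is used, as in the loop above.

```python
def uniprot_id_from_hit_def(hit_def: str) -> str:
    """Pull the accession out of a header of the form 'db|ACCESSION|NAME description'."""
    parts = hit_def.split("|")
    if len(parts) < 3:
        raise ValueError(f"Unexpected hit_def format: {hit_def!r}")
    return parts[1]


def collect_blast_hits(blast_records, top_n: int = 10) -> list:
    """Flatten parsed BLAST records into one row (dict) per hit."""
    rows = []
    for record in blast_records:
        for rank, alignment in enumerate(record.alignments[:top_n], start=1):
            hsp = alignment.hsps[0]  # first HSP reported for this hit
            rows.append({
                "query": record.query,
                "rank": rank,
                "uniprot_id": uniprot_id_from_hit_def(alignment.hit_def),
                "evalue": hsp.expect,
                "pct_identity": 100.0 * hsp.identities / hsp.align_length,
            })
    return rows
```

If you prefer a DataFrame, pandas.DataFrame(collect_blast_hits(blast_records)) converts the rows directly. A query with no hits simply contributes no rows.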

Success check:

You have up to 10 BLAST hits for each query protein (with the 1e-5 E-value cutoff, queries without an E. coli homolog may return few hits or none)
You can explain what E-value and percent identity mean

Common fall:

Forgetting to write your query proteins to a FASTA file before running BLAST.
Using the wrong path to ecoli.fasta.

Exercise 2: Embedding Similarity Search

Goal: Find E. coli proteins similar to your query proteins using embeddings.

Download the query protein embeddings file.
- We provide query_proteins.h5 containing ProtT5 embeddings for the 4 human proteins
- Same model as your E. coli embeddings — apples to apples
- Download: query_proteins.h5
Load both embedding files:
- Query proteins: query_proteins.h5 (4 proteins)
- E. coli proteome: per-protein.h5 from R013 (~4,400 proteins)
Compute cosine similarity between each human query protein and ALL E. coli proteins.
Rank E. coli proteins by similarity and get the top 10 hits for each query.
Store results in a DataFrame or dictionary — same format as your BLAST results.

Tools:

h5py to load both files
Your compute_distance function from R015 (or scipy.spatial.distance.cosine, which returns cosine distance, so similarity = 1 - distance)
Tip: Vectorize! Don't loop over 4,400 proteins one at a time.
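A vectorized search might look like the sketch below. It assumes each HDF5 file stores one 1-D dataset per protein, keyed by protein ID; check your files with list(f.keys()) before relying on that. The function names are ours, and h5py is imported inside the loader so the similarity code can be tried on plain NumPy arrays.

```python
import numpy as np


def load_embeddings(path):
    """Read an HDF5 file with one dataset per protein into (ids, matrix).

    Assumption: every top-level key is a protein ID holding a 1-D vector.
    """
    import h5py  # imported lazily so the rest of the sketch runs without it
    with h5py.File(path, "r") as handle:
        ids = list(handle.keys())
        matrix = np.stack([handle[key][()] for key in ids]).astype(np.float32)
    return ids, matrix


def top_k_cosine(query_vec, target_matrix, target_ids, k=10):
    """Return the k targets most similar to query_vec as (id, similarity) pairs."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    target_unit = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    sims = target_unit @ query_unit    # one matrix-vector product, no Python loop
    order = np.argsort(-sims)[:k]      # indices sorted by decreasing similarity
    return [(target_ids[i], float(sims[i])) for i in order]
```

Normalizing both sides to unit length turns cosine similarity into a plain dot product, so all ~4,400 comparisons for one query happen in a single NumPy call.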

Success check:

You have top 10 embedding hits for each query protein
You can look up the names of your top hits

Exercise 3: Compare the Hit Lists

Goal: Quantify how much BLAST and embeddings agree.

For each query protein, you now have two ranked lists of E. coli hits:

Top 10 from BLAST
Top 10 from embeddings

Compute the overlap: how many proteins appear in both top-10 lists?
Compute the Jaccard similarity: |intersection| / |union|
If hits appear in both lists, compare their ranks. Are the rankings similar or different?
Create a summary table showing overlap for each query protein.

Bonus: Compute rank correlation (Spearman) for proteins that appear in both lists.
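The overlap measures above can be sketched in plain Python as follows. Both functions assume each hit list contains unique IDs in rank order (best first); the names are ours. For the bonus, shared proteins are re-ranked within each list, so there are no ties and the closed-form Spearman formula applies.

```python
def overlap_and_jaccard(list_a, list_b):
    """Return (number of shared items, Jaccard similarity) for two hit lists."""
    set_a, set_b = set(list_a), set(list_b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


def spearman_on_shared(list_a, list_b):
    """Spearman rank correlation restricted to items present in both ranked lists.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)); returns None if fewer than
    two items are shared, since the correlation is undefined there.
    """
    in_a, in_b = set(list_a), set(list_b)
    shared_a = [item for item in list_a if item in in_b]  # shared items, list_a order
    shared_b = [item for item in list_b if item in in_a]  # shared items, list_b order
    n = len(shared_a)
    if n < 2:
        return None
    rank_b = {item: rank for rank, item in enumerate(shared_b)}
    d_squared = sum((rank - rank_b[item]) ** 2 for rank, item in enumerate(shared_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

Keep in mind that a correlation computed on two or three shared proteins is very noisy, so report the number of shared hits alongside it.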

Success check:

You know the overlap for each query protein (e.g., "7 out of 10 hits are the same")
You have a sense of whether the methods mostly agree or mostly disagree

Exercise 4: Investigate the Disagreements

Goal: Understand why BLAST and embeddings give different answers.

This is the most interesting part!

For each query protein, identify:
- BLAST-only hits: proteins in BLAST top 10 but NOT in embedding top 10
- Embedding-only hits: proteins in embedding top 10 but NOT in BLAST top 10
Look up these disagreement proteins on UniProt. For each one, note:
- Protein name and function
- Any structural or functional annotations
Answer in your notebook:
- Why might BLAST rank a protein highly that embeddings miss? (Hint: think about sequence conservation)
- Why might embeddings rank a protein highly that BLAST misses? (Hint: think about what PLMs learn)
Pick one specific disagreement and write 2-3 sentences explaining what you think is happening.
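To get the disagreement sets for one query, a small helper like the following may be enough. It assumes you already have the two top-10 ID lists for that query; the function name and the dictionary keys are our own.

```python
def split_hits(blast_ids, embedding_ids):
    """Partition two top-N ID lists into shared, BLAST-only, and embedding-only."""
    blast_set, emb_set = set(blast_ids), set(embedding_ids)
    return {
        "both": [pid for pid in blast_ids if pid in emb_set],
        "blast_only": [pid for pid in blast_ids if pid not in emb_set],
        "embedding_only": [pid for pid in embedding_ids if pid not in blast_set],
    }
```

List comprehensions are used instead of set differences so that each group keeps its original rank order, which helps when you look the proteins up one by one.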

Success check:

You identified at least 2-3 disagreement proteins
You have a hypothesis for why each method ranked them differently

Common fall:

Assuming one method is "right" and the other is "wrong." Both methods are capturing real signal — just different kinds of signal.

Exercise 5: Visualize in Embedding Space

Goal: See where your query proteins and hits land in 2D.

Important: UMAP coordinates are based on relationships between all points. You can't run UMAP on E. coli separately and then "add" the human proteins later — you need to project everything together.

Combine all embeddings into one matrix:
- Stack your E. coli embeddings (~4,400 proteins)
- Add your 4 human query protein embeddings
- You should have a matrix of shape [~4404, 1024]
Run UMAP on the combined matrix to get 2D coordinates for everything at once.
Create a scatter plot showing:
- All E. coli proteins (gray, small)
- Your 4 query proteins (large, distinct color, labeled)
- BLAST top hits (highlighted in one color)
- Embedding top hits (highlighted in another color)
- Hits that appear in both (special marker)
Interpret: Do the BLAST hits and embedding hits cluster near the query protein? Are they in the same region or scattered?
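The data preparation for this plot could be sketched as below. It assumes you have the E. coli matrix and IDs plus the query matrix as NumPy arrays. The function names and category labels are ours, and metric="cosine" is our choice to match the similarity search, not something this route prescribes. umap (from the umap-learn package) is imported inside the function so the helpers can be tested without it.

```python
import numpy as np


def stack_embeddings(ecoli_matrix, query_matrix):
    """Stack E. coli and query embeddings; return (combined, is_query mask)."""
    combined = np.vstack([ecoli_matrix, query_matrix])
    is_query = np.zeros(len(combined), dtype=bool)
    is_query[len(ecoli_matrix):] = True  # query rows sit at the end
    return combined, is_query


def hit_categories(ecoli_ids, blast_hits, embedding_hits):
    """Label each E. coli protein as 'both', 'blast', 'embedding', or 'background'."""
    blast_set, emb_set = set(blast_hits), set(embedding_hits)
    labels = []
    for pid in ecoli_ids:
        if pid in blast_set and pid in emb_set:
            labels.append("both")
        elif pid in blast_set:
            labels.append("blast")
        elif pid in emb_set:
            labels.append("embedding")
        else:
            labels.append("background")
    return labels


def project_2d(combined, random_state=42):
    """Fit UMAP once on the combined matrix and return 2D coordinates."""
    import umap  # umap-learn package, imported lazily
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=random_state)
    return reducer.fit_transform(combined)
```

With the coordinates from project_2d(combined), the is_query mask and the category labels can be used to pick colors and marker sizes in a matplotlib scatter plot, drawing the gray background points first so the highlights stay visible.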

Success check:

You have a 2D visualization with query proteins and both types of hits labeled
You can see whether the hit sets overlap spatially

Common fall:

Running UMAP separately on E. coli and human proteins. UMAP captures relationships — if you project them separately, the coordinates are meaningless relative to each other.

Exercise 6: Reflection

Goal: Synthesize what you learned about sequence vs. embedding similarity.

Answer these questions in your notebook (2-3 sentences each):

When would you trust BLAST over embeddings? Give a specific scenario.
When might embeddings find something BLAST misses? Give a specific scenario.
If you were building a tool to find drug targets in a new bacterial genome, would you use BLAST, embeddings, or both? Why?
What surprised you most about comparing these two methods?

Success check:

Your answers show understanding of what each method captures
You're not just saying "embeddings are better" or "BLAST is better" — you understand the tradeoffs

Deliverables

Submit your Colab/Jupyter notebook (.ipynb) with all exercises completed.

Include a Logbook section at the end of your notebook with [LOGBOOK] entries — short reflections in markdown cells about what you're thinking, what confused you, or what you learned.

Submission

Submit your notebook here