🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverable
Submit one notebook that includes:
- Loaded the provided Mtb UniProt table with sequence column
- Randomly sampled 10-20 proteins from that table
- Generated residue-level ESM-2 embeddings for the sampled proteins
- Saved protein_embeddings.pkl in this format: {uniprot_id: np.ndarray(L, 1280)}
- Reloaded the file and printed key shape checks
Mission checklist
- Opened a GPU runtime
- Loaded Mtb table and identified ID/sequence columns
- Sampled 10-20 proteins
- Generated residue-level ESM embeddings
- Saved + reloaded protein_embeddings.pkl
- Confirmed embedding width is 1280
Exercise 2: Save, Reload, and Validate
- Save your dictionary to protein_embeddings.pkl
- Reload it with pickle.load(...)
- Print: type(data), number of proteins, first few IDs, one example shape
- Confirm each value is residue-level (L, 1280)
Success target: at least 10 proteins embedded and reload works without errors.
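The save/reload/validate steps above can be sketched end to end. The dictionary below is a stand-in built from random arrays (the IDs PROT_A and PROT_B are placeholders, not real UniProt accessions), so you can verify the pattern before plugging in real embeddings:

```python
import pickle

import numpy as np

# Stand-in embedding dict: {uniprot_id: np.ndarray of shape (L, 1280)}
emb_dict = {
    "PROT_A": np.random.rand(268, 1280).astype(np.float32),
    "PROT_B": np.random.rand(141, 1280).astype(np.float32),
}

# Save
with open("protein_embeddings.pkl", "wb") as f:
    pickle.dump(emb_dict, f)

# Reload
with open("protein_embeddings.pkl", "rb") as f:
    data = pickle.load(f)

# Validate: type, count, first IDs, one example shape, residue-level width
print(type(data))
print(len(data))
print(list(data)[:3])
first_id = next(iter(data))
print(data[first_id].shape)
assert all(v.ndim == 2 and v.shape[1] == 1280 for v in data.values())
```

With your real dictionary, only the first block changes; the reload-and-check code stays the same.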
Exercise 1: Generate ESM Embeddings from Sampled TB Proteins
Use the sampled proteins to run ESM-2 and store residue-level embeddings.
import pickle
import pandas as pd
import torch
import esm
# 1) Load table
# Tip: export your Google Sheet as CSV and point to that local file
table_path = "tb_uniprot_with_sequence.csv"
df = pd.read_csv(table_path)
id_col = next(c for c in ["Entry", "UniProt_ID", "Accession", "accession", "id"] if c in df.columns)
seq_col = next(c for c in ["Sequence", "sequence"] if c in df.columns)
# 2) Sample 10-20 proteins
clean = df.dropna(subset=[id_col, seq_col]).copy()
sample_n = min(15, len(clean))
sampled = clean.sample(n=sample_n, random_state=169)
# 3) Load model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()
if torch.cuda.is_available():
    model = model.cuda()

# 4) Embed
emb_dict = {}
for _, row in sampled.iterrows():
    uid = str(row[id_col]).strip()
    seq = str(row[seq_col]).replace(" ", "").strip()
    if len(seq) < 12:
        continue
    _, _, toks = batch_converter([(uid, seq)])
    if torch.cuda.is_available():
        toks = toks.cuda()
    with torch.no_grad():
        out = model(toks, repr_layers=[33], return_contacts=False)
    reps = out["representations"][33][0]  # [tokens, 1280]
    residue_reps = reps[1:len(seq) + 1].detach().cpu().numpy()  # remove BOS/EOS
    emb_dict[uid] = residue_reps
print({k: v.shape for k, v in list(emb_dict.items())[:5]})
print(f"Embedded proteins: {len(emb_dict)}")
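The BOS/EOS trimming in step 4 is worth internalizing: the batch converter prepends a beginning-of-sequence token and appends an end-of-sequence token, so the representation tensor has len(seq) + 2 rows. A toy NumPy illustration with fake data (no model or GPU needed):

```python
import numpy as np

seq = "MKTAYIAK"  # toy 8-residue sequence
L = len(seq)

# Fake "representations" array shaped like ESM-2 output: [BOS] + residues + [EOS]
reps = np.random.rand(L + 2, 1280)

residue_reps = reps[1:L + 1]  # drop row 0 (BOS) and the last row (EOS)
print(residue_reps.shape)
assert residue_reps.shape == (L, 1280)
```

If your saved arrays come out with L + 2 rows instead of L, this slice is the step that was skipped.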
Exercise 0: Set Up Data, GPU, and Tools
Data and references
- Mtb UniProt table (Entry + Sequence): download
- ESM repository/docs: evolutionaryscale/esm
Runtime requirements
- Use GPU runtime (Colab T4/L4/A100 or equivalent)
- CPU-only runs are typically too slow for this route
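Before installing anything heavier, a quick sanity check confirms a GPU is actually attached. This sketch is wrapped in a try/except since torch may not be installed yet in a fresh runtime:

```python
# Check whether a CUDA GPU is visible to PyTorch; fall back gracefully if torch is absent
try:
    import torch
    cuda_ok = torch.cuda.is_available()
    device_name = torch.cuda.get_device_name(0) if cuda_ok else "CPU only"
except ImportError:
    cuda_ok = False
    device_name = "torch not installed"

print(f"CUDA available: {cuda_ok} ({device_name})")
```

If this prints False on Colab, switch the runtime type to a GPU (Runtime > Change runtime type) before continuing.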
Install
!pip -q install fair-esm pandas
Suggested chatbot assist
"I am doing CHEM 169 Route 36A. I shared the ESM GitHub + README and route instructions. Give me a beginner-friendly Colab plan for GPU setup, loading the TB table, sampling 10-20 proteins, generating residue-level embeddings, and saving protein_embeddings.pkl. Also list common mistakes and quick fixes."
Intro
This route is the first half of the PLM practice final prep.
Goal: learn how to run ESM and build your own residue-level embedding dataset from TB proteins.
In earlier routes you often consumed ready-made embeddings. Here you generate them yourself, validate them, and package them for downstream analysis.
Next route: use this output for cosine-profile and heatmap analysis in Route 36B.
Route 036A: Generate Residue-Level ESM Embeddings
- RouteID: 036A
- Wall: Protein Representations (W05)
- Grade: 5.10a
- Routesetter: Course Staff
- Time: ~45-60 minutes
- You'll need: GPU notebook runtime, Mtb UniProt sequence table, ESM docs, and fair-esm.
🧗 Base Camp
Start here and climb your way up!