
Practice Draft 36A - Generate Residue-Level ESM Embeddings

Route ID: R036A • Wall: W05 • Released: Mar 3, 2026



Deliverable

Submit one notebook that includes:

  1. Loaded the provided Mtb UniProt table with sequence column
  2. Randomly sampled 10-20 proteins from that table
  3. Generated residue-level ESM-2 embeddings for the sampled proteins
  4. Saved protein_embeddings.pkl in this format: {uniprot_id: np.ndarray(L, 1280)}
  5. Reloaded the file and printed key shape checks
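As a minimal sketch of the expected file layout (the IDs and lengths below are made up for illustration, not real Mtb accessions):

```python
import numpy as np

# Toy illustration of the deliverable's dict layout.
# Each key is a UniProt ID; each value is a residue-level
# embedding matrix of shape (L, 1280), where L = sequence length.
protein_embeddings = {
    "P00001": np.zeros((268, 1280), dtype=np.float32),  # 268-residue protein
    "P00002": np.zeros((392, 1280), dtype=np.float32),  # 392-residue protein
}

for uid, arr in protein_embeddings.items():
    print(uid, arr.shape)
```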

Mission checklist

  • Opened a GPU runtime
  • Loaded Mtb table and identified ID/sequence columns
  • Sampled 10-20 proteins
  • Generated residue-level ESM embeddings
  • Saved + reloaded protein_embeddings.pkl
  • Confirmed embedding width is 1280

Exercise 2: Save, Reload, and Validate

  1. Save your dictionary to protein_embeddings.pkl
  2. Reload it with pickle.load(...)
  3. Print:
    • type(data)
    • number of proteins
    • first few IDs
    • one example shape
  4. Confirm each value is residue-level (L, 1280)

Success target: at least 10 proteins embedded and reload works without errors.
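The four steps above can be sketched as follows; the toy dictionary here stands in for the emb_dict built in Exercise 1, and the UniProt ID and length are hypothetical:

```python
import pickle
import numpy as np

# Toy stand-in for the real embedding dict (hypothetical ID and length)
emb_dict = {"P00001": np.zeros((268, 1280), dtype=np.float32)}

# 1) Save the dictionary to protein_embeddings.pkl
with open("protein_embeddings.pkl", "wb") as f:
    pickle.dump(emb_dict, f)

# 2) Reload it
with open("protein_embeddings.pkl", "rb") as f:
    data = pickle.load(f)

# 3) Print key checks
print(type(data))
print("proteins:", len(data))
print("first IDs:", list(data)[:5])
example = next(iter(data.values()))
print("example shape:", example.shape)

# 4) Confirm every value is residue-level: 2-D with width 1280
assert all(v.ndim == 2 and v.shape[1] == 1280 for v in data.values())
```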


Exercise 1: Generate ESM Embeddings from Sampled TB Proteins

Use the sampled proteins to run ESM-2 and store residue-level embeddings.

import pickle
import pandas as pd
import torch
import esm

# 1) Load table
# Tip: export your Google Sheet as CSV and point to that local file
table_path = "tb_uniprot_with_sequence.csv"
df = pd.read_csv(table_path)

# Detect the ID and sequence columns; fail with a clear message if neither is found
id_col = next((c for c in ["Entry", "UniProt_ID", "Accession", "accession", "id"] if c in df.columns), None)
seq_col = next((c for c in ["Sequence", "sequence"] if c in df.columns), None)
assert id_col and seq_col, f"Could not find ID/sequence columns in {list(df.columns)}"

# 2) Sample 10-20 proteins
clean = df.dropna(subset=[id_col, seq_col]).copy()
sample_n = min(15, len(clean))
sampled = clean.sample(n=sample_n, random_state=169)

# 3) Load model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

if torch.cuda.is_available():
    model = model.cuda()

# 4) Embed
emb_dict = {}
for _, row in sampled.iterrows():
    uid = str(row[id_col]).strip()
    seq = str(row[seq_col]).replace(" ", "").strip()
    if len(seq) < 12:
        continue

    _, _, toks = batch_converter([(uid, seq)])
    if torch.cuda.is_available():
        toks = toks.cuda()

    with torch.no_grad():
        out = model(toks, repr_layers=[33], return_contacts=False)

    reps = out["representations"][33][0]                 # [tokens, 1280]
    residue_reps = reps[1:len(seq)+1].detach().cpu().numpy()  # remove BOS/EOS
    emb_dict[uid] = residue_reps

print({k: v.shape for k, v in list(emb_dict.items())[:5]})
print(f"Embedded proteins: {len(emb_dict)}")

Exercise 0: Set Up Data, GPU, and Tools


Runtime requirements

  • Use GPU runtime (Colab T4/L4/A100 or equivalent)
  • CPU-only runs are typically too slow for this route

Install

!pip -q install fair-esm pandas
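A quick sanity check that the runtime really has a GPU before running ESM-2, assuming torch is available in the notebook (Colab ships with it, and fair-esm depends on it):

```python
import torch

# Confirm the notebook actually sees a GPU before loading the model
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected: switch the Colab runtime type to a GPU")
```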

Suggested chatbot assist

"I am doing CHEM 169 Route 36A. I shared the ESM GitHub + README and route instructions. Give me a beginner-friendly Colab plan for GPU setup, loading the TB table, sampling 10-20 proteins, generating residue-level embeddings, and saving protein_embeddings.pkl. Also list common mistakes and quick fixes."


Intro

This route is the first half of the protein language model (PLM) practice final prep.

Goal: learn how to run ESM and build your own residue-level embedding dataset from TB proteins.

In earlier routes you often consumed ready-made embeddings. Here you generate them yourself, validate them, and package them for downstream analysis.

Next route: use this output for cosine-profile and heatmap analysis in Route 36B.


Route 036A: Generate Residue-Level ESM Embeddings

  • Route ID: 036A
  • Wall: Protein Representations (W05)
  • Grade: 5.10a
  • Routesetter: Course Staff
  • Time: ~45-60 minutes
  • You'll need: GPU notebook runtime, Mtb UniProt sequence table, ESM docs, and fair-esm.

🧗 Base Camp

Start here and climb your way up!