
Practice Draft 36A - Generate Residue-Level ESM Embeddings

Route ID: R036A • Wall: W05 • Released: Mar 3, 2026



Deliverable

Submit one notebook that includes:

  1. Loaded the provided Mtb UniProt table with sequence column
  2. Randomly sampled 10-20 proteins from that table
  3. Generated residue-level ESM-2 embeddings for the sampled proteins
  4. Saved protein_embeddings.pkl in this format: {uniprot_id: np.ndarray(L, 1280)}
  5. Reloaded the file and printed key shape checks
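As a minimal sketch of the expected file layout (the IDs and lengths below are made up for illustration, not real Mtb accessions):

```python
import numpy as np

# Toy illustration of the deliverable's dict layout.
# Each key is a UniProt ID; each value is a residue-level
# embedding matrix of shape (L, 1280), where L = sequence length.
protein_embeddings = {
    "P00001": np.zeros((268, 1280), dtype=np.float32),  # 268-residue protein
    "P00002": np.zeros((392, 1280), dtype=np.float32),  # 392-residue protein
}

for uid, arr in protein_embeddings.items():
    print(uid, arr.shape)
```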

Mission checklist

  • Opened a GPU runtime
  • Loaded Mtb table and identified ID/sequence columns
  • Sampled 10-20 proteins
  • Generated residue-level ESM embeddings
  • Saved + reloaded protein_embeddings.pkl
  • Confirmed embedding width is 1280

Exercise 2: Save, Reload, and Validate

  1. Save your dictionary to protein_embeddings.pkl
  2. Reload it with pickle.load(...)
  3. Print:
    • type(data)
    • number of proteins
    • first few IDs
    • one example shape
  4. Confirm each value is residue-level (L, 1280)

Success target: at least 10 proteins embedded and reload works without errors.
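The four steps above can be sketched as follows; the toy dictionary here stands in for the emb_dict built in Exercise 1, and the UniProt ID and length are hypothetical:

```python
import pickle
import numpy as np

# Toy stand-in for the real embedding dict (hypothetical ID and length)
emb_dict = {"P00001": np.zeros((268, 1280), dtype=np.float32)}

# 1) Save the dictionary to protein_embeddings.pkl
with open("protein_embeddings.pkl", "wb") as f:
    pickle.dump(emb_dict, f)

# 2) Reload it
with open("protein_embeddings.pkl", "rb") as f:
    data = pickle.load(f)

# 3) Print key checks
print(type(data))
print("proteins:", len(data))
print("first IDs:", list(data)[:5])
example = next(iter(data.values()))
print("example shape:", example.shape)

# 4) Confirm every value is residue-level: 2-D with width 1280
assert all(v.ndim == 2 and v.shape[1] == 1280 for v in data.values())
```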


Exercise 1: Generate ESM Embeddings from Sampled TB Proteins

Use the sampled proteins to run ESM-2 and store residue-level embeddings.

import pickle
import pandas as pd
import torch
import esm

# 1) Load table
# Tip: export your Google Sheet as CSV and point to that local file
table_path = "tb_uniprot_with_sequence.csv"
df = pd.read_csv(table_path)

# Detect the ID and sequence columns; fail with a clear message if neither is found
id_col = next((c for c in ["Entry", "UniProt_ID", "Accession", "accession", "id"] if c in df.columns), None)
seq_col = next((c for c in ["Sequence", "sequence"] if c in df.columns), None)
assert id_col and seq_col, f"Could not find ID/sequence columns in {list(df.columns)}"

# 2) Sample 10-20 proteins
clean = df.dropna(subset=[id_col, seq_col]).copy()
sample_n = min(15, len(clean))
sampled = clean.sample(n=sample_n, random_state=169)

# 3) Load model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

if torch.cuda.is_available():
    model = model.cuda()

# 4) Embed
emb_dict = {}
for _, row in sampled.iterrows():
    uid = str(row[id_col]).strip()
    seq = str(row[seq_col]).replace(" ", "").strip()
    if len(seq) < 12:
        continue

    _, _, toks = batch_converter([(uid, seq)])
    if torch.cuda.is_available():
        toks = toks.cuda()

    with torch.no_grad():
        out = model(toks, repr_layers=[33], return_contacts=False)

    reps = out["representations"][33][0]                 # [tokens, 1280]
    residue_reps = reps[1:len(seq)+1].detach().cpu().numpy()  # remove BOS/EOS
    emb_dict[uid] = residue_reps

print({k: v.shape for k, v in list(emb_dict.items())[:5]})
print(f"Embedded proteins: {len(emb_dict)}")

Exercise 0: Set Up Data, GPU, and Tools


Runtime requirements

  • Use GPU runtime (Colab T4/L4/A100 or equivalent)
  • CPU-only runs are typically too slow for this route

Install

!pip -q install fair-esm pandas
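A quick sanity check that the runtime really has a GPU before running ESM-2, assuming torch is available in the notebook (Colab ships with it, and fair-esm depends on it):

```python
import torch

# Confirm the notebook actually sees a GPU before loading the model
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected: switch the Colab runtime type to a GPU")
```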

Suggested chatbot assist

"I am doing CHEM 169 Route 36A. I shared the ESM GitHub + README and route instructions. Give me a beginner-friendly Colab plan for GPU setup, loading the TB table, sampling 10-20 proteins, generating residue-level embeddings, and saving protein_embeddings.pkl. Also list common mistakes and quick fixes."


Intro

This route is the first half of the protein language model (PLM) practice final prep.

Goal: learn how to run ESM and build your own residue-level embedding dataset from TB proteins.

In earlier routes you often consumed ready-made embeddings. Here you generate them yourself, validate them, and package them for downstream analysis.

Next route: use this output for cosine-profile and heatmap analysis in Route 36B.


Route 036A: Generate Residue-Level ESM Embeddings

  • Route ID: 036A
  • Wall: Protein Representations (W05)
  • Grade: 5.10a
  • Routesetter: Course Staff
  • Time: ~45-60 minutes
  • You'll need: GPU notebook runtime, Mtb UniProt sequence table, ESM docs, and fair-esm.

🧗 Base Camp

Start here and climb your way up!