🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Data loading and exploration (Exercise 1)
- Transformer embedding code (Exercise 2)
- Classifier training on frozen embeddings (Exercise 3)
- AUROC and AU-PRC evaluation on easy and hard negatives (Exercise 4)
- Reflection answers in markdown cells (Exercise 5)
Exercise 5: Reflection
Goal: Think about what you learned and what comes next.
Answer in your notebook (2-3 sentences each):
- How did AUROC compare between easy and hard negatives? Why do you think that is?
- The model was pre-trained on molecular structure, then its frozen embeddings were used to predict binding. What "knowledge" might the pre-training provide?
- We used a simple classification head (a linear layer). What other approaches might you try?
- You now have a baseline. The next routes try different approaches: R031 fine-tunes the transformer, R030 uses contrastive learning. What do you predict will help more, and why?
Exercise 4: Evaluation
Goal: Measure performance with AUROC and AU-PRC on both easy and hard negatives.
Two metrics you should know
AUROC (Area Under ROC Curve): Measures how well the model ranks positives above negatives. 0.5 = random guessing, 1.0 = perfect.
AU-PRC (Area Under Precision-Recall Curve): Like AUROC, but better for imbalanced datasets. When you have many more negatives than positives (common in drug discovery!), AU-PRC gives you a clearer picture of how well you're finding the rare positives.
Compute both metrics
Use roc_auc_score for AUROC and average_precision_score for AU-PRC from sklearn.
Ask your chatbot:
"How do I compute AUROC and AU-PRC using sklearn with cross-validated predictions?"
Confused about evaluation metrics? Climb R017 (Your First Classifier) for the basics, or check out the draft route Beyond Accuracy for a deeper dive into metrics for imbalanced datasets.
Do this for both:
- Easy negatives dataset (X_easy, y_easy)
- Hard negatives dataset (X_hard, y_hard)
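Putting the pieces together, here's a minimal sketch. It uses synthetic stand-in embeddings and labels so it runs on its own; in your notebook, swap in the real X_easy / y_easy (and X_hard / y_hard) from Exercise 3:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic stand-ins so this snippet is self-contained;
# replace with your real embeddings and labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 32)) + y[:, None] * 0.5  # positives shifted slightly

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Out-of-fold predicted probabilities for the positive class
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)[:, 1]

auroc = roc_auc_score(y, probs)
auprc = average_precision_score(y, probs)
print(f"AUROC: {auroc:.3f}  AU-PRC: {auprc:.3f}")
```

Using cross_val_predict gives you one out-of-fold probability per molecule, so each metric is computed once over the whole dataset rather than averaged across folds.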
Questions:
- Which has higher AUROC — easy or hard negatives? What about AU-PRC?
- Why does this make sense given what you know about how the negatives were constructed?
- Why might AU-PRC tell a different story than AUROC for imbalanced data?
Success check:
- AUROC and AU-PRC computed for both easy and hard negatives
- You can explain the difference between easy and hard
- You understand when to use AU-PRC vs AUROC
Exercise 3: Training the Classifier
Goal: Train a classifier on frozen transformer embeddings.
The approach: Frozen embeddings
We'll use a frozen embeddings approach:
- Extract embeddings for all molecules using the pre-trained transformer (done in Exercise 2)
- Train a simple classifier on top of those embeddings
```
SMILES → Transformer → [CLS] embedding → Classifier → P(binder)
         |______________________________|  |_______________|
                        ↓                          ↓
              Frozen (pre-trained)           What you train
              Extract once, reuse       Learns to predict binding
```
The transformer weights stay fixed — we're just using it as a feature extractor. This is fast, memory-efficient, and works great on a laptop or free Colab.
Note: An alternative is to fine-tune the transformer end-to-end, updating both the transformer and classifier weights together. This can give better performance but requires a GPU and more memory — too compute-intensive for this class, but worth exploring in future projects!
Extract all embeddings
First, embed all your molecules. This might take a few minutes:
```python
from tqdm import tqdm
import numpy as np

def embed_dataset(smiles_list, batch_size=32):
    embeddings = []
    for i in tqdm(range(0, len(smiles_list), batch_size)):
        batch = smiles_list[i:i+batch_size]
        emb = get_embedding(batch)  # your function from Exercise 2
        # .cpu() is a no-op on CPU, but needed if you moved the model to a GPU
        embeddings.append(emb.cpu().numpy())
    return np.vstack(embeddings)

X_easy = embed_dataset(df_easy['smiles'].tolist())
y_easy = df_easy['label'].values
```
Train a classifier with cross-validation
Train a simple classifier (like LogisticRegression) on your embeddings using 5-fold cross-validation.
Ask your chatbot:
"How do I train a LogisticRegression classifier with StratifiedKFold cross-validation in sklearn?"
Feeling lost? If sklearn classifiers and cross-validation are new to you, the doctor prescribes climbing R017 (Your First Classifier) first. It covers the fundamentals you'll need here.
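If you want to check your chatbot's answer against a working sketch, here's one way to do it. The data here is synthetic so the snippet runs standalone; use your real embeddings in the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins; swap in your real X_easy / y_easy embeddings.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 32)) + y[:, None] * 0.5

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"Fold AUROCs: {np.round(scores, 3)}  mean: {scores.mean():.3f}")
```

StratifiedKFold preserves the class ratio in every fold, which matters once you work with imbalanced data; max_iter is raised because high-dimensional embeddings can need more iterations to converge.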
Questions:
- Why do we use StratifiedKFold instead of regular KFold? (Hint: think about class balance)
- How long did it take to embed all molecules? How long to train the classifier?
Success check:
- You have embeddings for all molecules in both datasets
- Classifier trains without errors
- Ready to evaluate in Exercise 4
Exercise 2: Getting Molecular Embeddings
Goal: Load a pre-trained molecular transformer and extract embeddings.
The big picture
Right now, your molecules are just text strings (SMILES). But ML models don't understand text — they need numbers. An embedding is a way to represent a molecule as a vector of numbers (like 768 floats) that captures its "meaning." Similar molecules should have similar embeddings.
You could train a model from scratch to learn these embeddings, but that takes massive amounts of data. Instead, we'll use a pre-trained model — someone else already trained a transformer on millions of molecules, and we can reuse what it learned.
Using Google Colab? Connect to a GPU for faster embedding extraction: Runtime → Change runtime type → Hardware accelerator → GPU (T4)
What is HuggingFace?
HuggingFace is like GitHub for ML models. Researchers upload pre-trained models, and you can download and use them with a few lines of code. Browse models at huggingface.co/models.
To use HuggingFace models, you need the transformers library:
pip install transformers
Finding a molecular transformer
Ask your chatbot:
"What pre-trained molecular transformer models are available on HuggingFace that can embed SMILES strings? How do I choose one?"
Good options include ChemBERTa, MolBERT, or similar. These were trained on large chemical datasets to understand molecular structure.
For this route, seyonec/ChemBERTa-zinc-base-v1 is a solid starting choice — it's small, fast, and well-documented. But this is an active research area! Newer models like DeepChem/ChemBERTa-77M-MLM may perform better. Feel free to explore and compare — part of ML is learning to evaluate different pre-trained models for your task.
Loading the model and getting embeddings
```python
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "your-chosen-model"  # e.g., "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout for deterministic embeddings

def get_embedding(smiles):
    inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Extract the [CLS] token embedding (first position in the sequence)
    embedding = outputs.last_hidden_state[:, 0, :]
    return embedding
```
Test it:
- Embed a few SMILES from your dataset
- Check the embedding dimension
- Verify it runs without errors
Ask your chatbot:
"What is the [CLS] token in a transformer model, and why do we use it for classification?"
Hint: If you've used protein language models (PLMs), you might be used to mean-pooling residue embeddings to get a single protein vector. The [CLS] token is a similar idea — it's a special token trained to summarize the whole sequence. Both approaches give you one vector per molecule; [CLS] is just the BERT-style convention.
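The two pooling strategies are easy to see side by side. This toy sketch uses a random tensor in place of a real model output (no download needed) just to show the shapes involved:

```python
import torch

# Pretend last_hidden_state of shape (batch, seq_len, hidden_size)
hidden = torch.randn(2, 10, 768)

cls_vec = hidden[:, 0, :]       # BERT-style: take the first ([CLS]) token
mean_vec = hidden.mean(dim=1)   # PLM-style: average over all tokens

print(cls_vec.shape, mean_vec.shape)  # both torch.Size([2, 768])
```

Note that in practice, mean pooling should exclude padding tokens (weight the average by the attention mask); the plain mean here is just for illustration.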
Questions:
- What's the embedding dimension? (Hint: check embedding.shape)
- How long does it take to embed one molecule? (This matters for training speed)
- Why do we use torch.no_grad() when getting embeddings?
Success check:
- You can embed any SMILES string
- You understand what embeddings are and why we need them
- You can explain what the [CLS] token represents
Exercise 1: Load and Explore the Data
Goal: Understand what you're working with.
Download the data
Save the file as paired_positives.parquet in your working directory.
Load the parquet file
```python
import pandas as pd

df = pd.read_parquet("paired_positives.parquet")
print(df.shape)
print(df.columns)
df.head()
```
Explore the structure
The dataset contains molecules tested against MAPK14 (a kinase target). Each row is a paired triplet:
- positive: A molecule that did bind to the target (a "hit")
- easy: A molecule that didn't bind, and is structurally different from the positive
- hard: A molecule that didn't bind, but is structurally similar to the positive (shares 2 of 3 building blocks)
Why this structure? Karen organized the data this way because it's perfect for the contrastive learning route (R030) that comes next. Each positive is already paired with matched negatives — one that's obviously different (easy) and one that's deceptively similar (hard). For now, we'll flatten this into a standard classification format, but you'll see the paired structure pay off later.
Convert to classification format
For classification, we need long format with smiles and label columns:
```python
# Create easy negatives dataset
df_easy = pd.concat([
    df[['positive']].rename(columns={'positive': 'smiles'}).assign(label=1),
    df[['easy']].rename(columns={'easy': 'smiles'}).assign(label=0)
]).reset_index(drop=True)

# Create hard negatives dataset
df_hard = pd.concat([
    df[['positive']].rename(columns={'positive': 'smiles'}).assign(label=1),
    df[['hard']].rename(columns={'hard': 'smiles'}).assign(label=0)
]).reset_index(drop=True)

print(f"Easy dataset: {len(df_easy)} molecules")
print(f"Hard dataset: {len(df_hard)} molecules")
```
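You can sanity-check the class balance with value_counts. This sketch rebuilds df_easy from a toy paired frame so it runs standalone; on the real data, just run the last line on your own DataFrames:

```python
import pandas as pd

# Toy paired frame standing in for the real dataset
df = pd.DataFrame({
    "positive": ["CCO", "CCN", "CCC"],
    "easy":     ["c1ccccc1", "CCCl", "CCBr"],
})
df_easy = pd.concat([
    df[["positive"]].rename(columns={"positive": "smiles"}).assign(label=1),
    df[["easy"]].rename(columns={"easy": "smiles"}).assign(label=0),
]).reset_index(drop=True)

# One negative per positive gives a perfectly balanced dataset
print(df_easy["label"].value_counts(normalize=True))
```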
Questions:
- How many rows are in the original paired dataset? How many molecules total after converting to long format?
- What's the class balance in each dataset?
- Look at a few SMILES strings — can you visually tell binders from non-binders? (Spoiler: probably not)
- Why do you think Karen paired each positive with exactly one easy and one hard negative?
Success check:
- Data loads without errors
- You have two DataFrames: df_easy and df_hard
- You understand the difference between paired format and classification format
Background: What is DEL?
A DNA-Encoded Library (DEL) is a powerful drug discovery technology. Instead of testing molecules one by one, you attach a unique DNA barcode to each molecule, mix millions of them together, and expose them to a protein target. Molecules that bind stick around; non-binders wash away. Then you sequence the DNA to see which molecules were enriched.
The result is a massive dataset — hundreds of millions of molecules with enrichment scores indicating how likely they are to bind. But DEL data is noisy. A molecule might show up enriched by chance, or a true binder might be missed due to experimental variation.
The ML challenge: Can you train a model to distinguish real binders from noise? And can you generalize to molecules you've never seen?
Before you start
Ask your chatbot to explain:
- "What is a DNA-Encoded Library and how does affinity selection work?"
- "What does 'enrichment' mean in DEL screening? How is it calculated?"
- "What is MAPK14 (p38α) and why is it an important drug target?"
This background will help you understand why the data looks the way it does.
Why this route exists
Before you can improve something, you need a baseline.
In this route, you'll use a pre-trained molecular transformer to classify binders vs non-binders. This is the simplest approach — use frozen embeddings with a lightweight classifier on top.
The next routes try more sophisticated approaches: R031 fine-tunes the transformer end-to-end, R030 uses contrastive learning. But to know if those help, you need to know how well the simple baseline works first.
You'll also see something important: the difference between easy and hard negatives. Easy negatives are molecules that look nothing like binders — any decent model can tell them apart. Hard negatives are structurally similar to binders but don't actually bind. That's where the real challenge is.
What you'll be able to do after this route
By the end, you can:
- Load and use a pre-trained molecular transformer from HuggingFace
- Use frozen embeddings for binary classification
- Run 5-fold cross-validation
- Evaluate with AUROC and AU-PRC
- Compare performance on easy vs hard negatives
- Explain why hard negatives are... harder
Key definitions
Molecular Transformer A transformer model pre-trained on molecular data (like SMILES strings). It learns representations of molecular structure that can be fine-tuned for downstream tasks.
AUROC (Area Under ROC Curve) A metric for binary classification that measures how well the model ranks positives above negatives. 0.5 = random guessing, 1.0 = perfect separation.
AU-PRC (Area Under Precision-Recall Curve) Similar to AUROC, but more informative for imbalanced datasets. Better at revealing how well you're finding rare positives among many negatives.
Frozen Embeddings Using a pre-trained model as a fixed feature extractor. You extract embeddings once, then train a simple classifier on top. Fast and memory-efficient.
Cross-validation Splitting data into K folds, training on K-1, evaluating on 1, and rotating. Gives more robust performance estimates than a single train/test split.
Dataset credit: Karen Pu pre-processed this dataset from the KinDEL benchmark (Chen et al., 2025). The original KinDEL dataset is much larger — she downsampled using cluster-aware sampling to preserve chemical diversity.
Route 028: Your First Molecular Transformer
- RouteID: 028
- Wall: The DEL Wall (W08)
- Grade: 5.10a
- Routesetters: Karen + Adrian
- Time: ~1 hour
- Dataset: MAPK14 from KinDEL benchmark (Chen et al., 2025)
- Prerequisites: Hello PyTorch, comfortable with HuggingFace basics
UNDER CONSTRUCTION
This route is being actively built. The dataset link and some details may change. Check back for updates!
🧗 Base Camp
Start here and climb your way up!