🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Data loading and exploration (Exercise 1)
- Transformer embedding code (Exercise 2)
- Classifier training on frozen embeddings (Exercise 3)
- AUROC and AU-PRC evaluation on easy and hard negatives (Exercise 4)
- Reflection answers in markdown cells (Exercise 5)
Exercise 5: Reflection
Goal: Think about what you learned and what comes next.
Answer in your notebook (2-3 sentences each):
- How did AUROC compare between easy and hard negatives? Why do you think that is?
- The model was pre-trained on molecular structure, then its frozen embeddings were used to predict binding. What "knowledge" might the pre-training provide?
- We used a simple classification head (a linear layer). What other approaches might you try?
- You now have a baseline. The next routes try different approaches: R031 fine-tunes the transformer, R030 uses contrastive learning. What do you predict will help more, and why?
Exercise 4: Evaluation
Goal: Measure performance with AUROC and AU-PRC on both easy and hard negatives.
Two metrics you should know
AUROC (Area Under ROC Curve): Measures how well the model ranks positives above negatives. 0.5 = random guessing, 1.0 = perfect.
AU-PRC (Area Under Precision-Recall Curve): Like AUROC, but better for imbalanced datasets. When you have many more negatives than positives (common in drug discovery!), AU-PRC gives you a clearer picture of how well you're finding the rare positives.
Compute both metrics
Use roc_auc_score for AUROC and average_precision_score for AU-PRC from sklearn.
Ask your chatbot:
"How do I compute AUROC and AU-PRC using sklearn with cross-validated predictions?"
Confused about evaluation metrics? Climb R017 (Your First Classifier) for the basics, or check out the draft route Beyond Accuracy for a deeper dive into metrics for imbalanced datasets.
Do this for both:
- Easy negatives dataset (X_easy, y_easy)
- Hard negatives dataset (X_hard, y_hard)
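Putting the pieces together, here's a minimal sketch. It uses synthetic stand-in embeddings and labels so it runs on its own; in your notebook, swap in the real X_easy / y_easy (and X_hard / y_hard) from Exercise 3:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic stand-ins so this snippet is self-contained;
# replace with your real embeddings and labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 32)) + y[:, None] * 0.5  # positives shifted slightly

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Out-of-fold predicted probabilities for the positive class
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)[:, 1]

auroc = roc_auc_score(y, probs)
auprc = average_precision_score(y, probs)
print(f"AUROC: {auroc:.3f}  AU-PRC: {auprc:.3f}")
```

Using cross_val_predict gives you one out-of-fold probability per molecule, so each metric is computed once over the whole dataset rather than averaged across folds.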
Questions:
- Which has higher AUROC — easy or hard negatives? What about AU-PRC?
- Why does this make sense given what you know about how the negatives were constructed?
- Why might AU-PRC tell a different story than AUROC for imbalanced data?
Success check:
- AUROC and AU-PRC computed for both easy and hard negatives
- You can explain the difference between easy and hard
- You understand when to use AU-PRC vs AUROC
Exercise 3: Training the Classifier
Goal: Train a classifier on frozen transformer embeddings.
The approach: Frozen embeddings
We'll use a frozen embeddings approach:
- Extract embeddings for all molecules using the pre-trained transformer (done in Exercise 2)
- Train a simple classifier on top of those embeddings
```
SMILES → Transformer → [CLS] embedding → Classifier → P(binder)
         |______________________________|  |_______________|
                        ↓                          ↓
              Frozen (pre-trained)           What you train
              Extract once, reuse       Learns to predict binding
```
The transformer weights stay fixed — we're just using it as a feature extractor. This is fast, memory-efficient, and works great on a laptop or free Colab.
Note: An alternative is to fine-tune the transformer end-to-end, updating both the transformer and classifier weights together. This can give better performance but requires a GPU and more memory — too compute-intensive for this class, but worth exploring in future projects!
Extract all embeddings
First, embed all your molecules. This might take a few minutes:
```python
from tqdm import tqdm
import numpy as np

def embed_dataset(smiles_list, batch_size=32):
    embeddings = []
    for i in tqdm(range(0, len(smiles_list), batch_size)):
        batch = smiles_list[i:i+batch_size]
        emb = get_embedding(batch)  # your function from Exercise 2
        # .cpu() is a no-op on CPU, but needed if you moved the model to a GPU
        embeddings.append(emb.cpu().numpy())
    return np.vstack(embeddings)

X_easy = embed_dataset(df_easy['smiles'].tolist())
y_easy = df_easy['label'].values
```
Train a classifier with cross-validation
Train a simple classifier (like LogisticRegression) on your embeddings using 5-fold cross-validation.
Ask your chatbot:
"How do I train a LogisticRegression classifier with StratifiedKFold cross-validation in sklearn?"
Feeling lost? If sklearn classifiers and cross-validation are new to you, the doctor prescribes climbing R017 (Your First Classifier) first. It covers the fundamentals you'll need here.
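If you want to check your chatbot's answer against a working sketch, here's one way to do it. The data here is synthetic so the snippet runs standalone; use your real embeddings in the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins; swap in your real X_easy / y_easy embeddings.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 32)) + y[:, None] * 0.5

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"Fold AUROCs: {np.round(scores, 3)}  mean: {scores.mean():.3f}")
```

StratifiedKFold preserves the class ratio in every fold, which matters once you work with imbalanced data; max_iter is raised because high-dimensional embeddings can need more iterations to converge.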
Questions:
- Why do we use StratifiedKFold instead of regular KFold? (Hint: think about class balance)
- How long did it take to embed all molecules? How long to train the classifier?
Success check:
- You have embeddings for all molecules in both datasets
- Classifier trains without errors
- Ready to evaluate in Exercise 4
Exercise 2: Getting Molecular Embeddings
Goal: Load a pre-trained molecular transformer and extract embeddings.
The big picture
Right now, your molecules are just text strings (SMILES). But ML models don't understand text — they need numbers. An embedding is a way to represent a molecule as a vector of numbers (like 768 floats) that captures its "meaning." Similar molecules should have similar embeddings.
You could train a model from scratch to learn these embeddings, but that takes massive amounts of data. Instead, we'll use a pre-trained model — someone else already trained a transformer on millions of molecules, and we can reuse what it learned.
Using Google Colab? Connect to a GPU for faster embedding extraction: Runtime → Change runtime type → Hardware accelerator → GPU (T4)
What is HuggingFace?
HuggingFace is like GitHub for ML models. Researchers upload pre-trained models, and you can download and use them with a few lines of code. Browse models at huggingface.co/models.
To use HuggingFace models, you need the transformers library:
pip install transformers
Finding a molecular transformer
Ask your chatbot:
"What pre-trained molecular transformer models are available on HuggingFace that can embed SMILES strings? How do I choose one?"
Good options include ChemBERTa, MolBERT, or similar. These were trained on large chemical datasets to understand molecular structure.
For this route, seyonec/ChemBERTa-zinc-base-v1 is a solid starting choice — it's small, fast, and well-documented. But this is an active research area! Newer models like DeepChem/ChemBERTa-77M-MLM may perform better. Feel free to explore and compare — part of ML is learning to evaluate different pre-trained models for your task.
Loading the model and getting embeddings
```python
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "your-chosen-model"  # e.g., "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout for deterministic embeddings

def get_embedding(smiles):
    inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Extract the [CLS] token embedding (first position in the sequence)
    embedding = outputs.last_hidden_state[:, 0, :]
    return embedding
```
Test it:
- Embed a few SMILES from your dataset
- Check the embedding dimension
- Verify it runs without errors
Ask your chatbot:
"What is the [CLS] token in a transformer model, and why do we use it for classification?"
Hint: If you've used protein language models (PLMs), you might be used to mean-pooling residue embeddings to get a single protein vector. The [CLS] token is a similar idea — it's a special token trained to summarize the whole sequence. Both approaches give you one vector per molecule; [CLS] is just the BERT-style convention.
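The two pooling strategies are easy to see side by side. This toy sketch uses a random tensor in place of a real model output (no download needed) just to show the shapes involved:

```python
import torch

# Pretend last_hidden_state of shape (batch, seq_len, hidden_size)
hidden = torch.randn(2, 10, 768)

cls_vec = hidden[:, 0, :]       # BERT-style: take the first ([CLS]) token
mean_vec = hidden.mean(dim=1)   # PLM-style: average over all tokens

print(cls_vec.shape, mean_vec.shape)  # both torch.Size([2, 768])
```

Note that in practice, mean pooling should exclude padding tokens (weight the average by the attention mask); the plain mean here is just for illustration.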
Questions:
- What's the embedding dimension? (Hint: check embedding.shape)
- How long does it take to embed one molecule? (This matters for training speed)
- Why do we use torch.no_grad() when getting embeddings?
Success check:
- You can embed any SMILES string
- You understand what embeddings are and why we need them
- You can explain what the [CLS] token represents
Exercise 1: Load and Explore the Data
Goal: Understand what you're working with.
Download the data
Save the file as paired_positives.parquet in your working directory.
Load the parquet file
```python
import pandas as pd

df = pd.read_parquet("paired_positives.parquet")
print(df.shape)
print(df.columns)
df.head()
```
Explore the structure
The dataset contains molecules tested against MAPK14 (a kinase target). Each row is a paired triplet:
- positive: A molecule that did bind to the target (a "hit")
- easy: A molecule that didn't bind, and is structurally different from the positive
- hard: A molecule that didn't bind, but is structurally similar to the positive (shares 2 of 3 building blocks)
Why this structure? Karen organized the data this way because it's perfect for the contrastive learning route (R030) that comes next. Each positive is already paired with matched negatives — one that's obviously different (easy) and one that's deceptively similar (hard). For now, we'll flatten this into a standard classification format, but you'll see the paired structure pay off later.
Convert to classification format
For classification, we need long format with smiles and label columns:
```python
# Create easy negatives dataset
df_easy = pd.concat([
    df[['positive']].rename(columns={'positive': 'smiles'}).assign(label=1),
    df[['easy']].rename(columns={'easy': 'smiles'}).assign(label=0)
]).reset_index(drop=True)

# Create hard negatives dataset
df_hard = pd.concat([
    df[['positive']].rename(columns={'positive': 'smiles'}).assign(label=1),
    df[['hard']].rename(columns={'hard': 'smiles'}).assign(label=0)
]).reset_index(drop=True)

print(f"Easy dataset: {len(df_easy)} molecules")
print(f"Hard dataset: {len(df_hard)} molecules")
```
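You can sanity-check the class balance with value_counts. This sketch rebuilds df_easy from a toy paired frame so it runs standalone; on the real data, just run the last line on your own DataFrames:

```python
import pandas as pd

# Toy paired frame standing in for the real dataset
df = pd.DataFrame({
    "positive": ["CCO", "CCN", "CCC"],
    "easy":     ["c1ccccc1", "CCCl", "CCBr"],
})
df_easy = pd.concat([
    df[["positive"]].rename(columns={"positive": "smiles"}).assign(label=1),
    df[["easy"]].rename(columns={"easy": "smiles"}).assign(label=0),
]).reset_index(drop=True)

# One negative per positive gives a perfectly balanced dataset
print(df_easy["label"].value_counts(normalize=True))
```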
Questions:
- How many rows are in the original paired dataset? How many molecules total after converting to long format?
- What's the class balance in each dataset?
- Look at a few SMILES strings — can you visually tell binders from non-binders? (Spoiler: probably not)
- Why do you think Karen paired each positive with exactly one easy and one hard negative?
Success check:
- Data loads without errors
- You have two DataFrames: df_easy and df_hard
- You understand the difference between paired format and classification format
Background: What is DEL?
A DNA-Encoded Library (DEL) is a powerful drug discovery technology. Instead of testing molecules one by one, you attach a unique DNA barcode to each molecule, mix millions of them together, and expose them to a protein target. Molecules that bind stick around; non-binders wash away. Then you sequence the DNA to see which molecules were enriched.
The result is a massive dataset — hundreds of millions of molecules with enrichment scores indicating how likely they are to bind. But DEL data is noisy. A molecule might show up enriched by chance, or a true binder might be missed due to experimental variation.
The ML challenge: Can you train a model to distinguish real binders from noise? And can you generalize to molecules you've never seen?
Before you start
Ask your chatbot to explain:
- "What is a DNA-Encoded Library and how does affinity selection work?"
- "What does 'enrichment' mean in DEL screening? How is it calculated?"
- "What is MAPK14 (p38α) and why is it an important drug target?"
This background will help you understand why the data looks the way it does.
Why this route exists
Before you can improve something, you need a baseline.
In this route, you'll use a pre-trained molecular transformer to classify binders vs non-binders. This is the simplest approach — use frozen embeddings with a lightweight classifier on top.
The next routes try more sophisticated approaches: R031 fine-tunes the transformer end-to-end, R030 uses contrastive learning. But to know if those help, you need to know how well the simple baseline works first.
You'll also see something important: the difference between easy and hard negatives. Easy negatives are molecules that look nothing like binders — any decent model can tell them apart. Hard negatives are structurally similar to binders but don't actually bind. That's where the real challenge is.
What you'll be able to do after this route
By the end, you can:
- Load and use a pre-trained molecular transformer from HuggingFace
- Use frozen embeddings for binary classification
- Run 5-fold cross-validation
- Evaluate with AUROC and AU-PRC
- Compare performance on easy vs hard negatives
- Explain why hard negatives are... harder
Key definitions
Molecular Transformer A transformer model pre-trained on molecular data (like SMILES strings). It learns representations of molecular structure that can be fine-tuned for downstream tasks.
AUROC (Area Under ROC Curve) A metric for binary classification that measures how well the model ranks positives above negatives. 0.5 = random guessing, 1.0 = perfect separation.
AU-PRC (Area Under Precision-Recall Curve) Similar to AUROC, but more informative for imbalanced datasets. Better at revealing how well you're finding rare positives among many negatives.
Frozen Embeddings Using a pre-trained model as a fixed feature extractor. You extract embeddings once, then train a simple classifier on top. Fast and memory-efficient.
Cross-validation Splitting data into K folds, training on K-1, evaluating on 1, and rotating. Gives more robust performance estimates than a single train/test split.
Dataset credit: Karen Pu pre-processed this dataset from the KinDEL benchmark (Chen et al., 2025). The original KinDEL dataset is much larger — she downsampled using cluster-aware sampling to preserve chemical diversity.
Route 028: Your First Molecular Transformer
- RouteID: 028
- Wall: The DEL Wall (W08)
- Grade: 5.10a
- Routesetters: Karen + Adrian
- Time: ~1 hour
- Dataset: MAPK14 from KinDEL benchmark (Chen et al., 2025)
- Prerequisites: Hello PyTorch, comfortable with HuggingFace basics
UNDER CONSTRUCTION
This route is being actively built. The dataset link and some details may change. Check back for updates!
🧗 Base Camp
Start here and climb your way up!