
Three of a Kind

Route ID: R030 • Wall: W08 • Released: Feb 19, 2026

5.11a


Submission

Submit your notebook here


Deliverables

Submit your completed notebook (.ipynb) with:

  1. Baseline UMAP and silhouette score (Exercise 1)
  2. Triplet construction code and example triplets (Exercise 2)
  3. Training loss curve (Exercise 3)
  4. Before/after UMAP + silhouette scores for easy negatives (Exercise 4)
  5. Before/after UMAP + silhouette scores for hard negatives (Exercise 5)
  6. AUROC and AU-PRC comparison to your R028 baseline (Exercise 6)
  7. Reflection answers in markdown cells (Exercise 7)

Exercise 7: Reflection

Goal: Consolidate what you learned.

Answer in your notebook (2-3 sentences each):

  1. Did contrastive learning improve AUROC and AU-PRC compared to your R028 baseline? By how much?

  2. In your own words, what problem does contrastive learning solve that standard classification doesn't?

  3. Why do hard negatives make contrastive learning harder? Is that a bad thing?

  4. A colleague says "my classifier gets 90% accuracy on easy negatives, so the model is good." What would you say to them?

  5. You used silhouette score to quantify cluster quality. Can you think of another metric that would measure whether the learned representations are useful? (Hint: think about downstream tasks.)

  6. A recent paper from Leash Biosciences (Hermes, 2024) explicitly flags using continuous enrichment scores instead of binary labels as a "compelling future direction." Based on what you learned here, why might that matter for contrastive learning?


Exercise 6: The Verdict — Did It Help?

Goal: Compare contrastive learning to your baseline.

In R028, you trained a classifier and got AUROC scores for easy and hard negatives. Now let's see if contrastive learning did better.

Important: If you skipped R028, go back and do it first — you need that baseline to know if contrastive learning actually helped!

Use your fine-tuned embeddings for classification:

Take the contrastively trained embeddings and train a simple classifier on top (logistic regression or a linear layer). Evaluate with AUROC and AU-PRC on the same test sets.
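One way to set up this "linear probe" step, sketched with scikit-learn. The arrays here are random stand-ins — swap in your real train/test embeddings and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Stand-ins for your contrastively trained embeddings (n_samples, dim)
# and binary binder labels — replace with your actual data.
X_train = rng.normal(size=(400, 64))
y_train = rng.integers(0, 2, size=400)
X_test = rng.normal(size=(100, 64))
y_test = rng.integers(0, 2, size=100)

# Simple classifier on top of the (now frozen) fine-tuned embeddings
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

auroc = roc_auc_score(y_test, scores)
auprc = average_precision_score(y_test, scores)
print(f"AUROC: {auroc:.3f}  AU-PRC: {auprc:.3f}")
```

Run this once on the easy-negative test set and once on the hard-negative test set to fill in the comparison table.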

Compare:

| Approach | Easy AUROC | Easy AU-PRC | Hard AUROC | Hard AU-PRC |
| --- | --- | --- | --- | --- |
| R028 Baseline (frozen embeddings) | ??? | ??? | ??? | ??? |
| R031 Fine-tuned (optional) | ??? | ??? | ??? | ??? |
| R030 Contrastive (this route) | ??? | ??? | ??? | ??? |

Questions:

  • Did contrastive learning help? By how much (compare AUROC and AU-PRC)?
  • Did it help more on easy or hard negatives?
  • Do AUROC and AU-PRC tell the same story? If they differ, which do you trust more for this imbalanced dataset?
  • Was the extra complexity (and compute time) worth it?

Success check:

  • You have AUROC and AU-PRC numbers for both approaches
  • You can articulate whether contrastive learning was worth it for this dataset

Exercise 5: The Hard Negative Challenge

Goal: See what happens when negatives are structurally similar to positives.

Now repeat Exercises 2-4 using the hard negatives.

Rebuild your triplets using hard negatives, retrain your model from scratch (reload the pre-trained weights!), and re-visualize.

Important: You're starting fresh — don't fine-tune on top of your easy-negatives model. Reload the original pre-trained weights.

Questions:

  • How does the training loss curve compare to easy negatives?
  • Is the silhouette score improvement larger or smaller than with easy negatives?
  • Why are hard negatives... harder? What does this tell you about what the model needs to learn?
  • In a real drug discovery setting, which scenario (easy or hard) is more realistic?

Success check:

  • Before/after UMAP for hard negatives
  • Silhouette scores for hard negatives (before and after)
  • You can explain why hard negatives are a harder learning problem

Exercise 4: Did It Work? (Easy Negatives)

Goal: Visualize AND quantify the new embedding space.

Re-embed the same 500 molecules from Exercise 1 using your fine-tuned model. Run UMAP again and make a new scatter plot.

Put the two plots side by side — before and after fine-tuning.

Now compute the silhouette score again. Compare it to your baseline.
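A minimal plotting sketch for the side-by-side comparison. The `umap_before`, `umap_after`, and `labels` arrays are random stand-ins for your own 2D projections and binder labels:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
# Stand-ins for your 2D UMAP projections before/after fine-tuning
umap_before = rng.normal(size=(500, 2))
umap_after = rng.normal(size=(500, 2))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, coords, title in [(axes[0], umap_before, "Before fine-tuning"),
                          (axes[1], umap_after, "After fine-tuning")]:
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="coolwarm", s=5)
    ax.set_title(title)
fig.savefig("umap_before_after.png", dpi=150)
```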

Questions:

  • Is the visual separation better after fine-tuning?
  • Did the silhouette score improve? By how much?
  • A better-looking UMAP is encouraging, but UMAP can lie. Why is silhouette score a more trustworthy metric?

Success check:

  • Side-by-side before/after UMAP plots
  • Before/after silhouette scores recorded
  • You can articulate whether fine-tuning helped and by how much

Exercise 3: Fine-Tune with Triplet Loss

Goal: Train your molecular encoder to produce better representations using triplet loss.

Compute note: Unlike R028's frozen embeddings, this exercise actually updates the transformer weights. You'll need a GPU — free Colab works, but Colab Pro is faster. Training should take 5-15 minutes depending on your setup.

Connect to a GPU in Colab: Go to Runtime → Change runtime type → Hardware accelerator → GPU (T4)

The idea behind triplet loss is simple: penalize the model when an anchor is closer to a negative than to its positive. PyTorch has a built-in loss function for this.

Ask your chatbot:

"What is PyTorch's triplet loss function called? How do I use it with embeddings?"

You'll need to:

  1. Unfreeze your model so its weights can update
  2. Embed each element of your triplets through the model
  3. Compute the loss and backpropagate
  4. Train for a few epochs (3-5 is enough to see movement)

Use a small learning rate (around 1e-5) — you're fine-tuning a pre-trained model, not training from scratch.
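The four steps above can be sketched as a PyTorch loop. The encoder here is a tiny stand-in MLP on random feature vectors so the snippet runs on its own — in your notebook, swap in your pre-trained transformer and embed the triplet SMILES through it instead:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in encoder — in the route, this is your pre-trained transformer
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
for param in encoder.parameters():
    param.requires_grad = True  # step 1: weights are unfrozen

loss_fn = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-5)  # small LR for fine-tuning

# Stand-in featurized anchors / positives / negatives
anchors = torch.randn(64, 32)
positives = anchors + 0.1 * torch.randn(64, 32)
negatives = torch.randn(64, 32)

losses = []
for epoch in range(5):
    # step 2: embed each element of the triplet
    a, p, n = encoder(anchors), encoder(positives), encoder(negatives)
    loss = loss_fn(a, p, n)  # step 3: compute the loss...
    opt.zero_grad()
    loss.backward()          # ...and backpropagate
    opt.step()               # step 4: update, once per epoch here
    losses.append(loss.item())
print(losses)
```

In practice you'd iterate over mini-batches of triplets within each epoch rather than one big batch; the structure of the loop is the same.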

Plot your training loss over batches or epochs.

Questions:

  • What does it mean when the loss decreases?
  • What is the "margin" parameter in triplet loss? What happens conceptually if margin=0?
  • Why use such a small learning rate for fine-tuning?
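To build intuition for the margin question, here is the triplet loss computed by hand (the distances are made-up numbers, not from any real model):

```python
def triplet_loss(d_ap, d_an, margin):
    # Penalize unless the negative is at least `margin` farther
    # from the anchor than the positive is
    return max(0.0, d_ap - d_an + margin)

# Anchor-positive distance 0.5, anchor-negative distance 0.8:
print(triplet_loss(0.5, 0.8, margin=1.0))  # 0.7 -> still penalized
print(triplet_loss(0.5, 0.8, margin=0.0))  # 0.0 -> "barely closer" already counts
```

With margin=0, any triplet where the positive is even slightly closer contributes zero loss, so the model gets no push to separate the classes further.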

Success check:

  • Training runs without errors
  • Loss decreases over epochs
  • You have a loss curve to include in your submission

Exercise 2: Building Triplets

Goal: Construct the training examples for contrastive learning.

In standard classification (what you did in R028), each training example is (molecule, label). In triplet-based contrastive learning, each training example is (anchor, positive, negative):

  • Anchor: a binder
  • Positive: a different binder
  • Negative: a non-binder

The model will be trained to make the anchor's embedding closer to the positive than to the negative.

Ask your chatbot:

"What are triplets in contrastive learning? How do anchor, positive, and negative relate to each other?"

Good news: The paired dataset already gives you anchors and negatives! You just need to pair each anchor with a different positive:

import random

def build_triplets(df, negative_col='easy', n_triplets=5000):
    """Build triplets from paired dataset.

    Args:
        df: DataFrame with 'positive', 'easy', 'hard' columns
        negative_col: which negative column to use ('easy' or 'hard')
        n_triplets: how many triplets to generate
    """
    triplets = []
    for _ in range(n_triplets):
        # Sample two different rows
        idx1, idx2 = random.sample(range(len(df)), 2)

        anchor = df.iloc[idx1]['positive']
        positive = df.iloc[idx2]['positive']  # different binder
        negative = df.iloc[idx1][negative_col]

        triplets.append((anchor, positive, negative))
    return triplets

Print a few example triplets so you can sanity-check them.
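A quick sanity check on a toy DataFrame (the SMILES strings below are placeholders, not real MAPK14 molecules; the helper is repeated from above so this cell runs on its own):

```python
import random
import pandas as pd

# Same helper as defined above
def build_triplets(df, negative_col='easy', n_triplets=5000):
    triplets = []
    for _ in range(n_triplets):
        idx1, idx2 = random.sample(range(len(df)), 2)
        anchor = df.iloc[idx1]['positive']
        positive = df.iloc[idx2]['positive']  # different binder
        negative = df.iloc[idx1][negative_col]
        triplets.append((anchor, positive, negative))
    return triplets

toy = pd.DataFrame({
    'positive': ['CCO', 'CCN', 'CCC'],       # placeholder binders
    'easy': ['c1ccccc1', 'CC(=O)O', 'CO'],   # placeholder easy negatives
})
random.seed(0)
for triplet in build_triplets(toy, n_triplets=3):
    print(triplet)
```

Each printed tuple should have two different binders followed by the negative paired with the first binder's row.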

Questions:

  • We're sampling positives randomly. Can you think of a smarter strategy — for example, how might you pick the hardest possible positives? Why might that help or hurt?
  • How many unique triplets could you theoretically generate from your dataset? Is 5,000 a lot or a little?
  • Why does the dataset already pair each positive with a specific negative? What advantage does this give over random sampling?

Success check:

  • You have a list of (anchor SMILES, positive SMILES, negative SMILES) triplets
  • You understand what each element represents biologically

Exercise 1: Baseline — See the Embedding Space

Goal: Visualize how your pre-trained molecular encoder represents binders vs non-binders before contrastive training.

You already loaded a molecular transformer in R028. Use the same model here.

Embed a subset — 500 molecules (250 binders, 250 non-binders) is enough for visualization. Use the easy negatives dataset first.

Reduce to 2D with UMAP and make a scatter plot colored by label. Save this plot — you'll need it for comparison later.

Compute a baseline silhouette score on your embeddings (before UMAP). This gives you a number to compare against later.

from sklearn.metrics import silhouette_score

# embeddings: numpy array of shape (n_samples, embedding_dim)
# labels: numpy array of 0s and 1s
score = silhouette_score(embeddings, labels)
print(f"Baseline silhouette score: {score:.3f}")

Feeling lost? If UMAP or sklearn metrics are new to you, consider climbing R017 (Your First Classifier) first. R028 also covers embedding extraction in detail.

Questions:

  • Are binders and non-binders cleanly separated in this space?
  • The model was pre-trained on molecular structure, not on binding. Should we expect it to separate binders from non-binders out of the box?
  • What does it mean if the classes are mixed together in embedding space?
  • What's your baseline silhouette score? (It's probably low — that's expected.)

Success check:

  • You have a UMAP plot of frozen embeddings colored by label
  • You computed and recorded a baseline silhouette score
  • You've saved both for later comparison

Exercise 0: Setup

Goal: Load the MAPK14 dataset and prepare for contrastive learning.

Download the data

Download the dataset here

Save the file as paired_positives.parquet in your working directory.

Load the parquet file

import pandas as pd

df = pd.read_parquet("paired_positives.parquet")
print(df.shape)
print(df.columns)
df.head()

Understand the structure

The dataset contains molecules tested against MAPK14 (a kinase target). Each row is a paired triplet:

  • positive: A molecule that did bind to the target (a "hit")
  • easy: A molecule that didn't bind, and is structurally different from the positive
  • hard: A molecule that didn't bind, but is structurally similar to the positive (shares 2 of 3 building blocks)

Why this structure matters for contrastive learning: This paired format is exactly what you need for building triplets! Each positive already comes with matched negatives. The easy negative is obviously different — any model can tell them apart. The hard negative is the real test: it looks like a binder but isn't. Learning to distinguish hard negatives forces the model to learn subtle, meaningful features.

Create classification datasets (for visualization)

You'll need long-format datasets for UMAP visualization:

# Create easy negatives dataset
df_easy = pd.concat([
    df[['positive']].rename(columns={'positive': 'smiles'}).assign(label=1),
    df[['easy']].rename(columns={'easy': 'smiles'}).assign(label=0)
]).reset_index(drop=True)

# Create hard negatives dataset
df_hard = pd.concat([
    df[['positive']].rename(columns={'positive': 'smiles'}).assign(label=1),
    df[['hard']].rename(columns={'hard': 'smiles'}).assign(label=0)
]).reset_index(drop=True)

print(f"Easy dataset: {len(df_easy)} molecules")
print(f"Hard dataset: {len(df_hard)} molecules")

Load your pre-trained model

Use the same model setup from R028 — load ChemBERTa and your get_embedding() function.

Important: Make sure you're loading the pre-trained weights here, not any fine-tuned weights from R028. We want to start fresh for contrastive learning so we can compare fairly.

Feeling lost? If model loading feels unfamiliar, revisit R028 Exercise 2 for the full setup code.

Success check:

  • Data loaded (paired format + classification format)
  • Pre-trained model loaded
  • get_embedding() function works on a test SMILES

Background: DEL Data and Contrastive Learning

Quick refresher on DEL: DNA-Encoded Libraries let you screen millions of molecules at once against a protein target. The result is enrichment data — signals about which molecules bind. But the signal is noisy, and structurally similar molecules can behave very differently.

In R028, you used standard classification to predict binders. This route tries a different approach: contrastive learning. Instead of learning "is this a binder?", you learn "which molecules are similar to each other?"

Before you start

Ask your chatbot to explain:

  • "What is contrastive learning and how does it differ from classification?"
  • "What is triplet loss and how does the margin parameter work?"
  • "Why might contrastive learning help when you have hard negatives?"

If you skipped R028, go back and do it first — you'll need that baseline to know if contrastive learning actually helps.


Why this route exists

In R028, you trained a classifier the standard way: predict binder vs non-binder, minimize cross-entropy loss. That works, but here's a question: what is the model actually learning?

When a classifier fails, the standard answer is "get more data" or "tune hyperparameters." But sometimes the problem is deeper — the model is learning the wrong thing entirely because the way you framed the problem didn't force it to learn the right thing.

This route is about a different way of framing learning problems: contrastive learning. Instead of asking "is this molecule a binder?", you ask "is this molecule more similar to binders or non-binders?" It sounds subtle, but it changes everything about what the model learns — and you'll see it directly in the geometry of the embedding space.

By the end, you'll know whether contrastive learning actually helps for this dataset — or if the standard approach from R028 was good enough.

What you'll be able to do after this route

By the end, you can:

  • Explain what contrastive learning is and why it's useful
  • Construct triplets (anchor, positive, negative) from labeled molecular data
  • Fine-tune a molecular encoder using triplet loss
  • Visualize AND quantify how representation learning reshapes embedding space
  • Compare contrastive learning to standard classification using AUROC and AU-PRC
  • Articulate why hard negatives make learning harder — and more meaningful

Key definitions

Contrastive Learning A family of methods that train models by comparing examples rather than classifying them. The goal is to learn an embedding space where similar things are close and dissimilar things are far apart.

Triplet Loss A contrastive loss function that takes three examples — an anchor, a positive (same class), and a negative (different class) — and trains the model to make the anchor closer to the positive than to the negative by at least a margin.
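In symbols, with d the embedding distance and m the margin, the loss for one triplet (a, p, n) is:

```latex
\mathcal{L}(a, p, n) = \max\bigl(0,\; d(a, p) - d(a, n) + m\bigr)
```

The loss is zero only once the negative is at least m farther from the anchor than the positive is.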

Hard Negatives Negative examples that are structurally similar to positives but don't share the positive label. Hard to distinguish, which forces the model to learn subtle, meaningful features.

Silhouette Score A metric from -1 to 1 measuring how well-clustered data is. Higher = tighter clusters, better separation. Use this to quantify embedding quality beyond "it looks better."

AU-PRC (Area Under Precision-Recall Curve) A classification metric better suited for imbalanced datasets than AUROC. Measures how well you're finding rare positives among many negatives.

Dataset credit: Karen Pu pre-processed this dataset from the KinDEL benchmark (Chen et al., 2025).


Route 030: Three of a Kind

  • RouteID: 030
  • Wall: The DEL Wall (W08)
  • Grade: 5.11a
  • Routesetters: Karen + Adrian
  • Time: ~1.5 hours
  • Dataset: MAPK14 from KinDEL benchmark (Chen et al., 2025)
  • Prerequisites: R028 (Your First Molecular Transformer). R031 is optional but useful for comparison.

UNDER CONSTRUCTION

This route is being actively built by Adrian and Karen. The dataset link and some details may change. Check back for updates!
