
Unfreezing the Transformer

Route ID: R029 • Wall: W08 • Released: Feb 24, 2026

5.10c


Deliverables

Submit your completed notebook (.ipynb) with:

  1. Fine-tuning setup and training loop (Exercise 1)
  2. Training loss curve (Exercise 2)
  3. AUROC and AU-PRC on easy and hard negatives (Exercise 3)
  4. Comparison table: frozen vs fine-tuned (Exercise 4)
  5. Reflection answers in markdown cells (Exercise 5)

Exercise 5: Reflection

Goal: Think about what you learned.

Answer in your notebook (2-3 sentences each):

  1. Did fine-tuning improve AUROC and AU-PRC compared to frozen embeddings? By how much?

  2. How much longer did fine-tuning take compared to frozen embeddings? Was the improvement worth the extra compute?

  3. Did fine-tuning help more on easy or hard negatives? Why might that be?

  4. What's the risk of fine-tuning a pre-trained model on a small dataset? (Hint: think about what the model might "forget")

  5. In the next route (R030), you'll try contrastive learning. Do you think it will beat fine-tuning? Why or why not?


Exercise 4: The Comparison

Goal: See if fine-tuning was worth it.

Fill in this comparison table with your results:

| Approach | Easy AUROC | Easy AU-PRC | Hard AUROC | Hard AU-PRC |
| --- | --- | --- | --- | --- |
| R028 Frozen embeddings | ? | ? | ? | ? |
| R029 Fine-tuned (this route) | ? | ? | ? | ? |
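If you track your results in the notebook, one way to assemble the comparison is a small pandas frame. This is just a sketch: the `None` entries are placeholders for the numbers you measure, not real results.

```python
import pandas as pd

# Placeholder comparison table — replace each None with your measured value.
results = pd.DataFrame(
    {
        "Easy AUROC": [None, None],
        "Easy AU-PRC": [None, None],
        "Hard AUROC": [None, None],
        "Hard AU-PRC": [None, None],
    },
    index=["R028 Frozen embeddings", "R029 Fine-tuned (this route)"],
)
print(results)
```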

Questions:

  • Did fine-tuning improve performance? By how much?
  • Did it help more on easy or hard negatives?
  • How much extra time/compute did fine-tuning require?
  • Is the improvement worth the cost?

Success check:

  • You have numbers for both approaches
  • You can articulate whether fine-tuning was worth it

Exercise 3: Evaluation

Goal: Measure performance with AUROC and AU-PRC.

Use the same evaluation approach from R028:

  • Split your data (or use cross-validation)
  • Compute AUROC and AU-PRC on both easy and hard negatives
  • Record your results for comparison
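The metric calls themselves are two lines of scikit-learn. A sketch with toy arrays standing in for your model's held-out labels and predicted binding probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy labels and predicted probabilities — stand-ins for your model's
# outputs on a held-out split (real arrays come from Exercise 2).
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.3, 0.2, 0.6, 0.4, 0.1, 0.7])

auroc = roc_auc_score(y_true, y_score)            # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # AU-PRC (average precision)
print(f"AUROC: {auroc:.3f}  AU-PRC: {auprc:.3f}")
```

Run this once per negative set (easy and hard) so the two splits are never mixed in one score.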

Ask your chatbot:

"How do I evaluate a fine-tuned transformer classifier with AUROC and AU-PRC?"

Feeling lost? Revisit R028 Exercise 4 for the evaluation setup.

Success check:

  • AUROC and AU-PRC computed for easy negatives
  • AUROC and AU-PRC computed for hard negatives
  • Ready to compare in Exercise 4

Exercise 2: Training

Goal: Fine-tune and monitor the loss.

Train for 3-5 epochs. Watch your loss curve — it should decrease. If it doesn't, something is wrong: check your learning rate and your labels first.

Tips:

  • Use a small learning rate (1e-5 to 5e-5) — you're fine-tuning, not training from scratch
  • Save checkpoints so you don't lose progress if Colab disconnects
  • Monitor GPU memory — reduce batch size if you run out

Ask your chatbot:

"How do I save and load model checkpoints in PyTorch?"
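One possible shape for the answer, sketched with a tiny linear model standing in for your fine-tuned transformer (the filename `checkpoint.pt` is arbitrary):

```python
import torch
import torch.nn as nn

# Stand-in model and optimizer — substitute your transformer + AdamW setup.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Save everything needed to resume: weights, optimizer state, epoch counter.
torch.save({
    "epoch": 3,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, "checkpoint.pt")

# Later (e.g. after a Colab disconnect): rebuild the objects, then restore.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1  # resume from the next epoch
```

Saving the optimizer state matters for AdamW: its per-parameter moment estimates are part of training progress, not just the weights.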

Plot your training loss over batches or epochs.
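A minimal matplotlib sketch, assuming you appended `loss.item()` to a list after each batch (the values below are placeholders, not real results):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt

# Placeholder loss history — in your notebook, collect loss.item() per batch.
loss_history = [0.69, 0.58, 0.51, 0.47, 0.44, 0.42]

plt.figure(figsize=(6, 4))
plt.plot(loss_history, marker="o")
plt.xlabel("Batch")
plt.ylabel("Training loss")
plt.title("Fine-tuning loss curve")
plt.savefig("loss_curve.png", dpi=120)
```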

Questions:

  • How long did each epoch take?
  • Did the loss decrease smoothly or was it noisy?
  • What batch size could you fit in GPU memory?

Success check:

  • Training completes without crashing
  • Loss decreases over epochs
  • You have a loss curve to include in your submission

Exercise 1: Fine-Tuning Setup

Goal: Set up end-to-end fine-tuning of the molecular transformer.

In R028, you froze the transformer and only trained a classifier on top. Now you'll unfreeze it — the transformer weights will update during training, allowing it to learn task-specific representations.

The architecture

SMILES → Transformer → [CLS] embedding → Classification head → P(binder)
         |___________________________|   |_________________|
                       ↓                          ↓
              UNFROZEN (updates!)            Also updates
              Learns binding-specific        Learns to predict
              representations                from representations

What you need

  1. Your pre-trained model from R028
  2. A classification head — a simple linear layer on top
  3. An optimizer that updates ALL weights (transformer + head)
  4. A loss function — binary cross-entropy works well
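The four pieces above can be wired together roughly as in the sketch below. Note the assumptions: the small randomly initialized encoder is only a stand-in for your pre-trained R028 model (with HuggingFace you would load it via `AutoModel.from_pretrained(...)` and take the [CLS] token's hidden state), and the vocabulary size and batch are fake.

```python
import torch
import torch.nn as nn

class FineTuneClassifier(nn.Module):
    """Transformer encoder + linear head, trained end-to-end."""

    def __init__(self, vocab_size=64, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # classification head on top

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        cls = h[:, 0]                            # first token plays [CLS]
        return self.head(cls).squeeze(-1)        # one logit per molecule

model = FineTuneClassifier()
# One optimizer over ALL parameters: encoder AND head both update.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# BCEWithLogitsLoss = sigmoid + binary cross-entropy, numerically stable.
loss_fn = nn.BCEWithLogitsLoss()

tokens = torch.randint(0, 64, (4, 16))        # fake batch of tokenized SMILES
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])   # binder / non-binder labels
loss = loss_fn(model(tokens), labels)
loss.backward()
optimizer.step()
```

Passing `model.parameters()` (rather than `model.head.parameters()`) to the optimizer is exactly what "unfreezing" means here.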

Ask your chatbot:

"How do I add a classification head to a HuggingFace transformer and fine-tune it end-to-end with PyTorch?"

You can also explore HuggingFace's Trainer API, which handles a lot of the boilerplate:

"How do I use HuggingFace Trainer for binary classification?"

Key differences from frozen embeddings

| Frozen (R028) | Fine-tuned (this route) |
| --- | --- |
| Transformer weights fixed | Transformer weights update |
| Extract embeddings once | Re-embed every batch |
| Train sklearn classifier | Train PyTorch model |
| Fast, CPU-friendly | Slow, needs GPU |
| Model stays general | Model becomes task-specific |
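The "fixed vs. updating" distinction comes down to the `requires_grad` flag on each parameter. A minimal sketch with a stand-in two-part model (a tiny `Sequential` playing the roles of backbone and head):

```python
import torch.nn as nn

# Stand-in two-part model: "backbone" plays the pre-trained transformer,
# "head" plays the classifier on top.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))
backbone, head = model[:2], model[2]

# R028-style feature extraction: freeze the backbone, train only the head.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")

# This route: unfreeze everything so the backbone adapts too.
for p in model.parameters():
    p.requires_grad = True
```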

Questions:

  • Why use a smaller learning rate for fine-tuning than training from scratch?
  • What's the risk of "catastrophic forgetting" when fine-tuning?
  • Why might fine-tuning help more on hard negatives than easy ones?

Success check:

  • Model loads with classification head attached
  • Optimizer is set up to update all parameters
  • Ready to train in Exercise 2

Background: Why Fine-Tune?

In R028, you used the transformer as a frozen feature extractor. This is fast and works surprisingly well — the pre-trained model already knows a lot about molecular structure.

But there's a limitation: the representations weren't learned for your task. The model was pre-trained on general molecular data, not on binding to MAPK14. Fine-tuning lets the model adapt its representations to be more useful for your specific classification problem.

The tradeoff:

  • Frozen: Fast, stable, works on CPU. But representations are generic.
  • Fine-tuned: Slow, needs GPU, risk of overfitting. But representations are task-specific.

Ask your chatbot:

"What is the difference between feature extraction and fine-tuning in transfer learning?"

"What is catastrophic forgetting and how do I prevent it when fine-tuning?"


Why this route exists

R028 gave you a baseline with frozen embeddings. But you might be wondering: what if we actually updated the transformer weights? Would the model learn better representations for our specific task?

That's what this route explores. You'll fine-tune the entire model end-to-end and see if it beats frozen embeddings.

Spoiler: Fine-tuning often helps, but not always. Sometimes frozen embeddings are good enough, and fine-tuning just wastes compute (or worse, overfits). You'll find out which is true for this dataset.

What you'll be able to do after this route

By the end, you can:

  • Fine-tune a HuggingFace transformer for binary classification
  • Set up a PyTorch training loop with proper learning rates
  • Compare frozen vs fine-tuned approaches quantitatively
  • Articulate when fine-tuning is worth the extra compute

Key definitions

Fine-tuning: Updating the weights of a pre-trained model on your specific task. The model "adapts" its representations to be more useful for your problem.

Catastrophic forgetting: When fine-tuning causes a model to "forget" useful knowledge from pre-training. Mitigated by using small learning rates and not training too long.

Learning rate: How big a step the optimizer takes. For fine-tuning, use small values (1e-5 to 5e-5) to avoid destroying pre-trained knowledge.

Dataset credit: Karen Pu pre-processed this dataset from the KinDEL benchmark (Chen et al., 2025).


Route 029: Unfreezing the Transformer

  • Route ID: R029
  • Wall: The DEL Wall (W08)
  • Grade: 5.10c
  • Routesetters: Karen + Adrian
  • Time: ~1.5 hours
  • Dataset: MAPK14 from KinDEL benchmark (Chen et al., 2025)
  • Prerequisites: R028 (Your First Molecular Transformer)

WARNING: COMPUTE-INTENSIVE

This route requires a GPU. Free Google Colab might work, but you may run into memory limits or disconnections. Colab Pro is recommended.

Connect to a GPU in Colab: Go to Runtime → Change runtime type → Hardware accelerator → GPU (T4)

If you can't get fine-tuning to work, that's okay! Skip to R030 (contrastive learning) and come back to this one later when you have better compute access. The important thing is understanding why fine-tuning might help, even if you can't run it yourself.

Alternatives if you're stuck:

  • Use a smaller model (e.g., DistilBERT-based molecular models)
  • Reduce batch size to 4 or 8
  • Train for fewer epochs
  • Use gradient checkpointing to save memory
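Gradient checkpointing trades compute for memory: activations inside the checkpointed block are recomputed during the backward pass instead of being stored. HuggingFace models wrap this as `model.gradient_checkpointing_enable()`; the underlying PyTorch mechanism, sketched on a stand-in block, looks roughly like this:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Stand-in block — in practice this would be a transformer layer.
block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))

x = torch.randn(4, 32, requires_grad=True)
# Forward without storing intermediate activations for this block.
y = checkpoint(block, x, use_reentrant=False)
# Backward recomputes the block's activations on the fly.
y.sum().backward()
```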

🧗 Base Camp

Start here and climb your way up!