
Unfreezing the Transformer

Route ID: R029 • Wall: W08 • Released: Feb 24, 2026

5.10c


Deliverables

Submit your completed notebook (.ipynb) with:

  1. Fine-tuning setup and training loop (Exercise 1)
  2. Training loss curve (Exercise 2)
  3. AUROC and AU-PRC on easy and hard negatives (Exercise 3)
  4. Comparison table: frozen vs fine-tuned (Exercise 4)
  5. Reflection answers in markdown cells (Exercise 5)

Exercise 5: Reflection

Goal: Think about what you learned.

Answer in your notebook (2-3 sentences each):

  1. Did fine-tuning improve AUROC and AU-PRC compared to frozen embeddings? By how much?

  2. How much longer did fine-tuning take compared to frozen embeddings? Was the improvement worth the extra compute?

  3. Did fine-tuning help more on easy or hard negatives? Why might that be?

  4. What's the risk of fine-tuning a pre-trained model on a small dataset? (Hint: think about what the model might "forget")

  5. In the next route (R030), you'll try contrastive learning. Do you think it will beat fine-tuning? Why or why not?


Exercise 4: The Comparison

Goal: See if fine-tuning was worth it.

Fill in this comparison table with your results:

| Approach | Easy AUROC | Easy AU-PRC | Hard AUROC | Hard AU-PRC |
| --- | --- | --- | --- | --- |
| R028 Frozen embeddings | ? | ? | ? | ? |
| R029 Fine-tuned (this route) | ? | ? | ? | ? |
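If you track your results in the notebook, one way to assemble the comparison is a small pandas frame. This is just a sketch: the `None` entries are placeholders for the numbers you measure, not real results.

```python
import pandas as pd

# Placeholder comparison table — replace each None with your measured value.
results = pd.DataFrame(
    {
        "Easy AUROC": [None, None],
        "Easy AU-PRC": [None, None],
        "Hard AUROC": [None, None],
        "Hard AU-PRC": [None, None],
    },
    index=["R028 Frozen embeddings", "R029 Fine-tuned (this route)"],
)
print(results)
```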

Questions:

  • Did fine-tuning improve performance? By how much?
  • Did it help more on easy or hard negatives?
  • How much extra time/compute did fine-tuning require?
  • Is the improvement worth the cost?

Success check:

  • You have numbers for both approaches
  • You can articulate whether fine-tuning was worth it

Exercise 3: Evaluation

Goal: Measure performance with AUROC and AU-PRC.

Use the same evaluation approach from R028:

  • Split your data (or use cross-validation)
  • Compute AUROC and AU-PRC on both easy and hard negatives
  • Record your results for comparison
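The metric calls themselves are two lines of scikit-learn. A sketch with toy arrays standing in for your model's held-out labels and predicted binding probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy labels and predicted probabilities — stand-ins for your model's
# outputs on a held-out split (real arrays come from Exercise 2).
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.3, 0.2, 0.6, 0.4, 0.1, 0.7])

auroc = roc_auc_score(y_true, y_score)            # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # AU-PRC (average precision)
print(f"AUROC: {auroc:.3f}  AU-PRC: {auprc:.3f}")
```

Run this once per negative set (easy and hard) so the two splits are never mixed in one score.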

Ask your chatbot:

"How do I evaluate a fine-tuned transformer classifier with AUROC and AU-PRC?"

Feeling lost? Revisit R028 Exercise 4 for the evaluation setup.

Success check:

  • AUROC and AU-PRC computed for easy negatives
  • AUROC and AU-PRC computed for hard negatives
  • Ready to compare in Exercise 4

Exercise 2: Training

Goal: Fine-tune and monitor the loss.

Train for 3-5 epochs. Watch your loss curve — it should decrease. If it doesn't, something is wrong: check your learning rate and your labels first.

Tips:

  • Use a small learning rate (1e-5 to 5e-5) — you're fine-tuning, not training from scratch
  • Save checkpoints so you don't lose progress if Colab disconnects
  • Monitor GPU memory — reduce batch size if you run out

Ask your chatbot:

"How do I save and load model checkpoints in PyTorch?"
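One possible shape for the answer, sketched with a tiny linear model standing in for your fine-tuned transformer (the filename `checkpoint.pt` is arbitrary):

```python
import torch
import torch.nn as nn

# Stand-in model and optimizer — substitute your transformer + AdamW setup.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Save everything needed to resume: weights, optimizer state, epoch counter.
torch.save({
    "epoch": 3,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, "checkpoint.pt")

# Later (e.g. after a Colab disconnect): rebuild the objects, then restore.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1  # resume from the next epoch
```

Saving the optimizer state matters for AdamW: its per-parameter moment estimates are part of training progress, not just the weights.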

Plot your training loss over batches or epochs.
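A minimal matplotlib sketch, assuming you appended `loss.item()` to a list after each batch (the values below are placeholders, not real results):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt

# Placeholder loss history — in your notebook, collect loss.item() per batch.
loss_history = [0.69, 0.58, 0.51, 0.47, 0.44, 0.42]

plt.figure(figsize=(6, 4))
plt.plot(loss_history, marker="o")
plt.xlabel("Batch")
plt.ylabel("Training loss")
plt.title("Fine-tuning loss curve")
plt.savefig("loss_curve.png", dpi=120)
```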

Questions:

  • How long did each epoch take?
  • Did the loss decrease smoothly or was it noisy?
  • What batch size could you fit in GPU memory?

Success check:

  • Training completes without crashing
  • Loss decreases over epochs
  • You have a loss curve to include in your submission

Exercise 1: Fine-Tuning Setup

Goal: Set up end-to-end fine-tuning of the molecular transformer.

In R028, you froze the transformer and only trained a classifier on top. Now you'll unfreeze it — the transformer weights will update during training, allowing it to learn task-specific representations.

The architecture

SMILES → Transformer → [CLS] embedding → Classification head → P(binder)
         |___________________________|   |_________________|
                       ↓                          ↓
              UNFROZEN (updates!)            Also updates
              Learns binding-specific        Learns to predict
              representations                from representations

What you need

  1. Your pre-trained model from R028
  2. A classification head — a simple linear layer on top
  3. An optimizer that updates ALL weights (transformer + head)
  4. A loss function — binary cross-entropy works well
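The four pieces above can be wired together roughly as in the sketch below. Note the assumptions: the small randomly initialized encoder is only a stand-in for your pre-trained R028 model (with HuggingFace you would load it via `AutoModel.from_pretrained(...)` and take the [CLS] token's hidden state), and the vocabulary size and batch are fake.

```python
import torch
import torch.nn as nn

class FineTuneClassifier(nn.Module):
    """Transformer encoder + linear head, trained end-to-end."""

    def __init__(self, vocab_size=64, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # classification head on top

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        cls = h[:, 0]                            # first token plays [CLS]
        return self.head(cls).squeeze(-1)        # one logit per molecule

model = FineTuneClassifier()
# One optimizer over ALL parameters: encoder AND head both update.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# BCEWithLogitsLoss = sigmoid + binary cross-entropy, numerically stable.
loss_fn = nn.BCEWithLogitsLoss()

tokens = torch.randint(0, 64, (4, 16))        # fake batch of tokenized SMILES
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])   # binder / non-binder labels
loss = loss_fn(model(tokens), labels)
loss.backward()
optimizer.step()
```

Passing `model.parameters()` (rather than `model.head.parameters()`) to the optimizer is exactly what "unfreezing" means here.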

Ask your chatbot:

"How do I add a classification head to a HuggingFace transformer and fine-tune it end-to-end with PyTorch?"

You can also explore HuggingFace's Trainer API, which handles a lot of the boilerplate:

"How do I use HuggingFace Trainer for binary classification?"

Key differences from frozen embeddings

| Frozen (R028) | Fine-tuned (this route) |
| --- | --- |
| Transformer weights fixed | Transformer weights update |
| Extract embeddings once | Re-embed every batch |
| Train sklearn classifier | Train PyTorch model |
| Fast, CPU-friendly | Slow, needs GPU |
| Model stays general | Model becomes task-specific |
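The "fixed vs. updating" distinction comes down to the `requires_grad` flag on each parameter. A minimal sketch with a stand-in two-part model (a tiny `Sequential` playing the roles of backbone and head):

```python
import torch.nn as nn

# Stand-in two-part model: "backbone" plays the pre-trained transformer,
# "head" plays the classifier on top.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))
backbone, head = model[:2], model[2]

# R028-style feature extraction: freeze the backbone, train only the head.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")

# This route: unfreeze everything so the backbone adapts too.
for p in model.parameters():
    p.requires_grad = True
```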

Questions:

  • Why use a smaller learning rate for fine-tuning than training from scratch?
  • What's the risk of "catastrophic forgetting" when fine-tuning?
  • Why might fine-tuning help more on hard negatives than easy ones?

Success check:

  • Model loads with classification head attached
  • Optimizer is set up to update all parameters
  • Ready to train in Exercise 2

Background: Why Fine-Tune?

In R028, you used the transformer as a frozen feature extractor. This is fast and works surprisingly well — the pre-trained model already knows a lot about molecular structure.

But there's a limitation: the representations weren't learned for your task. The model was pre-trained on general molecular data, not on binding to MAPK14. Fine-tuning lets the model adapt its representations to be more useful for your specific classification problem.

The tradeoff:

  • Frozen: Fast, stable, works on CPU. But representations are generic.
  • Fine-tuned: Slow, needs GPU, risk of overfitting. But representations are task-specific.

Ask your chatbot:

"What is the difference between feature extraction and fine-tuning in transfer learning?"

"What is catastrophic forgetting and how do I prevent it when fine-tuning?"


Why this route exists

R028 gave you a baseline with frozen embeddings. But you might be wondering: what if we actually updated the transformer weights? Would the model learn better representations for our specific task?

That's what this route explores. You'll fine-tune the entire model end-to-end and see if it beats frozen embeddings.

Spoiler: Fine-tuning often helps, but not always. Sometimes frozen embeddings are good enough, and fine-tuning just wastes compute (or worse, overfits). You'll find out which is true for this dataset.

What you'll be able to do after this route

By the end, you can:

  • Fine-tune a HuggingFace transformer for binary classification
  • Set up a PyTorch training loop with proper learning rates
  • Compare frozen vs fine-tuned approaches quantitatively
  • Articulate when fine-tuning is worth the extra compute

Key definitions

Fine-tuning: Updating the weights of a pre-trained model on your specific task. The model "adapts" its representations to be more useful for your problem.

Catastrophic forgetting: When fine-tuning causes a model to "forget" useful knowledge from pre-training. Mitigated by using small learning rates and not training too long.

Learning rate: How big a step the optimizer takes. For fine-tuning, use small values (1e-5 to 5e-5) to avoid destroying pre-trained knowledge.

Dataset credit: Karen Pu pre-processed this dataset from the KinDEL benchmark (Chen et al., 2025).


Route 029: Unfreezing the Transformer

  • Route ID: R029
  • Wall: The DEL Wall (W08)
  • Grade: 5.10c
  • Routesetters: Karen + Adrian
  • Time: ~1.5 hours
  • Dataset: MAPK14 from KinDEL benchmark (Chen et al., 2025)
  • Prerequisites: R028 (Your First Molecular Transformer)

WARNING: COMPUTE-INTENSIVE

This route requires a GPU. Free Google Colab might work, but you may run into memory limits or disconnections. Colab Pro is recommended.

Connect to a GPU in Colab: Go to Runtime → Change runtime type → Hardware accelerator → GPU (T4)

If you can't get fine-tuning to work, that's okay! Skip to R030 (contrastive learning) and come back to this one later when you have better compute access. The important thing is understanding why fine-tuning might help, even if you can't run it yourself.

Alternatives if you're stuck:

  • Use a smaller model (e.g., DistilBERT-based molecular models)
  • Reduce batch size to 4 or 8
  • Train for fewer epochs
  • Use gradient checkpointing to save memory
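Gradient checkpointing trades compute for memory: activations inside the checkpointed block are recomputed during the backward pass instead of being stored. HuggingFace models wrap this as `model.gradient_checkpointing_enable()`; the underlying PyTorch mechanism, sketched on a stand-in block, looks roughly like this:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Stand-in block — in practice this would be a transformer layer.
block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))

x = torch.randn(4, 32, requires_grad=True)
# Forward without storing intermediate activations for this block.
y = checkpoint(block, x, use_reentrant=False)
# Backward recomputes the block's activations on the fly.
y.sum().backward()
```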

🧗 Base Camp

Start here and climb your way up!