🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Fine-tuning setup and training loop (Exercise 1)
- Training loss curve (Exercise 2)
- AUROC and AU-PRC on easy and hard negatives (Exercise 3)
- Comparison table: frozen vs fine-tuned (Exercise 4)
- Reflection answers in markdown cells (Exercise 5)
Exercise 5: Reflection
Goal: Think about what you learned.
Answer in your notebook (2-3 sentences each):
- Did fine-tuning improve AUROC and AU-PRC compared to frozen embeddings? By how much?
- How much longer did fine-tuning take compared to frozen embeddings? Was the improvement worth the extra compute?
- Did fine-tuning help more on easy or hard negatives? Why might that be?
- What's the risk of fine-tuning a pre-trained model on a small dataset? (Hint: think about what the model might "forget")
- In the next route (R030), you'll try contrastive learning. Do you think it will beat fine-tuning? Why or why not?
Exercise 4: The Comparison
Goal: See if fine-tuning was worth it.
Fill in this comparison table with your results:
| Approach | Easy AUROC | Easy AU-PRC | Hard AUROC | Hard AU-PRC |
|---|---|---|---|---|
| R028 Frozen embeddings | ??? | ??? | ??? | ??? |
| R029 Fine-tuned (this route) | ??? | ??? | ??? | ??? |
Questions:
- Did fine-tuning improve performance? By how much?
- Did it help more on easy or hard negatives?
- How much extra time/compute did fine-tuning require?
- Is the improvement worth the cost?
Success check:
- You have numbers for both approaches
- You can articulate whether fine-tuning was worth it
Exercise 3: Evaluation
Goal: Measure performance with AUROC and AU-PRC.
Use the same evaluation approach from R028:
- Split your data (or use cross-validation)
- Compute AUROC and AU-PRC on both easy and hard negatives
- Record your results for comparison
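The metrics themselves are two scikit-learn calls. A minimal sketch, assuming you already have true labels and predicted probabilities for each negative set (variable names are illustrative):

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true, y_score):
    """AUROC and AU-PRC (average precision) from labels and predicted P(binder)."""
    return {
        "auroc": roc_auc_score(y_true, y_score),
        "au_prc": average_precision_score(y_true, y_score),
    }

# Run once per negative set -- same positives, different negatives:
# easy_metrics = evaluate(labels_easy, probs_easy)
# hard_metrics = evaluate(labels_hard, probs_hard)
```

Record both dicts so you can drop the numbers straight into the Exercise 4 table.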
Ask your chatbot:
"How do I evaluate a fine-tuned transformer classifier with AUROC and AU-PRC?"
Feeling lost? Revisit R028 Exercise 4 for the evaluation setup.
Success check:
- AUROC and AU-PRC computed for easy negatives
- AUROC and AU-PRC computed for hard negatives
- Ready to compare in Exercise 4
Exercise 2: Training
Goal: Fine-tune and monitor the loss.
Train for 3-5 epochs. Watch your loss curve — it should decrease. If it doesn't, something's wrong.
Tips:
- Use a small learning rate (1e-5 to 5e-5) — you're fine-tuning, not training from scratch
- Save checkpoints so you don't lose progress if Colab disconnects
- Monitor GPU memory — reduce batch size if you run out
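The steps above can be sketched as a minimal loop. This assumes your model takes `input_ids`/`attention_mask` batches and returns one logit per example; the function and key names are illustrative:

```python
import torch

def train(model, loader, optimizer, loss_fn, epochs=3, device="cuda"):
    """Minimal fine-tuning loop; returns per-batch losses for the loss curve."""
    model.to(device)
    model.train()
    losses = []
    for epoch in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            logits = model(batch["input_ids"].to(device),
                           batch["attention_mask"].to(device))
            loss = loss_fn(logits, batch["labels"].float().to(device))
            loss.backward()   # gradients flow into the transformer too
            optimizer.step()
            losses.append(loss.item())
        print(f"epoch {epoch}: last batch loss {losses[-1]:.4f}")
    return losses
```

The returned list is exactly what you need to plot the loss curve for your submission.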
Ask your chatbot:
"How do I save and load model checkpoints in PyTorch?"
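A minimal checkpointing sketch (the file path and dict keys are illustrative; in Colab, point `path` at a mounted Google Drive folder so checkpoints survive disconnects):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    # Save enough state to resume training after a Colab disconnect.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Restore weights AND optimizer state, then resume from the saved epoch.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]
```

Saving the optimizer state matters for AdamW: its per-parameter moment estimates are part of what you'd otherwise lose.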
Plot your training loss over batches or epochs.
Questions:
- How long did each epoch take?
- Did the loss decrease smoothly or was it noisy?
- What batch size could you fit in GPU memory?
Success check:
- Training completes without crashing
- Loss decreases over epochs
- You have a loss curve to include in your submission
Exercise 1: Fine-Tuning Setup
Goal: Set up end-to-end fine-tuning of the molecular transformer.
In R028, you froze the transformer and only trained a classifier on top. Now you'll unfreeze it — the transformer weights will update during training, allowing it to learn task-specific representations.
The architecture
```
SMILES → Transformer → [CLS] embedding → Classification head → P(binder)
              │                                    │
      UNFROZEN (updates!)                    Also updates
      Learns binding-specific             Learns to predict
      representations                     from representations
```
What you need
- Your pre-trained model from R028
- A classification head — a simple linear layer on top
- An optimizer that updates ALL weights (transformer + head)
- A loss function — binary cross-entropy works well
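Putting those four pieces together might look like this sketch. The `checkpoint` argument is a placeholder for whatever model you used in R028, and it assumes an encoder that exposes `last_hidden_state`:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class FineTuneClassifier(nn.Module):
    """Pre-trained encoder + linear classification head; ALL weights trainable."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]     # [CLS] token embedding
        return self.head(cls).squeeze(-1)     # one logit; sigmoid -> P(binder)

def build(checkpoint, lr=2e-5):
    # checkpoint is a placeholder -- use the model you loaded in R028.
    model = FineTuneClassifier(AutoModel.from_pretrained(checkpoint))
    # One optimizer over ALL parameters: transformer AND head both update.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits
    return model, optimizer, loss_fn
```

Contrast this with R028: there, you would have wrapped the encoder call in `torch.no_grad()` and passed only the head's parameters to the optimizer.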
Ask your chatbot:
"How do I add a classification head to a HuggingFace transformer and fine-tune it end-to-end with PyTorch?"
You can also explore HuggingFace's Trainer API, which handles a lot of the boilerplate:
"How do I use HuggingFace Trainer for binary classification?"
Key differences from frozen embeddings
| Frozen (R028) | Fine-tuned (this route) |
|---|---|
| Transformer weights fixed | Transformer weights update |
| Extract embeddings once | Re-embed every batch |
| Train sklearn classifier | Train PyTorch model |
| Fast, CPU-friendly | Slow, needs GPU |
| Model stays general | Model becomes task-specific |
Questions:
- Why use a smaller learning rate for fine-tuning than training from scratch?
- What's the risk of "catastrophic forgetting" when fine-tuning?
- Why might fine-tuning help more on hard negatives than easy ones?
Success check:
- Model loads with classification head attached
- Optimizer is set up to update all parameters
- Ready to train in Exercise 2
Background: Why Fine-Tune?
In R028, you used the transformer as a frozen feature extractor. This is fast and works surprisingly well — the pre-trained model already knows a lot about molecular structure.
But there's a limitation: the representations weren't learned for your task. The model was pre-trained on general molecular data, not on binding to MAPK14. Fine-tuning lets the model adapt its representations to be more useful for your specific classification problem.
The tradeoff:
- Frozen: Fast, stable, works on CPU. But representations are generic.
- Fine-tuned: Slow, needs GPU, risk of overfitting. But representations are task-specific.
Ask your chatbot:
"What is the difference between feature extraction and fine-tuning in transfer learning?"
"What is catastrophic forgetting and how do I prevent it when fine-tuning?"
Why this route exists
R028 gave you a baseline with frozen embeddings. But you might be wondering: what if we actually updated the transformer weights? Would the model learn better representations for our specific task?
That's what this route explores. You'll fine-tune the entire model end-to-end and see if it beats frozen embeddings.
Spoiler: Fine-tuning often helps, but not always. Sometimes frozen embeddings are good enough, and fine-tuning just wastes compute (or worse, overfits). You'll find out which is true for this dataset.
What you'll be able to do after this route
By the end, you can:
- Fine-tune a HuggingFace transformer for binary classification
- Set up a PyTorch training loop with proper learning rates
- Compare frozen vs fine-tuned approaches quantitatively
- Articulate when fine-tuning is worth the extra compute
Key definitions
- Fine-tuning: Updating the weights of a pre-trained model on your specific task. The model "adapts" its representations to be more useful for your problem.
- Catastrophic forgetting: When fine-tuning causes a model to "forget" useful knowledge from pre-training. Mitigated by using small learning rates and not training too long.
- Learning rate: How big a step the optimizer takes at each update. For fine-tuning, use small values (1e-5 to 5e-5) to avoid destroying pre-trained knowledge.
Dataset credit: Karen Pu pre-processed this dataset from the KinDEL benchmark (Chen et al., 2025).
Route 029: Unfreezing the Transformer
- RouteID: 029
- Wall: The DEL Wall (W08)
- Grade: 5.10c
- Routesetters: Karen + Adrian
- Time: ~1.5 hours
- Dataset: MAPK14 from KinDEL benchmark (Chen et al., 2025)
- Prerequisites: R028 (Your First Molecular Transformer)
WARNING: COMPUTE-INTENSIVE
This route requires a GPU. Free Google Colab might work, but you may run into memory limits or disconnections. Colab Pro is recommended.
Connect to a GPU in Colab: Go to Runtime → Change runtime type → Hardware accelerator → GPU (T4)
If you can't get fine-tuning to work, that's okay! Skip to R030 (contrastive learning) and come back to this one later when you have better compute access. The important thing is understanding why fine-tuning might help, even if you can't run it yourself.
Alternatives if you're stuck:
- Use a smaller model (e.g., DistilBERT-based molecular models)
- Reduce batch size to 4 or 8
- Train for fewer epochs
- Use gradient checkpointing to save memory
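The batch-size and gradient-checkpointing levers can be sketched together. `gradient_checkpointing_enable()` is a real method on HuggingFace `PreTrainedModel`s; the `hasattr` guard makes this a no-op for plain PyTorch modules:

```python
from torch.utils.data import DataLoader

def fit_in_memory(model, dataset, batch_size=4):
    # Gradient checkpointing recomputes activations during backward
    # instead of storing them: slower, but much less GPU memory.
    if hasattr(model, "gradient_checkpointing_enable"):
        model.gradient_checkpointing_enable()
    # A small batch size is the other easy memory lever.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```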
🧗 Base Camp
Start here and climb your way up!