
Learning Curves

Route ID: R029 • Wall: W06 • Released: Feb 19, 2026

5.8


Deliverables

Submit your completed notebook (.ipynb) with:

  1. Both files loaded and matched using Entry IDs
  2. Learning curve plots (small and large data regimes)
  3. Written responses to all questions
  4. Brief reflection (3–5 sentences)

Exercise 4: The Real Question — Why Is It So Good So Early?

Goal: Reflect deeply.

You likely observed:

  • Test accuracy is already high with moderate data
  • Performance stabilizes early

Write 3–5 sentences answering:

  1. Why is the model already performing well at small dataset sizes?
  2. What information might ESM embeddings already encode?
  3. Does this suggest membrane classification is:
    • An easy task?
    • A data-limited task?
    • A feature-limited task?

Exercise 3: Scaling Up (Learning Curve)

Goal: Determine when performance stabilizes.

What is a learning curve?

A learning curve answers a simple question: how does my model improve as I give it more training data?

Think of it like learning to climb. Your first few sessions at the gym, you improve rapidly — suddenly you can hold on longer, read routes better, trust your feet. But after months of climbing, gains come slower. You might train for a year to shave seconds off your send time. The easy gains are gone.

Models work the same way. With 10 examples, a model is guessing. With 100, it starts picking up patterns. With 1,000, it might be pretty good. But at some point, more data stops helping — you've hit a plateau.

Why this matters: Learning curves tell you whether your problem is data-limited. If the curve is still climbing steeply, collecting more data will help. If it's flat, you need a better model or better features — more data won't save you.

What to look for

When you plot train and test accuracy vs. dataset size, watch for:

  • The plateau: Where does test accuracy stop improving?
  • The gap: How far apart are train and test accuracy? A big gap means overfitting. As you add data, the gap should shrink.
  • The starting point: How good is the model with very little data? This tells you how much signal is already in your features.

Your task

  • Use balanced sampling with: 50, 100, 200, 300, 400, 500, 600, 700, 800, max per class
  • Plot:
    • Train accuracy
    • Test accuracy
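The task above can be sketched as a single loop, shown here on synthetic stand-in features so it runs on its own. In your notebook, swap in the X and y built in Exercise 1; the sizes, random seed, and feature dimension below are illustrative assumptions, not part of the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the ESM feature matrix (replace with your real X, y).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))
y = (X[:, 0] + 0.3 * rng.normal(size=2000) > 0).astype(int)

# Fixed held-out test set, kept constant across all training sizes.
test_idx = rng.choice(len(y), size=400, replace=False)
train_pool = np.setdiff1d(np.arange(len(y)), test_idx)

sizes = [50, 100, 200, 300, 400]      # examples per class; extend to your max
train_acc, test_acc = [], []
for n in sizes:
    # Balanced draw: exactly n examples of each class from the training pool.
    idx = np.concatenate([
        rng.choice(train_pool[y[train_pool] == c], size=n, replace=False)
        for c in (0, 1)
    ])
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    train_acc.append(accuracy_score(y[idx], clf.predict(X[idx])))
    test_acc.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
```

Plotting `train_acc` and `test_acc` against `sizes` (e.g. with matplotlib) gives the learning curve; the fixed `test_idx` is what keeps the evaluation comparable across sizes.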

Questions:

  1. Does performance continue increasing indefinitely, or does it plateau?
  2. Does the train/test gap shrink as you add more data?

Exercise 2: Mini Test (Small Data Regime)

Goal: Observe high-variance behavior.

Why balanced sampling?

In this dataset, membrane and non-membrane proteins aren't equally common. If you randomly grab 50 proteins, you might get 40 of one class and 10 of the other — just by chance. That makes learning curves hard to interpret: is performance changing because you have more data, or because the class ratio shifted?

The key principle for learning curves is consistency: keep the class ratio constant across all training set sizes. That way, the only thing changing is the quantity of data, not the distribution.

What ratio should you use? It depends on your problem:

  • If your real-world deployment uses imbalanced data (say, 90/10), your learning curve subsets should maintain that same 90/10 ratio at every size
  • If you're studying a balanced problem, use 50/50 subsets

For this exercise, we use balanced sampling (50/50) because it's cleaner for learning — a model that just guesses the majority class can only get 50% accuracy, so there's no free lunch.

Important: Your test set should also stay consistent. If your evaluation set shifts as you build the curve, you can't tell what's changing — training or evaluation. Keep it fixed.
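One way to implement the balanced draw described above is a small helper that picks the same number of indices from each class. This is a minimal sketch; the function name and the toy labels are assumptions for illustration, not part of the dataset.

```python
import numpy as np

def balanced_sample(y, n_per_class, rng):
    """Indices of a subset with exactly n_per_class examples of each class."""
    picks = [rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
             for c in np.unique(y)]
    return np.concatenate(picks)

rng = np.random.default_rng(42)
y = np.array([0] * 120 + [1] * 30)   # imbalanced toy labels
idx = balanced_sample(y, 20, rng)    # 20 of class 0 + 20 of class 1
```

Because `replace=False`, the same protein is never drawn twice within one subset, and the class ratio stays exactly 50/50 at every size.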

Your task

  • Train logistic regression using: 5, 10, 20, 30, 40, 50, 60, 70, 80, 90 proteins per class
  • Use balanced sampling (equal numbers of membrane and non-membrane)
  • Plot:
    • Training accuracy
    • Test accuracy

Questions:

  1. What happens to training accuracy at very small sizes?
  2. Why is test accuracy unstable?

Exercise 1: Build Feature Matrix

Goal: Construct X and y.

  • Keep only rows where Entry exists in emb_dict
  • Build:
    • X = stacked embeddings
    • y = membrane column

Do not overcomplicate this.
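The two bullets above amount to a filter plus a stack. Here is a minimal sketch on toy stand-ins: the Entry IDs and tiny vectors are hypothetical, and in your notebook `df` and `emb_dict` come from the files you loaded in Exercise 0.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real inputs (hypothetical IDs, 4-dim vectors).
df = pd.DataFrame({"Entry": ["A1", "A2", "A3"], "membrane": [1, 0, 1]})
emb_dict = {"A1": np.zeros(4), "A3": np.ones(4)}   # "A2" has no embedding

matched = df[df["Entry"].isin(emb_dict.keys())]    # keep rows with embeddings
X = np.stack([emb_dict[e] for e in matched["Entry"]])  # one row per protein
y = matched["membrane"].to_numpy()
```

`X.shape` then answers Question 1 directly, and `y.mean()` gives the membrane fraction for Question 2.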

Questions:

  1. What is the shape of X?
  2. What fraction of proteins are membrane proteins?

Exercise 0: The Knot Check (Data Integrity)

Goal: Confirm that metadata and embeddings align.

In your notebook:

  • Load the CSV file
  • Load the .pkl file
  • Confirm how many entries exist in each

Hint for loading pickle files:

import pickle
with open("83333_complete_esm2.pkl", "rb") as f:
    emb_dict = pickle.load(f)
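A sketch of the integrity check itself, using toy in-memory stand-ins so it runs without the downloads; in your notebook, `df` comes from `pd.read_csv(...)` on the CSV and `emb_dict` from the pickle hint above. The tiny 8-dimensional vectors here are illustrative only.

```python
import io
import numpy as np
import pandas as pd

# Toy stand-ins mirroring the real files (replace with the downloaded data).
df = pd.read_csv(io.StringIO("Entry,membrane\nP0A7V8,1\nP0A800,0\nP99999,1\n"))
emb_dict = {"P0A7V8": np.zeros(8), "P0A800": np.ones(8)}

n_csv = len(df)                                   # proteins in the CSV
n_emb = len(emb_dict)                             # embeddings in the pickle
dim = next(iter(emb_dict.values())).shape[0]      # embedding dimension
n_matched = int(df["Entry"].isin(emb_dict.keys()).sum())  # overlap by Entry ID
```

Comparing `n_csv`, `n_emb`, and `n_matched` answers Questions 1, 2, and 4; matching on the `Entry` column (Question 5) is what guarantees each embedding gets the right membrane label.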

Questions:

  1. How many proteins are in the CSV?
  2. How many embeddings are in the .pkl file?
  3. What is the dimension of each embedding vector?
  4. Do all CSV entries have embeddings?
  5. How can you correctly match each protein's embedding to its membrane label?

Success check:

  • You confirm the number of matched proteins

Files You Need

Download BOTH files and upload them to your notebook:

  1. ecoli_membrane_dataset.csv (click to download)
  2. 83333_complete_esm2.pkl (click to download)

What's in the CSV?

Column         Meaning
Entry          UniProt accession (unique protein ID)
Protein names  Descriptive protein name
Gene Names     Associated gene
Length         Sequence length
membrane       1 = membrane protein, 0 = not membrane

What's in the pickle file?

The embeddings file (.pkl) is a dictionary:

{
   "P0A7V8": numpy array (embedding vector),
   "P0A800": numpy array,
   ...
}

Keys = UniProt Entry IDs
Values = ESM protein embeddings
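Once loaded, looking up one protein is a plain dictionary access. A minimal sketch with a single hypothetical entry (the 8-dim vector is a stand-in; the real embeddings are much larger):

```python
import numpy as np

# One hypothetical entry mirroring the structure shown above.
emb_dict = {"P0A7V8": np.ones(8)}   # key: Entry ID, value: embedding vector
vec = emb_dict["P0A7V8"]            # look up one protein by its UniProt ID
```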


Why this route exists

In machine learning, one of the most common questions is:

"How much data do I need?"

There is no universal answer. Instead of guessing, we measure.

In this route, you will:

  • Use real protein embeddings from E. coli
  • Classify membrane vs non-membrane proteins
  • Train the same model on progressively larger subsets
  • Construct learning curves
  • Investigate the effect of class balance
  • Decide whether we are data-limited or not

This route is about diagnosis, not just model training.

What you'll be able to do after this route

By the end, you can:

  • Load .pkl embedding files
  • Match embeddings to protein metadata using UniProt Entry IDs
  • Construct learning curves
  • Interpret train/test gaps
  • Reason about class imbalance
  • Determine whether collecting more data will likely help

Route 029: Learning Curves — How Much Data Do I Need?

  • Route ID: 029
  • Wall: The Machine Learning Offwidth (W06)
  • Grade: 5.8
  • Routesetter: Sarah
  • Time: ~30 minutes
  • Dataset: E. coli proteome + ESM protein embeddings

🧗 Base Camp

Start here and climb your way up!