🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Both files loaded and matched using Entry IDs
- Learning curve plots (small and large data regimes)
- Written responses to all questions
- Brief reflection (3–5 sentences)
Exercise 4: The Real Question — Why Is It So Good So Early?
Goal: Reflect deeply.
You likely observed:
- Test accuracy is already high with moderate data
- Performance stabilizes early
Write 3–5 sentences answering:
- Why is the model already performing well at small dataset sizes?
- What information might ESM embeddings already encode?
- Does this suggest membrane classification is:
- An easy task?
- A data-limited task?
- A feature-limited task?
Exercise 3: Scaling Up (Learning Curve)
Goal: Determine when performance stabilizes.
What is a learning curve?
A learning curve answers a simple question: how does my model improve as I give it more training data?
Think of it like learning to climb. Your first few sessions at the gym, you improve rapidly — suddenly you can hold on longer, read routes better, trust your feet. But after months of climbing, gains come slower. You might train for a year to shave seconds off your send time. The easy gains are gone.
Models work the same way. With 10 examples, a model is guessing. With 100, it starts picking up patterns. With 1,000, it might be pretty good. But at some point, more data stops helping — you've hit a plateau.
Why this matters: Learning curves tell you whether your problem is data-limited. If the curve is still climbing steeply, collecting more data will help. If it's flat, you need a better model or better features — more data won't save you.
What to look for
When you plot train and test accuracy vs. dataset size, watch for:
- The plateau: Where does test accuracy stop improving?
- The gap: How far apart are train and test accuracy? A big gap means overfitting. As you add data, the gap should shrink.
- The starting point: How good is the model with very little data? This tells you how much signal is already in your features.
Your task
- Use balanced sampling with per-class training sizes: 50, 100, 200, 300, 400, 500, 600, 700, 800, and the maximum available
- Plot:
- Train accuracy
- Test accuracy
Questions:
- Does performance continue increasing indefinitely, or does it plateau?
- Does the train/test gap shrink as you add more data?
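The learning-curve loop above can be sketched as follows. This is a minimal sketch on synthetic stand-in data (random features with a planted signal), not the real ESM embeddings; in your notebook you would reuse the X and y built in Exercise 1, and the sizes listed in the task. The fixed held-out test set follows the "keep it fixed" rule from Exercise 2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins for X (embeddings) and y (membrane labels)
X = rng.normal(size=(2000, 32))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

# One fixed, stratified test set reused at every training size
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

sizes = [50, 100, 200, 300, 400, 500]   # examples per class
train_acc, test_acc = [], []
for n in sizes:
    # Balanced subsample: n examples from each class of the training pool
    idx = np.concatenate([
        rng.choice(np.where(y_train == c)[0], size=n, replace=False)
        for c in (0, 1)])
    clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    train_acc.append(clf.score(X_train[idx], y_train[idx]))
    test_acc.append(clf.score(X_test, y_test))

# To plot: plt.plot(sizes, train_acc) and plt.plot(sizes, test_acc)
```

Watch how the train/test gap behaves as `n` grows; on real data the plateau, not the exact numbers, is what you interpret.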
Exercise 2: Mini Test (Small Data Regime)
Goal: Observe high variance behavior.
Why balanced sampling?
In this dataset, membrane and non-membrane proteins aren't equally common. If you randomly grab 50 proteins, you might get 40 of one class and 10 of the other — just by chance. That makes learning curves hard to interpret: is performance changing because you have more data, or because the class ratio shifted?
The key principle for learning curves is consistency: keep the class ratio constant across all training set sizes. That way, the only thing changing is the quantity of data, not the distribution.
What ratio should you use? It depends on your problem:
- If your real-world deployment uses imbalanced data (say, 90/10), your learning curve subsets should maintain that same 90/10 ratio at every size
- If you're studying a balanced problem, use 50/50 subsets
For this exercise, we use balanced sampling (50/50) because it's cleaner for learning — a model that just guesses the majority class can only get 50% accuracy, so there's no free lunch.
Important: Your test set should also stay consistent. If your evaluation set shifts as you build the curve, you can't tell what's changing — training or evaluation. Keep it fixed.
Your task
- Train logistic regression using: 5, 10, 20, 30, 40, 50, 60, 70, 80, 90 proteins per class
- Use balanced sampling (equal numbers of membrane and non-membrane)
- Plot:
- Training accuracy
- Test accuracy
Questions:
- What happens to training accuracy at very small sizes?
- Why is test accuracy unstable?
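One way to see the instability directly is to redraw the tiny training set several times and watch the test accuracy swing. The sketch below uses synthetic stand-in data (not the real embeddings); the balanced-subsample helper and the fixed test set follow the principles described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic stand-ins for the real features and membrane labels
X = rng.normal(size=(1000, 16))
y = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

def balanced_subsample(n_per_class):
    # Equal numbers from each class of the training pool
    return np.concatenate([
        rng.choice(np.where(y_tr == c)[0], size=n_per_class, replace=False)
        for c in (0, 1)])

# Redraw a tiny training set 10 times: test accuracy varies run to run
scores = []
for _ in range(10):
    idx = balanced_subsample(5)   # only 5 examples per class
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    scores.append(clf.score(X_te, y_te))
print(f"test accuracy across 10 draws: min={min(scores):.2f}, max={max(scores):.2f}")
```

The spread between the min and max is the "high variance behavior" this exercise asks you to observe: with 5 examples per class, which 5 you happen to draw matters a lot.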
Exercise 1: Build Feature Matrix
Goal: Construct X and y.
- Keep only rows where Entry exists in emb_dict
- Build:
- X = stacked embeddings
- y = membrane column
Do not overcomplicate this.
Questions:
- What is the shape of X?
- What fraction of proteins are membrane proteins?
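The filter-and-stack step can be sketched as below. This uses a toy metadata frame and a toy embedding dict as stand-ins for the real files; the 1280-dimensional vectors are an assumption about the ESM2 embedding size, and in your notebook `df` and `emb_dict` come from Exercise 0.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the loaded CSV and pickle (Exercise 0)
df = pd.DataFrame({
    "Entry": ["P0A7V8", "P0A800", "P99999"],
    "membrane": [1, 0, 1],
})
emb_dict = {
    "P0A7V8": np.zeros(1280),   # 1280-dim vectors: an assumption about ESM2
    "P0A800": np.ones(1280),
}

# Keep only rows whose Entry has an embedding
df = df[df["Entry"].isin(emb_dict)].reset_index(drop=True)

# Stack embeddings in the same row order as the filtered metadata
X = np.stack([emb_dict[e] for e in df["Entry"]])
y = df["membrane"].to_numpy()
print(X.shape, y.mean())   # shape of X, fraction of membrane proteins
```

The one thing to get right is order: `X` is built by iterating over `df["Entry"]` after filtering, so row i of `X` matches row i of `y`.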
Exercise 0: The Knot Check (Data Integrity)
Goal: Confirm that metadata and embeddings align.
In your notebook:
- Load the CSV file
- Load the .pkl file
- Confirm how many entries exist in each
Hint for loading pickle files:
    import pickle

    with open("83333_complete_esm2.pkl", "rb") as f:
        emb_dict = pickle.load(f)
Questions:
- How many proteins are in the CSV?
- How many embeddings are in the .pkl file?
- What is the dimension of each embedding vector?
- Do all CSV entries have embeddings?
- How can you correctly match each protein's embedding to its membrane label?
Success check:
- You confirm the number of matched proteins
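A minimal sketch of the knot check, run here on small toy files created on the fly (stand-ins for ecoli_membrane_dataset.csv and 83333_complete_esm2.pkl, so the counts below are illustrative only):

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Build toy stand-ins for the real CSV and pickle files
tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, "toy_metadata.csv")
pkl_path = os.path.join(tmp, "toy_embeddings.pkl")
pd.DataFrame({"Entry": ["A1", "B2", "C3"],
              "membrane": [1, 0, 0]}).to_csv(csv_path, index=False)
with open(pkl_path, "wb") as f:
    pickle.dump({"A1": np.zeros(4), "B2": np.ones(4)}, f)

# The knot check: do metadata and embeddings line up?
df = pd.read_csv(csv_path)
with open(pkl_path, "rb") as f:
    emb_dict = pickle.load(f)

n_csv = len(df)
n_emb = len(emb_dict)
dim = next(iter(emb_dict.values())).shape[0]
matched = int(df["Entry"].isin(emb_dict).sum())
print(f"{n_csv} CSV rows, {n_emb} embeddings of dim {dim}, {matched} matched")
```

On the real files, replace the toy paths with the two downloaded filenames; the matched count is the number you report in the success check.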
Files You Need
Download BOTH files and upload them to your notebook:
- ecoli_membrane_dataset.csv (click to download)
- 83333_complete_esm2.pkl (click to download)
What's in the CSV?
| Column | Meaning |
|---|---|
| Entry | UniProt accession (unique protein ID) |
| Protein names | Descriptive protein name |
| Gene Names | Associated gene |
| Length | Sequence length |
| membrane | 1 = membrane protein, 0 = not membrane |
What's in the pickle file?
The embeddings file (.pkl) is a dictionary:
    {
        "P0A7V8": numpy array (embedding vector),
        "P0A800": numpy array,
        ...
    }
- Keys = UniProt Entry IDs
- Values = ESM protein embeddings
Why this route exists
In machine learning, one of the most common questions is:
"How much data do I need?"
There is no universal answer. Instead of guessing, we measure.
In this route, you will:
- Use real protein embeddings from E. coli
- Classify membrane vs non-membrane proteins
- Train the same model on progressively larger subsets
- Construct learning curves
- Investigate the effect of class balance
- Decide whether the task is data-limited
This route is about diagnosis, not just model training.
What you'll be able to do after this route
By the end, you can:
- Load .pkl embedding files
- Match embeddings to protein metadata using UniProt Entry IDs
- Construct learning curves
- Interpret train/test gaps
- Reason about class imbalance
- Determine whether collecting more data will likely help
Route 029: Learning Curves — How Much Data Do I Need?
- RouteID: 029
- Wall: The Machine Learning Offwidth (W06)
- Grade: 5.8
- Routesetter: Sarah
- Time: ~30 minutes
- Dataset: E. coli proteome + ESM protein embeddings
🧗 Base Camp
Start here and climb your way up!