🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Both files loaded and matched using Entry IDs
- Learning curve plots (small and large data regimes)
- Written responses to all questions
- Brief reflection (3–5 sentences)
Exercise 4: The Real Question — Why Is It So Good So Early?
Goal: Reflect deeply.
You likely observed:
- Test accuracy is already high with moderate data
- Performance stabilizes early
Write 3–5 sentences answering:
- Why is the model already performing well at small dataset sizes?
- What information might ESM embeddings already encode?
- Does this suggest membrane classification is:
- An easy task?
- A data-limited task?
- A feature-limited task?
Exercise 3: Scaling Up (Learning Curve)
Goal: Determine when performance stabilizes.
What is a learning curve?
A learning curve answers a simple question: how does my model improve as I give it more training data?
Think of it like learning to climb. Your first few sessions at the gym, you improve rapidly — suddenly you can hold on longer, read routes better, trust your feet. But after months of climbing, gains come slower. You might train for a year to shave seconds off your send time. The easy gains are gone.
Models work the same way. With 10 examples, a model is guessing. With 100, it starts picking up patterns. With 1,000, it might be pretty good. But at some point, more data stops helping — you've hit a plateau.
Why this matters: Learning curves tell you whether your problem is data-limited. If the curve is still climbing steeply, collecting more data will help. If it's flat, you need a better model or better features — more data won't save you.
What to look for
When you plot train and test accuracy vs. dataset size, watch for:
- The plateau: Where does test accuracy stop improving?
- The gap: How far apart are train and test accuracy? A big gap means overfitting. As you add data, the gap should shrink.
- The starting point: How good is the model with very little data? This tells you how much signal is already in your features.
Your task
- Use balanced sampling with per-class training sizes: 50, 100, 200, 300, 400, 500, 600, 700, 800, and the maximum available
- Plot:
- Train accuracy
- Test accuracy
Questions:
- Does performance continue increasing indefinitely, or does it plateau?
- Does the train/test gap shrink as you add more data?
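The learning-curve loop above can be sketched as follows. This is a minimal sketch on synthetic stand-in data (random features with a planted signal), not the real ESM embeddings; in your notebook you would reuse the X and y built in Exercise 1, and the sizes listed in the task. The fixed held-out test set follows the "keep it fixed" rule from Exercise 2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins for X (embeddings) and y (membrane labels)
X = rng.normal(size=(2000, 32))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

# One fixed, stratified test set reused at every training size
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

sizes = [50, 100, 200, 300, 400, 500]   # examples per class
train_acc, test_acc = [], []
for n in sizes:
    # Balanced subsample: n examples from each class of the training pool
    idx = np.concatenate([
        rng.choice(np.where(y_train == c)[0], size=n, replace=False)
        for c in (0, 1)])
    clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    train_acc.append(clf.score(X_train[idx], y_train[idx]))
    test_acc.append(clf.score(X_test, y_test))

# To plot: plt.plot(sizes, train_acc) and plt.plot(sizes, test_acc)
```

Watch how the train/test gap behaves as `n` grows; on real data the plateau, not the exact numbers, is what you interpret.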
Exercise 2: Mini Test (Small Data Regime)
Goal: Observe high variance behavior.
Why balanced sampling?
In this dataset, membrane and non-membrane proteins aren't equally common. If you randomly grab 50 proteins, you might get 40 of one class and 10 of the other — just by chance. That makes learning curves hard to interpret: is performance changing because you have more data, or because the class ratio shifted?
The key principle for learning curves is consistency: keep the class ratio constant across all training set sizes. That way, the only thing changing is the quantity of data, not the distribution.
What ratio should you use? It depends on your problem:
- If your real-world deployment uses imbalanced data (say, 90/10), your learning curve subsets should maintain that same 90/10 ratio at every size
- If you're studying a balanced problem, use 50/50 subsets
For this exercise, we use balanced sampling (50/50) because it's cleaner for learning — a model that just guesses the majority class can only get 50% accuracy, so there's no free lunch.
Important: Your test set should also stay consistent. If your evaluation set shifts as you build the curve, you can't tell what's changing — training or evaluation. Keep it fixed.
Your task
- Train logistic regression using: 5, 10, 20, 30, 40, 50, 60, 70, 80, 90 proteins per class
- Use balanced sampling (equal numbers of membrane and non-membrane)
- Plot:
- Training accuracy
- Test accuracy
Questions:
- What happens to training accuracy at very small sizes?
- Why is test accuracy unstable?
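One way to see the instability directly is to redraw the tiny training set several times and watch the test accuracy swing. The sketch below uses synthetic stand-in data (not the real embeddings); the balanced-subsample helper and the fixed test set follow the principles described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic stand-ins for the real features and membrane labels
X = rng.normal(size=(1000, 16))
y = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

def balanced_subsample(n_per_class):
    # Equal numbers from each class of the training pool
    return np.concatenate([
        rng.choice(np.where(y_tr == c)[0], size=n_per_class, replace=False)
        for c in (0, 1)])

# Redraw a tiny training set 10 times: test accuracy varies run to run
scores = []
for _ in range(10):
    idx = balanced_subsample(5)   # only 5 examples per class
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    scores.append(clf.score(X_te, y_te))
print(f"test accuracy across 10 draws: min={min(scores):.2f}, max={max(scores):.2f}")
```

The spread between the min and max is the "high variance behavior" this exercise asks you to observe: with 5 examples per class, which 5 you happen to draw matters a lot.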
Exercise 1: Build Feature Matrix
Goal: Construct X and y.
- Keep only rows where Entry exists in emb_dict
- Build:
- X = stacked embeddings
- y = membrane column
Do not overcomplicate this.
Questions:
- What is the shape of X?
- What fraction of proteins are membrane proteins?
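The filter-and-stack step can be sketched as below. This uses a toy metadata frame and a toy embedding dict as stand-ins for the real files; the 1280-dimensional vectors are an assumption about the ESM2 embedding size, and in your notebook `df` and `emb_dict` come from Exercise 0.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the loaded CSV and pickle (Exercise 0)
df = pd.DataFrame({
    "Entry": ["P0A7V8", "P0A800", "P99999"],
    "membrane": [1, 0, 1],
})
emb_dict = {
    "P0A7V8": np.zeros(1280),   # 1280-dim vectors: an assumption about ESM2
    "P0A800": np.ones(1280),
}

# Keep only rows whose Entry has an embedding
df = df[df["Entry"].isin(emb_dict)].reset_index(drop=True)

# Stack embeddings in the same row order as the filtered metadata
X = np.stack([emb_dict[e] for e in df["Entry"]])
y = df["membrane"].to_numpy()
print(X.shape, y.mean())   # shape of X, fraction of membrane proteins
```

The one thing to get right is order: `X` is built by iterating over `df["Entry"]` after filtering, so row i of `X` matches row i of `y`.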
Exercise 0: The Knot Check (Data Integrity)
Goal: Confirm that metadata and embeddings align.
In your notebook:
- Load the CSV file
- Load the .pkl file
- Confirm how many entries exist in each
Hint for loading pickle files:
    import pickle

    with open("83333_complete_esm2.pkl", "rb") as f:
        emb_dict = pickle.load(f)
Questions:
- How many proteins are in the CSV?
- How many embeddings are in the .pkl file?
- What is the dimension of each embedding vector?
- Do all CSV entries have embeddings?
- How can you correctly match each protein's embedding to its membrane label?
Success check:
- You confirm the number of matched proteins
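A minimal sketch of the knot check, run here on small toy files created on the fly (stand-ins for ecoli_membrane_dataset.csv and 83333_complete_esm2.pkl, so the counts below are illustrative only):

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Build toy stand-ins for the real CSV and pickle files
tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, "toy_metadata.csv")
pkl_path = os.path.join(tmp, "toy_embeddings.pkl")
pd.DataFrame({"Entry": ["A1", "B2", "C3"],
              "membrane": [1, 0, 0]}).to_csv(csv_path, index=False)
with open(pkl_path, "wb") as f:
    pickle.dump({"A1": np.zeros(4), "B2": np.ones(4)}, f)

# The knot check: do metadata and embeddings line up?
df = pd.read_csv(csv_path)
with open(pkl_path, "rb") as f:
    emb_dict = pickle.load(f)

n_csv = len(df)
n_emb = len(emb_dict)
dim = next(iter(emb_dict.values())).shape[0]
matched = int(df["Entry"].isin(emb_dict).sum())
print(f"{n_csv} CSV rows, {n_emb} embeddings of dim {dim}, {matched} matched")
```

On the real files, replace the toy paths with the two downloaded filenames; the matched count is the number you report in the success check.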
Files You Need
Download BOTH files and upload them to your notebook:
- ecoli_membrane_dataset.csv (click to download)
- 83333_complete_esm2.pkl (click to download)
What's in the CSV?
| Column | Meaning |
|---|---|
| Entry | UniProt accession (unique protein ID) |
| Protein names | Descriptive protein name |
| Gene Names | Associated gene |
| Length | Sequence length |
| membrane | 1 = membrane protein, 0 = not membrane |
What's in the pickle file?
The embeddings file (.pkl) is a dictionary:
    {
        "P0A7V8": numpy array (embedding vector),
        "P0A800": numpy array,
        ...
    }
- Keys = UniProt Entry IDs
- Values = ESM protein embeddings
Why this route exists
In machine learning, one of the most common questions is:
"How much data do I need?"
There is no universal answer. Instead of guessing, we measure.
In this route, you will:
- Use real protein embeddings from E. coli
- Classify membrane vs non-membrane proteins
- Train the same model on progressively larger subsets
- Construct learning curves
- Investigate the effect of class balance
- Decide whether the task is data-limited
This route is about diagnosis, not just model training.
What you'll be able to do after this route
By the end, you can:
- Load .pkl embedding files
- Match embeddings to protein metadata using UniProt Entry IDs
- Construct learning curves
- Interpret train/test gaps
- Reason about class imbalance
- Determine whether collecting more data will likely help
Route 029: Learning Curves — How Much Data Do I Need?
- RouteID: 029
- Wall: The Machine Learning Offwidth (W06)
- Grade: 5.8
- Routesetter: Sarah
- Time: ~30 minutes
- Dataset: E. coli proteome + ESM protein embeddings
🧗 Base Camp
Start here and climb your way up!