
Your First Classifier

Route ID: R017 • Wall: W06 • Released: Feb 5, 2026

Grade: 5.7


Route 017: Your First Classifier

  • RouteID: 017
  • Wall: The Machine Learning Offwidth (W06)
  • Grade: 5.7
  • Routesetter: Adrian
  • Time: ~40 minutes
  • Dataset: M. tuberculosis proteome with sequences and cellular location labels

Why this route exists

You've worked with protein data. You've computed similarities and made visualizations. But you haven't yet asked a computer to learn something from data and make predictions.

That changes now.

In this route, you'll train your first machine learning classifier. Given information about a protein, the model will learn to predict where in the cell that protein lives. But here's the key insight: machine learning models only understand numbers. You can't feed a model a protein sequence directly. You have to describe that sequence numerically first.

This process of turning raw data into numbers is called featurization, and the numbers you create are called features. Think of it like describing a person to someone who's never seen them: you might say "tall, brown hair, glasses." Those are features. For proteins, you might say "500 amino acids long, lots of cysteines, very hydrophobic." Same idea.

In this route, you'll build features yourself, starting from raw sequences. By the end, you'll appreciate what goes into turning biology into numbers.

Setting expectations: This route is designed to get you through one clean iteration of the ML workflow as quickly as possible. Real-world ML is messier: datasets are imbalanced, features need more thought, models overfit, evaluation is tricky. We'll tackle all of that in later routes. For now, focus on the basic rhythm: features → split → train → predict → evaluate.

What you'll be able to do after this route

By the end, you can:

  • Extract features from protein sequences (length, amino acid counts)
  • Understand what "featurization" means and why it matters
  • Split data into training and test sets
  • Train a logistic regression classifier using scikit-learn
  • Evaluate accuracy and interpret what it means

Key definitions

Classifier: A model that predicts which category something belongs to. (As opposed to regression, which predicts a continuous number.)

Features: The input variables the model uses to make predictions. You have to decide what these are! In this route, you'll build them from sequences.

Featurization: The process of turning raw data (like a protein sequence) into numbers a model can use.

Labels: The thing you're trying to predict. In this route: cellular location (membrane, cytoplasm, or secreted).

Training set: Data the model learns from.

Test set: Data the model has never seen, used to evaluate how well it generalizes.


Exercise 0: Load the Data

Goal: Load the M. tuberculosis proteome with sequences and cellular location annotations.

  1. Download the dataset: mtb_with_localization.xlsx
  2. Load it into a pandas DataFrame (hint: pd.read_excel())
  3. Inspect the columns and the first few rows

Success check:

  • You can print the first few rows
  • You see columns including Entry, Sequence, and Subcellular location [CC]

Exercise 1: Clean the Labels (Welcome to Real Data)

Goal: Parse the messy localization column into clean labels.

Take a look at the Subcellular location [CC] column. It contains entries like:

"SUBCELLULAR LOCATION: Cell membrane {ECO:0000255|PROSITE-ProRule:PRU00303}; Lipid-anchor..."
"SUBCELLULAR LOCATION: Cytoplasm {ECO:0000305}."
"SUBCELLULAR LOCATION: Secreted {ECO:0000269|PubMed:10986245}. Host cytoplasm..."

This is what real biological data looks like! Your job is to parse this into clean categories. Let's simplify to three labels: membrane, cytoplasm, and secreted.

  1. Write a function that takes a localization string and returns one of: "membrane", "cytoplasm", "secreted", or None (if unclear)
  2. Apply it to create a new location column
  3. Drop rows where location is None
  4. Check: how many proteins in each category?

Hints:

  • Look for keywords like "membrane", "cytoplasm", "secreted" in the string
  • Python's in operator checks if a substring exists: "membrane" in text.lower()
  • Some proteins have multiple locations listed. Pick the first one that matches, or use your judgment.
  • It's okay if your parsing isn't perfect. This is a learning exercise, not a research paper.
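If you get stuck, here is one possible shape for such a parser. This is a sketch, not the official solution: the keyword set comes from the exercise, but breaking ties by whichever keyword appears earliest in the string is a judgment call (the hints explicitly leave this to you).

```python
def parse_location(text):
    """Map a raw 'Subcellular location [CC]' string to one of three labels.

    Returns 'membrane', 'cytoplasm', 'secreted', or None. When several
    keywords appear, the one occurring earliest in the string wins.
    """
    if not isinstance(text, str):  # handles NaN/missing values from pandas
        return None
    lowered = text.lower()
    # Position of each keyword in the string (-1 means absent).
    hits = {kw: lowered.find(kw) for kw in ("membrane", "cytoplasm", "secreted")}
    hits = {kw: pos for kw, pos in hits.items() if pos != -1}
    if not hits:
        return None
    return min(hits, key=hits.get)
```

You would then apply it with something like `df["location"] = df["Subcellular location [CC]"].apply(parse_location)` and drop the `None` rows.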

Success check:

  • You have a clean location column with three categories
  • You know how many proteins are in each category

Exercise 2: Your First Feature (Length)

Goal: Create a simple feature from the sequence.

The simplest thing you can compute from a sequence is its length. Let's start there.

  1. Create a new column length that contains the length of each sequence
  2. Print the min, max, and mean length
  3. Create your feature matrix X containing just this one column, and your label vector y containing the location column

Hints:

  • The pandas .apply() method runs a function on every value in a column. Python's len() function returns the length of a string.
  • To select one column as a vector: df['column_name']
  • To select one column but keep it as a DataFrame (which sklearn prefers for X): df[['column_name']] (note the double brackets)
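Put together, steps 1-3 might look like the sketch below. The toy sequences and `location` values are made up for illustration; in your notebook, `df` is the cleaned DataFrame from Exercise 1.

```python
import pandas as pd

# Toy stand-in for the cleaned DataFrame from Exercise 1.
df = pd.DataFrame({
    "Sequence": ["MKT", "MKTAYIAK", "MK"],
    "location": ["cytoplasm", "membrane", "secreted"],
})

# One feature: sequence length.
df["length"] = df["Sequence"].apply(len)
print(df["length"].min(), df["length"].max(), df["length"].mean())

X = df[["length"]]   # double brackets: DataFrame of shape (n_proteins, 1)
y = df["location"]   # single brackets: Series of shape (n_proteins,)
```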

Success check:

  • You have a length column
  • X has shape (n_proteins, 1)
  • y has shape (n_proteins,)

Exercise 3: Build More Features (Amino Acid Counts)

Goal: Create richer features by counting amino acids.

Length alone probably isn't enough. What else can we extract from a sequence? One idea: count how often each amino acid appears. A protein with lots of hydrophobic residues might behave differently from one with lots of charged residues.

  1. Write a function that takes a sequence and returns a dictionary of amino acid counts (one count per amino acid: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y)
  2. Apply this function to all sequences and add the counts as new columns in your DataFrame
  3. Update your feature matrix X to include length AND all 20 amino acid counts

Hint: Python strings have a .count() method. You can use .apply() with a function that returns a dictionary, then expand it into columns.
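One way to wire that hint together (a sketch with toy two-sequence data; the `AMINO_ACIDS` string and `aa_counts` name are ours):

```python
import pandas as pd

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_counts(seq):
    """Count each of the 20 standard amino acids in a sequence."""
    return {aa: seq.count(aa) for aa in AMINO_ACIDS}

df = pd.DataFrame({"Sequence": ["MKTA", "CCGG"]})  # toy sequences
df["length"] = df["Sequence"].apply(len)

# .apply(aa_counts) gives a Series of dicts; .apply(pd.Series) expands
# each dict into 20 separate columns.
counts = df["Sequence"].apply(aa_counts).apply(pd.Series)
df = pd.concat([df, counts], axis=1)

X = df[["length"] + list(AMINO_ACIDS)]  # 21 features per protein
```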

Questions:

  • How many features do you have now?
  • Why might amino acid composition help predict cellular location?

Biology nudge: If you're not sure what a membrane is or why its proteins might be different, no worries! Go to your favorite chatbot and try prompts like:

  • "What is a cell membrane made of?"
  • "Why do membrane proteins have lots of hydrophobic amino acids?"

Take 5 minutes to learn, then come back.

Success check:

  • You have 21 features (length + 20 amino acids)
  • X has shape (n_proteins, 21)

Exercise 4: Split the Data

Goal: Divide your data into training and test sets.

Before training, you need to set aside some data for testing. The model will learn from the training set, and you'll evaluate it on the test set (data it has never seen).

  1. Use scikit-learn's train_test_split to split X and y into training (80%) and test (20%) sets
  2. Use random_state=42 so your results are reproducible
  3. Print the sizes of your train and test sets

Hint: Look up sklearn.model_selection.train_test_split.
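The call looks roughly like this (synthetic arrays stand in for your real X and y):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 50 "proteins" with 2 features each, binary labels.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 40 10
```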

Success check:

  • Training set is ~80% of the data
  • Test set is ~20% of the data
  • You have four variables: X_train, X_test, y_train, y_test

Exercise 5: Train the Model

Goal: Train a logistic regression classifier on your training data.

  1. Import LogisticRegression from scikit-learn
  2. Create a model instance (you may need to set max_iter=1000 to give the solver enough iterations to converge)
  3. Call .fit() on your training data

That's it. After fitting, the model has learned patterns from your hand-crafted features.

Hint: Look up sklearn.linear_model.LogisticRegression.
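A minimal sketch of steps 1-3, using synthetic 3-class data in place of your real 21-feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic training data: 60 samples, 21 features, 3 classes.
X_train, y_train = make_classification(
    n_samples=60, n_features=21, n_informative=5, n_classes=3, random_state=0
)

model = LogisticRegression(max_iter=1000)  # extra iterations help convergence
model.fit(X_train, y_train)  # this is where the learning happens
```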

Success check:

  • No errors when fitting
  • You understand that .fit() is where learning happens

Exercise 6: Predict and Evaluate

Goal: See how well your features work.

  1. Use your trained model to make predictions on the test set (.predict())
  2. Compare predictions to the true labels using accuracy_score from scikit-learn
  3. Print the accuracy as a percentage

Questions to answer:

  • What accuracy did you get?
  • If you guessed the most common class every time, what accuracy would you get? (This is called the "baseline.")
  • Is your model doing better than the baseline?

Hint: Look up sklearn.metrics.accuracy_score.
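Evaluation and the baseline comparison might look like this sketch (toy label arrays stand in for `y_test` and `model.predict(X_test)`):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy stand-ins for the true test labels and the model's predictions.
y_test = np.array(["membrane", "cytoplasm", "membrane", "secreted"])
y_pred = np.array(["membrane", "cytoplasm", "cytoplasm", "secreted"])

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.1%}")  # 3 of 4 correct -> 75.0%

# Baseline: always guess the most common class in the test set.
_, counts = np.unique(y_test, return_counts=True)
baseline = counts.max() / len(y_test)
print(f"Baseline: {baseline:.1%}")
```

A model is only interesting insofar as it beats this baseline.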

Success check:

  • You computed an accuracy score
  • You can explain whether the model is useful or not

Exercise 7: Bonus Challenge (Optional)

Goal: See if you can improve accuracy with better features.

Try one or more of these:

  • Normalize amino acid counts by sequence length (frequencies instead of raw counts)
  • Add features for specific amino acid properties (e.g., count of hydrophobic residues)
  • Remove features that don't seem helpful

Does your accuracy improve?
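The first bullet, normalizing counts into frequencies, can be sketched like this (the helper name and alphabet string are ours):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequencies(seq):
    """Amino acid frequencies: raw counts divided by sequence length."""
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}

freqs = aa_frequencies("CCGG")  # half C, half G, everything else 0
```

Frequencies remove the confound that longer proteins have more of every amino acid, letting the model see composition independently of length.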


Exercise 8: Reflection

Goal: Consolidate what you learned.

Answer in your notebook (1-2 sentences each):

  1. What does fit() do? What does predict() do?

  2. Why do we split data into train and test sets?

  3. You built features by hand (length, amino acid counts). What other features could you imagine extracting from a protein sequence?

  4. Sneak preview: In the Protein Representations wall, you used embeddings from protein language models (PLMs). Those embeddings are 1024-dimensional vectors that capture patterns learned from millions of sequences. How do you think PLM embeddings compare to the hand-crafted features you built today?


Deliverables

Submit your completed notebook (.ipynb) with:

  1. All code cells executed
  2. Your accuracy score clearly displayed
  3. Reflection answers in markdown cells

Submission

Submit your notebook here

🎉 Route Complete!

Great work!