Route 018: Choose Your Fighter
- RouteID: 018
- Wall: The Machine Learning Offwidth (W06)
- Grade: 5.7
- Routesetter: Adrian
- Time: ~25 minutes
- Dataset: M. tuberculosis proteome with cellular location labels
Why this route exists
In R017, you trained a logistic regression model. But logistic regression is just one tool in the toolbox. There are other models — with different strengths and weaknesses — that might work better for your problem.
In this route, you'll meet the three workhorses of classical machine learning:
- Logistic Regression — fast, interpretable, works well when relationships are simple
- Random Forest — an ensemble of decision trees, handles complex patterns
- XGBoost — boosted trees, often the champion on tabular data
These are your quick wins. When your PI says "can we predict X from Y?", these are the first tools you reach for. No GPUs required, no deep learning complexity, just solid models that work on tabular data. In many real-world scenarios, one of these three will be your final model. Master them now, and you'll always have a strong starting point.
You'll train all three on the same dataset and see which one wins.
What you'll be able to do after this route
By the end, you can:
- Train three different classifiers on the same data
- Compare model performance using accuracy
- Explain the basic intuition behind each model type
- Make an informed choice about which model to use
Key definitions
Logistic Regression A linear model that draws a straight boundary between classes. Fast and interpretable, but struggles with complex patterns.
Random Forest An ensemble of many decision trees, each trained on a random subset of data. Combines their votes for the final prediction. Robust and handles non-linear relationships.
XGBoost Boosted decision trees — each new tree focuses on fixing the mistakes of the previous ones. Often the top performer on structured/tabular data.
Ensemble A model that combines multiple simpler models to make better predictions.
Exercise 0: Setup
Goal: Load the data and prepare for training.
Use the same setup from R017:
- Load the Mtb dataset and clean the localization labels
- Build your feature matrix `X` with length + amino acid counts (21 features)
- Split into train/test sets (80/20, `random_state=42`)
If you still have your R017 notebook, copy the setup code from there. This should be quick.
Success check:
- `X` has shape `(n_proteins, 21)`
- `y` has your cleaned location labels
- Data is split into train/test
Exercise 1: Train All Three Models
Goal: Train logistic regression, random forest, and XGBoost on your training data.
- Create and fit a `LogisticRegression` model (you know this one from R017)
- Create and fit a `RandomForestClassifier`
- Create and fit an `XGBClassifier`
Hints:
- `LogisticRegression` is in `sklearn.linear_model`
- `RandomForestClassifier` is in `sklearn.ensemble`
- `XGBClassifier` is in the `xgboost` package (install with `!pip install xgboost` if needed)
- For Random Forest and XGBoost, try `n_estimators=100` and `random_state=42`
Success check:
- All three models train without errors
- You have three fitted model objects
Exercise 2: Compare Performance
Goal: Evaluate all three models on the test set.
- Get predictions from each model on `X_test`
- Compute accuracy for each using `accuracy_score`
- Display results side by side (print, table, however you like)
Hint: You could loop through your models, or just call `.predict()` and `accuracy_score()` three times. Your choice.
Questions:
- Which model has the highest accuracy?
- Are the differences large or small?
- Does the winner surprise you?
Success check:
- You have accuracy scores for all three models
- You can identify the best performer
Exercise 3: Meet Your Fighters (Intuition)
Goal: Understand why these models behave differently.
Read the descriptions below, then answer the questions:
Logistic Regression fits a linear boundary. Imagine drawing a straight line (or plane) to separate classes. If the true boundary is curved or complex, logistic regression will struggle.
Random Forest builds many decision trees, each asking simple questions about the features, like "Is length > 500?" or "Is the alanine count > 40?" Each tree votes, and the majority wins. This lets it capture non-linear patterns.
XGBoost also uses decision trees, but trains them sequentially. Each new tree focuses on the examples the previous trees got wrong. This "boosting" often squeezes out extra performance.
Questions to answer:
- If your data has a simple, linear relationship between features and labels, which model would you expect to do well?
- If your data has complex, non-linear patterns, which models might do better?
- Why might a simpler model sometimes be preferred even if it's slightly less accurate?
Success check:
- You can explain the basic idea behind each model
- You understand that "more complex" doesn't always mean "better"
Exercise 4: Visualize the Decision Boundaries (Optional)
Goal: See how each model carves up the feature space.
This exercise is optional but illuminating. Create a 2D plot showing the decision boundaries of each model.
Hint: Search "sklearn plot decision boundary" or ask your favorite chatbot for a helper function. The idea:
- Create a grid of points covering your feature space
- Predict the class for each grid point
- Color the grid by predicted class
- Overlay your actual data points
Success check:
- You can see that logistic regression draws straight lines
- Random forest and XGBoost draw more complex boundaries
Exercise 5: Reflection
Goal: Consolidate your understanding.
Answer in your notebook (1-2 sentences each):
- Which model won on this dataset? Why do you think that is?
- When would you choose logistic regression over XGBoost?
- What other features (beyond length and amino acid counts) might help improve predictions?
Deliverables
Submit your completed notebook (.ipynb) with:
- All three models trained and evaluated
- Accuracy comparison clearly displayed
- Reflection answers in markdown cells
🎉 Route Complete!
Great work!