Route 018: Choose Your Fighter
- RouteID: 018
- Wall: The Machine Learning Offwidth (W06)
- Grade: 5.7
- Routesetter: Adrian
- Time: ~25 minutes
- Dataset: M. tuberculosis proteome with cellular location labels
Why this route exists
In R017, you trained a logistic regression model. But logistic regression is just one tool in the toolbox. There are other models — with different strengths and weaknesses — that might work better for your problem.
In this route, you'll meet the three workhorses of classical machine learning:
- Logistic Regression — fast, interpretable, works well when relationships are simple
- Random Forest — an ensemble of decision trees, handles complex patterns
- XGBoost — boosted trees, often the champion on tabular data
These are your quick wins. When your PI says "can we predict X from Y?", these are the first tools you reach for. No GPUs required, no deep learning complexity, just solid models that work on tabular data. In many real-world scenarios, one of these three will be your final model. Master them now, and you'll always have a strong starting point.
You'll train all three on the same dataset and see which one wins.
What you'll be able to do after this route
By the end, you can:
- Train three different classifiers on the same data
- Compare model performance using accuracy
- Explain the basic intuition behind each model type
- Make an informed choice about which model to use
Key definitions
Logistic Regression A linear model that draws a straight boundary between classes. Fast and interpretable, but struggles with complex patterns.
Random Forest An ensemble of many decision trees, each trained on a random subset of data. Combines their votes for the final prediction. Robust and handles non-linear relationships.
XGBoost Boosted decision trees — each new tree focuses on fixing the mistakes of the previous ones. Often the top performer on structured/tabular data.
Ensemble A model that combines multiple simpler models to make better predictions.
Exercise 0: Setup
Goal: Load the data and prepare for training.
Use the same setup from R017:
- Load the Mtb dataset and clean the localization labels
- Build your feature matrix `X` with length + amino acid counts (21 features)
- Split into train/test sets (80/20, `random_state=42`)
If you still have your R017 notebook, copy the setup code from there. This should be quick.
Success check:
- `X` has shape `(n_proteins, 21)`
- `y` has your cleaned location labels
- Data is split into train/test
Exercise 1: Train All Three Models
Goal: Train logistic regression, random forest, and XGBoost on your training data.
- Create and fit a `LogisticRegression` model (you know this one from R017)
- Create and fit a `RandomForestClassifier`
- Create and fit an `XGBClassifier`
Hints:
- `LogisticRegression` is in `sklearn.linear_model`
- `RandomForestClassifier` is in `sklearn.ensemble`
- `XGBClassifier` is in the `xgboost` package (install with `!pip install xgboost` if needed)
- For Random Forest and XGBoost, try `n_estimators=100` and `random_state=42`
Success check:
- All three models train without errors
- You have three fitted model objects
Exercise 2: Compare Performance
Goal: Evaluate all three models on the test set.
- Get predictions from each model on `X_test`
- Compute accuracy for each using `accuracy_score`
- Display results side by side (print, table, however you like)
Hint: You could loop through your models, or just call `.predict()` and `accuracy_score()` three times. Your choice.
Questions:
- Which model has the highest accuracy?
- Are the differences large or small?
- Does the winner surprise you?
Success check:
- You have accuracy scores for all three models
- You can identify the best performer
Exercise 3: Meet Your Fighters (Intuition)
Goal: Understand why these models behave differently.
Read the descriptions below, then answer the questions:
Logistic Regression fits a linear boundary. Imagine drawing a straight line (or plane) to separate classes. If the true boundary is curved or complex, logistic regression will struggle.
Random Forest builds many decision trees, each asking simple questions about the features, like "Is length > 500?" or "Is the alanine count > 40?" Each tree votes, and the majority wins. This lets it capture non-linear patterns.
XGBoost also uses decision trees, but trains them sequentially. Each new tree focuses on the examples the previous trees got wrong. This "boosting" often squeezes out extra performance.
Questions to answer:
- If your data has a simple, linear relationship between features and labels, which model would you expect to do well?
- If your data has complex, non-linear patterns, which models might do better?
- Why might a simpler model sometimes be preferred even if it's slightly less accurate?
Success check:
- You can explain the basic idea behind each model
- You understand that "more complex" doesn't always mean "better"
Exercise 4: Visualize the Decision Boundaries (Optional)
Goal: See how each model carves up the feature space.
This exercise is optional but illuminating. Create a 2D plot showing the decision boundaries of each model.
Hint: Search "sklearn plot decision boundary" or ask your favorite chatbot for a helper function. The idea:
- Create a grid of points covering your feature space
- Predict the class for each grid point
- Color the grid by predicted class
- Overlay your actual data points
Success check:
- You can see that logistic regression draws straight lines
- Random forest and XGBoost draw more complex boundaries
Exercise 5: Reflection
Goal: Consolidate your understanding.
Answer in your notebook (1-2 sentences each):
- Which model won on this dataset? Why do you think that is?
- When would you choose logistic regression over XGBoost?
- What other features (beyond length and amino acid counts) might help improve predictions?
Deliverables
Submit your completed notebook (.ipynb) with:
- All three models trained and evaluated
- Accuracy comparison clearly displayed
- Reflection answers in markdown cells
🎉 Route Complete!
Great work!