🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Results from at least 3 clustering methods
- Comparison table (clusters, sizes, runtime)
- Train/test accuracy for each clustering approach
- Reflection answers in markdown cells
Exercise 4: Reflection
Goal: Synthesize what you learned.
Answer in your notebook (2-3 sentences each):
- Which clustering method gave the most/fewest clusters? Why do you think that is?
- Did the choice of clustering method affect your model's accuracy much? What does that tell you?
- If you were starting a new protein ML project, which clustering approach would you use and why?
Exercise 3: Compare Everything
Goal: See if clustering method matters for model performance.
For each clustering method you tried:
- Do a cluster-aware train/test split
- Train a classifier (Random Forest works well)
- Record the accuracy
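The steps above can be sketched as follows. This is a minimal, self-contained example: the embeddings, localization labels, and cluster labels are random stand-ins for your real R019 data, and `GroupShuffleSplit` is one convenient way to keep each cluster entirely on one side of the split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import accuracy_score

# Stand-ins for your real data: 100 proteins, 16-dim embeddings,
# binary localization labels, and cluster labels from one method.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
y = rng.integers(0, 2, size=100)
cluster_labels = rng.integers(0, 10, size=100)

# GroupShuffleSplit puts every cluster entirely in train OR test,
# so near-duplicate proteins cannot leak across the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=cluster_labels))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[train_idx], y[train_idx])
test_acc = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
print(f"Test accuracy: {test_acc:.3f}")

# Sanity check: no cluster appears on both sides of the split
assert not set(cluster_labels[train_idx]) & set(cluster_labels[test_idx])
```

Repeat this with each set of cluster labels and record the accuracy for your table.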
Create a comparison table:
| Method | # Clusters | Largest Cluster Size | Runtime (s) | Test Accuracy |
|---|---|---|---|---|
| Connected Components | ? | ? | ? | ? |
| Agglomerative | ? | ? | ? | ? |
| DBSCAN | ? | ? | ? | ? |
Questions:
- Does the clustering method significantly affect accuracy?
- Which method is fastest?
- Which gives the most "balanced" clusters?
Exercise 2: Try Other Methods
Goal: Explore different clustering algorithms.
You've already done threshold + connected components in R019. Now try these:
Option A: Agglomerative Clustering
```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Convert similarity to distance; clip tiny negatives from float error
distance_matrix = np.clip(1 - similarity_matrix, 0, None)

clustering = AgglomerativeClustering(
    n_clusters=None,          # let the threshold decide the cluster count
    distance_threshold=0.2,   # adjust this
    metric='precomputed',
    linkage='average'
)
labels = clustering.fit_predict(distance_matrix)
```
Option B: DBSCAN
```python
import numpy as np
from sklearn.cluster import DBSCAN

distance_matrix = np.clip(1 - similarity_matrix, 0, None)

clustering = DBSCAN(
    eps=0.2,           # adjust this
    min_samples=2,     # a cluster needs at least 2 points
    metric='precomputed'
)
labels = clustering.fit_predict(distance_matrix)
# DBSCAN labels outliers as -1 instead of forcing them into a cluster
```
Try both methods, and experiment with the thresholds.
Questions to explore:
- How many clusters does each method produce?
- How does changing the threshold affect the results?
- Do some methods create more "singleton" clusters (clusters with just one protein)?
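A quick way to answer these questions for any method is to summarize its label array with `numpy.unique`. The labels below are a made-up example; substitute the output of `fit_predict` from either method (remember that DBSCAN marks noise points with `-1`).

```python
import numpy as np

# Hypothetical cluster labels; replace with your fit_predict output
labels = np.array([0, 0, 1, 2, 2, 2, 3, -1, -1, 4])

noise = int(np.sum(labels == -1))   # DBSCAN-style outliers
ids, sizes = np.unique(labels[labels >= 0], return_counts=True)
n_clusters = len(ids)
n_singletons = int(np.sum(sizes == 1))

print(f"{n_clusters} clusters, {n_singletons} singletons, {noise} noise points")
# → 5 clusters, 3 singletons, 2 noise points
```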
Exercise 1: Recap Your R019 Approach
Goal: Establish a baseline with the method you already know.
In R019, you clustered proteins using:
- Cosine similarity matrix
- Threshold (e.g., 0.8)
- Connected components
Recreate this clustering (or load your results from R019). Record:
- Number of clusters
- Size of the largest cluster
- How long it took
This is your baseline for comparison.
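If you need to recreate the baseline rather than load it, a sketch of threshold + connected components using `scipy.sparse.csgraph` looks like this. The random embeddings here are only a placeholder so the snippet runs on its own; use your R019 similarity matrix in practice.

```python
import time
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Placeholder data: 50 unit-normalized embeddings -> cosine similarity
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
similarity_matrix = emb @ emb.T

start = time.perf_counter()
adjacency = csr_matrix(similarity_matrix >= 0.8)   # threshold, e.g. 0.8
n_clusters, labels = connected_components(adjacency, directed=False)
runtime = time.perf_counter() - start

sizes = np.bincount(labels)
print(f"{n_clusters} clusters, largest has {sizes.max()} proteins, "
      f"took {runtime:.4f}s")
```

The three printed numbers are exactly the baseline statistics the exercise asks you to record.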
Exercise 0: Setup
Goal: Load your similarity matrix from R019.
You should already have:
- Mtb protein embeddings loaded
- Similarity matrix computed
- Labels for localization prediction
If you saved your similarity matrix from R019, load it. Otherwise, recompute it:
```python
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)
```
Success check:
- You have your similarity matrix ready
- You have your labels ready
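One way to verify the success check programmatically; the `embeddings` and `labels` arrays are assumed to come from R019 (synthetic stand-ins are used here so the snippet runs on its own).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for your R019 arrays
embeddings = np.random.default_rng(0).normal(size=(50, 8))
labels = np.zeros(50, dtype=int)

similarity_matrix = cosine_similarity(embeddings)

# Quick sanity checks before moving on
assert similarity_matrix.shape == (len(embeddings), len(embeddings))
assert len(labels) == len(embeddings)
assert np.allclose(np.diag(similarity_matrix), 1.0)  # self-similarity is 1
```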
Why this route exists
In R019, you learned one way to cluster proteins: threshold + connected components. It works, but it's not the only option.
Different clustering methods make different assumptions and produce different results. Some create many small clusters, others create fewer large ones. Some are fast, others are slow. Some handle outliers well, others don't.
In this route, you'll try multiple methods and see how the choice affects your train/test splits and model performance.
What you'll be able to do after this route
By the end, you can:
- Apply multiple clustering algorithms to protein data
- Compare clustering results (size, count, runtime)
- Understand trade-offs between different methods
- Make informed choices about clustering in your projects
Key definitions
Agglomerative Clustering: bottom-up approach. Start with each point as its own cluster, then merge the closest pairs until a stopping criterion is met.
DBSCAN: density-based clustering. Finds clusters as dense regions separated by sparse regions, and is good at flagging outliers.
Linkage: in agglomerative clustering, how you measure distance between clusters. "Average" uses the mean distance between all pairs.
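A tiny numeric illustration of average linkage with two made-up 1-D clusters: the distance between the clusters is the mean of all four pairwise point distances.

```python
import numpy as np

# Two toy clusters on a line: A = {0, 1}, B = {3, 5}
A = np.array([[0.0], [1.0]])
B = np.array([[3.0], [5.0]])

pairwise = np.abs(A - B.T)   # |a_i - b_j| for every pair: 3, 5, 2, 4
print(pairwise.mean())       # average linkage distance: (3+5+2+4)/4 = 3.5
```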
Route 020: Clustering Methods Showdown
- RouteID: 020
- Wall: The Machine Learning Offwidth (W06)
- Grade: 5.9
- Routesetter: Adrian
- Time: ~30 minutes
- Dataset: Same as R019 (Mtb embeddings + similarity matrix)
🧗 Base Camp
Start here and climb your way up!