🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Results from at least 3 clustering methods
- Comparison table (clusters, sizes, runtime)
- Train/test accuracy for each clustering approach
- Reflection answers in markdown cells
Exercise 4: Reflection
Goal: Synthesize what you learned.
Answer in your notebook (2-3 sentences each):
- Which clustering method gave the most/fewest clusters? Why do you think that is?
- Did the choice of clustering method affect your model's accuracy much? What does that tell you?
- If you were starting a new protein ML project, which clustering approach would you use and why?
Exercise 3: Compare Everything
Goal: See if clustering method matters for model performance.
For each clustering method you tried:
- Do a cluster-aware train/test split
- Train a classifier (Random Forest works well)
- Record the accuracy
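The steps above can be sketched as follows. This is a minimal, self-contained example: the embeddings, localization labels, and cluster labels are random stand-ins for your real R019 data, and `GroupShuffleSplit` is one convenient way to keep each cluster entirely on one side of the split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import accuracy_score

# Stand-ins for your real data: 100 proteins, 16-dim embeddings,
# binary localization labels, and cluster labels from one method.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
y = rng.integers(0, 2, size=100)
cluster_labels = rng.integers(0, 10, size=100)

# GroupShuffleSplit puts every cluster entirely in train OR test,
# so near-duplicate proteins cannot leak across the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=cluster_labels))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[train_idx], y[train_idx])
test_acc = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
print(f"Test accuracy: {test_acc:.3f}")

# Sanity check: no cluster appears on both sides of the split
assert not set(cluster_labels[train_idx]) & set(cluster_labels[test_idx])
```

Repeat this with each set of cluster labels and record the accuracy for your table.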
Create a comparison table:
| Method | # Clusters | Largest Cluster Size | Runtime (s) | Test Accuracy |
|---|---|---|---|---|
| Connected Components | ? | ? | ? | ? |
| Agglomerative | ? | ? | ? | ? |
| DBSCAN | ? | ? | ? | ? |
Questions:
- Does the clustering method significantly affect accuracy?
- Which method is fastest?
- Which gives the most "balanced" clusters?
Exercise 2: Try Other Methods
Goal: Explore different clustering algorithms.
You've already done threshold + connected components in R019. Now try these:
Option A: Agglomerative Clustering
```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Convert similarity to distance; clip tiny negatives from float error
distance_matrix = np.clip(1 - similarity_matrix, 0, None)

clustering = AgglomerativeClustering(
    n_clusters=None,          # let the threshold decide the cluster count
    distance_threshold=0.2,   # adjust this
    metric='precomputed',
    linkage='average'
)
labels = clustering.fit_predict(distance_matrix)
```
Option B: DBSCAN
```python
import numpy as np
from sklearn.cluster import DBSCAN

distance_matrix = np.clip(1 - similarity_matrix, 0, None)

clustering = DBSCAN(
    eps=0.2,           # adjust this
    min_samples=2,     # a cluster needs at least 2 points
    metric='precomputed'
)
labels = clustering.fit_predict(distance_matrix)
# DBSCAN labels outliers as -1 instead of forcing them into a cluster
```
Try both methods, and experiment with the thresholds.
Questions to explore:
- How many clusters does each method produce?
- How does changing the threshold affect the results?
- Do some methods create more "singleton" clusters (clusters with just one protein)?
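A quick way to answer these questions for any method is to summarize its label array with `numpy.unique`. The labels below are a made-up example; substitute the output of `fit_predict` from either method (remember that DBSCAN marks noise points with `-1`).

```python
import numpy as np

# Hypothetical cluster labels; replace with your fit_predict output
labels = np.array([0, 0, 1, 2, 2, 2, 3, -1, -1, 4])

noise = int(np.sum(labels == -1))   # DBSCAN-style outliers
ids, sizes = np.unique(labels[labels >= 0], return_counts=True)
n_clusters = len(ids)
n_singletons = int(np.sum(sizes == 1))

print(f"{n_clusters} clusters, {n_singletons} singletons, {noise} noise points")
# → 5 clusters, 3 singletons, 2 noise points
```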
Exercise 1: Recap Your R019 Approach
Goal: Establish a baseline with the method you already know.
In R019, you clustered proteins using:
- Cosine similarity matrix
- Threshold (e.g., 0.8)
- Connected components
Recreate this clustering (or load your results from R019). Record:
- Number of clusters
- Size of the largest cluster
- How long it took
This is your baseline for comparison.
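If you need to recreate the baseline rather than load it, a sketch of threshold + connected components using `scipy.sparse.csgraph` looks like this. The random embeddings here are only a placeholder so the snippet runs on its own; use your R019 similarity matrix in practice.

```python
import time
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Placeholder data: 50 unit-normalized embeddings -> cosine similarity
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
similarity_matrix = emb @ emb.T

start = time.perf_counter()
adjacency = csr_matrix(similarity_matrix >= 0.8)   # threshold, e.g. 0.8
n_clusters, labels = connected_components(adjacency, directed=False)
runtime = time.perf_counter() - start

sizes = np.bincount(labels)
print(f"{n_clusters} clusters, largest has {sizes.max()} proteins, "
      f"took {runtime:.4f}s")
```

The three printed numbers are exactly the baseline statistics the exercise asks you to record.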
Exercise 0: Setup
Goal: Load your similarity matrix from R019.
You should already have:
- Mtb protein embeddings loaded
- Similarity matrix computed
- Labels for localization prediction
If you saved your similarity matrix from R019, load it. Otherwise, recompute it:
```python
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)
```
Success check:
- You have your similarity matrix ready
- You have your labels ready
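One way to verify the success check programmatically; the `embeddings` and `labels` arrays are assumed to come from R019 (synthetic stand-ins are used here so the snippet runs on its own).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for your R019 arrays
embeddings = np.random.default_rng(0).normal(size=(50, 8))
labels = np.zeros(50, dtype=int)

similarity_matrix = cosine_similarity(embeddings)

# Quick sanity checks before moving on
assert similarity_matrix.shape == (len(embeddings), len(embeddings))
assert len(labels) == len(embeddings)
assert np.allclose(np.diag(similarity_matrix), 1.0)  # self-similarity is 1
```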
Why this route exists
In R019, you learned one way to cluster proteins: threshold + connected components. It works, but it's not the only option.
Different clustering methods make different assumptions and produce different results. Some create many small clusters, others create fewer large ones. Some are fast, others are slow. Some handle outliers well, others don't.
In this route, you'll try multiple methods and see how the choice affects your train/test splits and model performance.
What you'll be able to do after this route
By the end, you can:
- Apply multiple clustering algorithms to protein data
- Compare clustering results (size, count, runtime)
- Understand trade-offs between different methods
- Make informed choices about clustering in your projects
Key definitions
Agglomerative Clustering: bottom-up approach. Start with each point as its own cluster, then merge the closest pairs until a stopping criterion is met.
DBSCAN: density-based clustering. Finds clusters as dense regions separated by sparse regions, and is good at flagging outliers.
Linkage: in agglomerative clustering, how you measure distance between clusters. "Average" uses the mean distance between all pairs.
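A tiny numeric illustration of average linkage with two made-up 1-D clusters: the distance between the clusters is the mean of all four pairwise point distances.

```python
import numpy as np

# Two toy clusters on a line: A = {0, 1}, B = {3, 5}
A = np.array([[0.0], [1.0]])
B = np.array([[3.0], [5.0]])

pairwise = np.abs(A - B.T)   # |a_i - b_j| for every pair: 3, 5, 2, 4
print(pairwise.mean())       # average linkage distance: (3+5+2+4)/4 = 3.5
```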
Route 020: Clustering Methods Showdown
- RouteID: 020
- Wall: The Machine Learning Offwidth (W06)
- Grade: 5.9
- Routesetter: Adrian
- Time: ~30 minutes
- Dataset: Same as R019 (Mtb embeddings + similarity matrix)
🧗 Base Camp
Start here and climb your way up!