🧗 Start Here
Scroll down to complete this route
Route: Morgan the Finger Printer
- RouteID: R010
- Wall: Small Molecule Representations
- Grade: 5.10c
- Route-setter: Shivansh
- Time: ~40 minutes
- Dataset: steroid_dataset.csv
Why this route exists
Before we can cluster, search, or model chemical space, we need a way to represent molecules numerically. Cheminformatics solves this using fingerprints: fixed-length binary vectors that encode structural features. Different fingerprint families capture different chemistry — circular neighborhoods, key patterns, or topological paths — and those differences matter for downstream tasks.
In this route, we'll compute several fingerprints and visualize how they change molecular relationships.
What you'll build
During this climb, you will:
- Generate Morgan, MACCS, and Topological fingerprints for a set of molecules.
- Compare molecules using Tanimoto similarity matrices.
- Visualize similarity patterns as heatmaps.
- Inspect bits from Morgan fingerprints to reveal the substructures that activate them.
These skills form the foundations of chemical similarity search, QSAR models, and representation learning.
Exercise 0: The Knot Check (Setup & Syntax)
Goal: Make sure our tools are clipped in correctly before climbing.
Checklist:
- Install RDKit in Colab
- Import required libraries
- Load your list of SMILES strings from the dataset (steroid_dataset.csv)
- Convert SMILES → RDKit Mol objects
- Display molecules to verify parsing
Belay Check:
- All SMILES parse to Mol objects (no None)
- Molecule count matches your dataset
If something fails here, don't continue upward.
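A minimal sketch of the knot check, assuming RDKit is already installed (in Colab this is typically done with `pip install rdkit`). The three SMILES are illustrative placeholders, not rows from the dataset, and the `smiles` column name in the commented-out line is a guess about steroid_dataset.csv that you should adjust to your file. The helper name `parse_smiles` is hypothetical.

```python
from rdkit import Chem


def parse_smiles(smiles_list):
    """Return (mols, failed): parsed Mol objects and the SMILES that did not parse."""
    mols, failed = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # returns None when parsing fails
        if mol is None:
            failed.append(smi)
        else:
            mols.append(mol)
    return mols, failed


# With the real dataset you would load the strings first, for example:
#   import pandas as pd
#   smiles_list = pd.read_csv("steroid_dataset.csv")["smiles"].tolist()  # column name is an assumption
# Illustrative placeholder molecules (ethanol, benzene, acetic acid):
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]

mols, failed = parse_smiles(smiles_list)
print(f"parsed {len(mols)} of {len(smiles_list)}; failed: {failed}")
```

If `failed` is not empty, fix or drop those entries before moving on. In a notebook, `Draw.MolsToGridImage(mols)` from `rdkit.Chem.Draw` is one way to display the parsed molecules.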
Exercise 1: The Estimator (Fingerprint Generation)
Goal: Compute several different fingerprints for each molecule.
We will compute:
- Morgan (circular) fingerprints
  - radii: 1, 2, 3
  - size: ~1024 bits
  - uses bitInfo for introspection (used later)
- MACCS Keys
  - fixed key-based fingerprint with 166 keys (RDKit returns a 167-bit vector; bit 0 is unused)
- RDKit Topological
  - encodes path-based graph features
Route Beta (Tools):
AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=nBits, bitInfo=bitInfo)
MACCSkeys.GenMACCSKeys(m)
Chem.RDKFingerprint(m)
Belay Check:
Confirm for at least one molecule:
- Fingerprints are bit vectors (RDKit ExplicitBitVect objects, not count vectors)
- Bit length matches expectations
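One way to put the three tools together is sketched below. It assumes 1024 bits for Morgan and RDKit's default settings for the other two; the helper name `compute_fingerprints` and the aspirin test molecule are illustrative, not part of the route. Newer RDKit releases may print a deprecation warning for `GetMorganFingerprintAsBitVect`; the sketch keeps the call named in the Route Beta.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys


def compute_fingerprints(mol, n_bits=1024):
    """Return (fps, bit_infos) for one molecule.

    fps maps a fingerprint name to an RDKit bit vector;
    bit_infos maps each Morgan radius to its bitInfo dictionary.
    """
    fps, bit_infos = {}, {}
    for radius in (1, 2, 3):
        info = {}  # filled in by RDKit; bitInfo must be passed by keyword
        fps[f"morgan_r{radius}"] = AllChem.GetMorganFingerprintAsBitVect(
            mol, radius, nBits=n_bits, bitInfo=info
        )
        bit_infos[radius] = info
    fps["maccs"] = MACCSkeys.GenMACCSKeys(mol)
    fps["topological"] = Chem.RDKFingerprint(mol)  # default size and path lengths
    return fps, bit_infos


# Placeholder molecule (aspirin), used only to exercise the function.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
fps, bit_infos = compute_fingerprints(mol)

# Belay check: type, length, and number of on-bits for each fingerprint.
for name, fp in fps.items():
    print(name, type(fp).__name__, fp.GetNumBits(), fp.GetNumOnBits())
```

Keeping the `bitInfo` dictionaries alongside the fingerprints avoids recomputing them for the introspection exercise.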
Exercise 2: The Classifier (Tanimoto Similarity)
Goal: Compare structural similarity using Tanimoto similarity, defined for bit vectors as:
T(A, B) = |A ∩ B| / |A ∪ B|
Task:
For each fingerprint family:
- Compute an N × N pairwise similarity matrix, where N = number of molecules.
Belay Check:
- Similarity matrix is square
- Diagonal entries = 1.0
- Values in [0,1]
If diagonals are not 1.0, recheck your vector types (some users accidentally feed count vectors!)
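To make the formula concrete, here is a small pure-Python sketch that treats each fingerprint as the set of its on-bit indices. The toy bit sets are made up for illustration; with RDKit bit vectors you could obtain such sets with `set(fp.GetOnBits())`. Returning 1.0 for two empty fingerprints is a convention chosen here so the diagonal stays at 1.0.

```python
def tanimoto(a, b):
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two collections of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 1.0  # convention chosen for this sketch: two empty fingerprints are identical
    return len(a & b) / union


def tanimoto_matrix(on_bits_list):
    """Symmetric N x N matrix of pairwise Tanimoto similarities."""
    n = len(on_bits_list)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # fill the upper triangle and mirror it
            sim = tanimoto(on_bits_list[i], on_bits_list[j])
            matrix[i][j] = sim
            matrix[j][i] = sim
    return matrix


# Toy on-bit sets: A and B share bits 3 and 4; C shares nothing with either.
toy_on_bits = [{1, 2, 3, 4}, {3, 4, 5}, {7}]
matrix = tanimoto_matrix(toy_on_bits)
for row in matrix:
    print(row)
```

For A and B the intersection has 2 bits and the union has 5, so T(A, B) = 2/5 = 0.4. On real data, `DataStructs.BulkTanimotoSimilarity(fp, fps)` from RDKit computes one row of the matrix directly from bit vectors and is much faster than the loop above.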
Exercise 3: The Parser (Bit-Level Introspection)
Goal: Inspect Morgan fingerprint bits and discover which substructures caused them to fire.
This step shows which atoms and bonds set each bit, so fingerprints stop being magic black boxes.
Beta (Tools):
Morgan calls can optionally populate:
bitInfo = {}
AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits, bitInfo=bitInfo)
bitInfo maps: bit → ((atom_id, radius), ...), one pair for each atom environment that sets the bit
Tasks:
- Identify 3–5 bits that appear frequently across molecules.
- Use bitInfo to retrieve atom environments for those bits.
- Use RDKit drawing utilities to visualize each substructure.
Belay Check:
For each inspected bit:
- The drawn substructure appears without errors
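The tasks above can be sketched without any drawing, which makes the logic easier to check first. The sketch counts how many molecules set each Morgan bit, then turns each (atom_id, radius) entry into a fragment SMILES. The helper names and the three alcohol placeholders are illustrative assumptions; radius 2 and 1024 bits are example settings.

```python
from collections import Counter

from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_with_info(mol, radius=2, n_bits=1024):
    """Return the Morgan bit vector and its bitInfo dictionary."""
    info = {}
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits, bitInfo=info)
    return fp, info


def environment_smiles(mol, atom_idx, radius):
    """SMILES of the atom environment centred on atom_idx with the given radius."""
    bond_ids = list(Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_idx)) if radius > 0 else []
    if not bond_ids:
        # Radius 0 (or no bonds found): the environment is just the central atom.
        return mol.GetAtomWithIdx(atom_idx).GetSymbol()
    return Chem.MolToSmiles(Chem.PathToSubmol(mol, bond_ids))


# Placeholder molecules (ethanol, 1-propanol, 1-butanol), not from the dataset.
mols = [Chem.MolFromSmiles(s) for s in ("CCO", "CCCO", "CCCCO")]
infos = [morgan_with_info(m)[1] for m in mols]

# Count in how many molecules each bit is set, then keep the most frequent ones.
bit_counts = Counter()
for info in infos:
    bit_counts.update(info.keys())
common_bits = [bit for bit, _ in bit_counts.most_common(3)]

for bit in common_bits:
    # Show the first environment of the first molecule that sets this bit.
    for mol, info in zip(mols, infos):
        if bit in info:
            atom_idx, radius = info[bit][0]
            print(bit, bit_counts[bit], radius, environment_smiles(mol, atom_idx, radius))
            break
```

In a notebook, `Draw.DrawMorganBit(mol, bit, info)` from `rdkit.Chem.Draw` renders the same environment as an image, which covers the visualization task. Because fingerprints are folded to a fixed length, one bit can correspond to different substructures in different molecules, so it is worth checking more than one example per bit.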
Exercise 4: The Send (Similarity Heatmaps + Interpretation)
Goal: Visualize molecular relationships and compare fingerprints.
Tasks:
- Convert each similarity matrix into a heatmap.
- Label axes with molecule names or SMILES.
- Compare clustering patterns across:
  - Morgan (r=1)
  - Morgan (r=2)
  - Morgan (r=3)
  - MACCS
  - Topological
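A heatmap for one similarity matrix might look like the sketch below, which assumes matplotlib and NumPy are available (both are preinstalled in Colab). The function name `plot_similarity_heatmap`, the labels, and the toy matrix values are placeholders; pass your own matrix and molecule names for each fingerprint family.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_similarity_heatmap(matrix, labels, title):
    """Draw a square similarity matrix as a heatmap and return (fig, ax)."""
    data = np.asarray(matrix, dtype=float)
    fig, ax = plt.subplots(figsize=(5, 4))
    # Fix the colour scale to [0, 1] so heatmaps of different fingerprints are comparable.
    image = ax.imshow(data, vmin=0.0, vmax=1.0, cmap="viridis")
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticklabels(labels)
    ax.set_title(title)
    fig.colorbar(image, ax=ax, label="Tanimoto similarity")
    fig.tight_layout()
    return fig, ax


# Toy 3 x 3 matrix with made-up values, used only to exercise the function.
toy_matrix = [
    [1.0, 0.4, 0.0],
    [0.4, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
fig, ax = plot_similarity_heatmap(toy_matrix, ["mol A", "mol B", "mol C"], "Morgan (r=2), toy values")
```

Using the same `vmin` and `vmax` for every plot matters here: without a shared colour scale, a fingerprint with uniformly high similarities can look as structured as one that separates the molecules well.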
Submission
Submit your files by uploading them to the following Google Form:
Please upload both:
- your .ipynb notebook
- your logbook file
Make sure filenames follow the naming conventions above.
🎉 Route Complete!
Great work!