Midterm Route M3: Drug Lookalikes
- RouteID: M003
- Wall: Small Molecule Representations (W04)
- Grade: 5.10d (Midterm — Extra Credit)
- Routesetter: Adrian
- Date: 02/05/2026
The Setup
A medicinal chemist friend calls you in a panic:
"I'm working on a project and I need to find drugs that are structurally similar to a few molecules I'm interested in. I don't have access to expensive software — can you help me search a drug database using the fingerprint stuff you learned in class?"
You remember from R010 that molecular fingerprints encode structural features, and Tanimoto similarity can compare them. Time to scale up from a handful of steroids to a real drug database.
Your Mission
- Load a database of FDA-approved drugs
- Compute fingerprints for all molecules
- Given query molecules, find the most similar drugs by Tanimoto similarity
- Investigate whether structurally similar drugs have similar biological functions
Prerequisites
- R009 (From SMILES to Molecular Properties) — SMILES parsing, RDKit basics
- R010 (Morgan the Finger Printer) — fingerprints, Tanimoto similarity
Data Files
| File | Description | Link |
|---|---|---|
| FDA_Approved_structures.csv | ~2,500 FDA-approved drugs with names and SMILES | Download |
Dataset credit: This dataset comes from Kaggle. Every serious student of AI/ML for biochemistry should get comfortable browsing Kaggle — it's a goldmine of curated datasets for practice and exploration.
Your Query Molecules
Find drugs similar to these well-known compounds:
| Name | What it does | SMILES |
|---|---|---|
| Aspirin | Pain reliever, anti-inflammatory | CC(=O)OC1=CC=CC=C1C(=O)O |
| Caffeine | Stimulant | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
| Metformin | Diabetes medication | CN(C)C(=N)NC(=N)N |
| Ibuprofen | Pain reliever, anti-inflammatory | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O |
Exercise 1: Load the Drug Database
Goal: Load the approved drugs dataset and convert SMILES to RDKit molecules.
- Load FDA_Approved_structures.csv into a DataFrame
- Convert all SMILES to RDKit Mol objects
- Handle any invalid SMILES gracefully (drop or flag them)
- How many valid molecules do you have?
Success check:
- You have ~2,500 valid Mol objects
- No crashes from bad SMILES
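One possible shape for the loading step is sketched below. It uses a tiny in-memory CSV as a stand-in for the real file, and the column names `Name` and `SMILES` are assumptions, so check the actual header before reusing it. The sketch relies on the fact that `Chem.MolFromSmiles` returns `None` for an unparseable SMILES instead of raising an error.

```python
import io

import pandas as pd
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit's parse-error messages for bad SMILES

# Tiny in-memory stand-in for FDA_Approved_structures.csv.
# The column names "Name" and "SMILES" are assumptions; check the real header.
csv_text = """Name,SMILES
Aspirin,CC(=O)OC1=CC=CC=C1C(=O)O
Broken entry,C1CC(
Metformin,CN(C)C(=N)NC(=N)N
"""
# For the real data, replace this with pd.read_csv("FDA_Approved_structures.csv")
df = pd.read_csv(io.StringIO(csv_text))

# MolFromSmiles returns None (it does not raise) when parsing fails;
# non-string cells such as NaN are also mapped to None.
df["mol"] = [Chem.MolFromSmiles(s) if isinstance(s, str) else None
             for s in df["SMILES"]]

n_total = len(df)
df = df[df["mol"].notna()].reset_index(drop=True)  # drop rows that failed to parse
print(f"{len(df)} valid molecules out of {n_total}")
```

Dropping invalid rows keeps later steps simple. Flagging them in a boolean column instead would work equally well if you want to inspect the failures.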
Exercise 2: Compute Fingerprints
Goal: Generate fingerprints for all drugs in the database.
- Choose a fingerprint type (Morgan recommended — you know it well from R010)
- Pick a radius (r=2, roughly equivalent to ECFP4, is a common default for drug-like molecules)
- Compute fingerprints for all molecules
- Store them in a way you can use for similarity calculations
Hints:
- This is the same process as R010, just more molecules
- Consider storing fingerprints in a list or adding to your DataFrame
Success check:
- Every valid molecule has a fingerprint
- You can retrieve any molecule's fingerprint by index or name
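As a rough sketch of this step, the snippet below computes Morgan bit-vector fingerprints with RDKit's `rdFingerprintGenerator` module and stores them in a dictionary keyed by drug name. The 2048-bit size and the dictionary layout are illustrative choices, not requirements of the route, and the generator API assumes a reasonably recent RDKit release.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Morgan generator with radius 2; fpSize=2048 is an assumed (but common) bit length.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

# Two example drugs standing in for the full database.
drugs = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Ibuprofen": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
}

# Map each name to its fingerprint so it can be looked up later.
fps_by_name = {
    name: morgan_gen.GetFingerprint(Chem.MolFromSmiles(smiles))
    for name, smiles in drugs.items()
}

fp = fps_by_name["Aspirin"]
print(fp.GetNumBits(), "bits,", fp.GetNumOnBits(), "set")
```

A parallel list of fingerprints in DataFrame row order is an equally reasonable layout, and it is convenient for bulk similarity calls later on.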
Exercise 3: Build Query Fingerprints
Goal: Create Mol objects and fingerprints for your query molecules.
- Parse the SMILES for Aspirin, Caffeine, Metformin, and Ibuprofen
- Compute fingerprints using the same settings as Exercise 2
- Verify they parsed correctly by drawing them
Success check:
- All 4 query molecules have valid Mol objects and fingerprints
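A minimal sketch of the query setup follows, reusing the same assumed fingerprint settings (Morgan, radius 2, 2048 bits). Alongside drawing, checking each molecular formula is a quick way to confirm the SMILES parsed as intended. The variable names here are illustrative only.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator, rdMolDescriptors

queries = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "Metformin": "CN(C)C(=N)NC(=N)N",
    "Ibuprofen": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
}

# Use exactly the settings chosen for the database fingerprints.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

query_mols = {name: Chem.MolFromSmiles(smi) for name, smi in queries.items()}
query_fps = {name: gen.GetFingerprint(mol) for name, mol in query_mols.items()}

# Sanity check: print heavy-atom count and molecular formula for each query.
for name, mol in query_mols.items():
    print(name, mol.GetNumAtoms(), rdMolDescriptors.CalcMolFormula(mol))

# In a notebook, a grid drawing can be shown with:
# from rdkit.Chem import Draw
# Draw.MolsToGridImage(list(query_mols.values()), legends=list(query_mols))
```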
Exercise 4: Similarity Search
Goal: For each query molecule, find the most similar drugs in the database.
- For each query, compute Tanimoto similarity to ALL drugs in the database
- Rank drugs by similarity
- Extract the top 10 most similar drugs for each query
- Store results in a table: query name, drug name, similarity score
Hints:
- DataStructs.TanimotoSimilarity(fp1, fp2) computes similarity between two fingerprints
- You can loop, or be clever with vectorization
Success check:
- You have top 10 hits for each of the 4 queries
- Similarity scores are between 0 and 1
- The query molecule itself (if in the database) should have similarity = 1.0
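The search itself might look something like the sketch below. Tanimoto similarity is the number of bits set in both fingerprints divided by the number set in either, so identical fingerprints score 1.0. This example uses `DataStructs.BulkTanimotoSimilarity` to score one query against a whole list in a single call. The five-molecule database and the helper names `fingerprint` and `top_hits` are hypothetical stand-ins for your own code.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def fingerprint(smiles):
    """Morgan fingerprint for a SMILES string (assumes the SMILES is valid)."""
    return gen.GetFingerprint(Chem.MolFromSmiles(smiles))

# Toy database standing in for the ~2,500 approved drugs.
db = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Salicylic acid": "OC(=O)C1=CC=CC=C1O",
    "Ibuprofen": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
    "Caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "Metformin": "CN(C)C(=N)NC(=N)N",
}
db_names = list(db)
db_fps = [fingerprint(smi) for smi in db.values()]

def top_hits(query_name, query_smiles, n=10):
    """Return up to n (query name, drug name, score) rows, best match first."""
    scores = DataStructs.BulkTanimotoSimilarity(fingerprint(query_smiles), db_fps)
    ranked = sorted(zip(db_names, scores), key=lambda pair: pair[1], reverse=True)
    return [(query_name, name, score) for name, score in ranked[:n]]

hits = top_hits("Aspirin", db["Aspirin"], n=3)
for row in hits:
    print(row)
```

The rows returned for each query can be collected into one list and passed to `pandas.DataFrame` with columns for query name, drug name, and similarity score.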
Exercise 5: Investigate Your Hits
Goal: Explore whether structurally similar drugs have similar functions.
For each query molecule:
- Look at the top 5 hits (by name)
- Look up what those drugs do (Google, Wikipedia, DrugBank — whatever works)
- Do they share a therapeutic use with the query? Same drug class? Same target?
Questions to answer in your notebook:
- For Aspirin's top hits — are they also pain relievers or anti-inflammatories?
- For Caffeine's top hits — are they also stimulants or related?
- Do any hits surprise you? Similar structure but very different use?
Write 3-4 sentences per query about what you found.
Exercise 6: Reflection
Goal: Connect structure to function.
Answer in your notebook (2-3 sentences each):
- Based on your results, how well does "similar structure = similar function" hold up?
- When might two drugs have similar fingerprints but different biological effects?
Deliverables
Submit your completed notebook (.ipynb) with:
- All code cells executed
- Top 10 similar drugs for each query
- Your investigation of the hits
- Reflection answers
- A Logbook section at the end with [LOGBOOK] entries
Reminder: We've switched to including [LOGBOOK] entries directly in each route's notebook rather than in a separate file. Add your logbook entries as markdown cells at the end of this notebook.
Submission