First Ascent by
Eve Zhang
The BACE-1 Face
Drug Discovery / Medicinal Chemistry
The Proposed Route
A clean, well-scoped drug discovery problem: predict BACE-1 inhibitor activity from Morgan fingerprints, comparing a logistic regression baseline to a Random Forest. The scaffold-based split stretch goal adds a chemically meaningful generalization test: can the model recognize a new scaffold it has never seen?
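A minimal sketch of the proposed comparison, assuming scikit-learn for the models and RDKit for featurization. The fingerprint radius (2), bit length (2048), the 80/20 stratified random split, and the helper names `morgan_bits` and `compare_models` are illustrative assumptions, not details taken from the proposal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def morgan_bits(smiles, radius=2, n_bits=2048):
    """Return a 0/1 Morgan fingerprint vector for one SMILES string.

    RDKit is imported lazily so the modelling code below runs without it.
    Radius and bit length are assumed defaults, not values from the proposal.
    """
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


def compare_models(X, y, seed=0):
    """Fit both models on a stratified random split and return test ROC-AUC per model."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "logreg": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=seed),
    }
    aucs = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        # Column 1 of predict_proba is the probability of the positive class.
        aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return aucs
```

In use, `X` would be the row-wise stack of `morgan_bits(s)` over the dataset's SMILES strings and `y` the binary activity labels.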
The Crux
BACE-1 is a well-trodden benchmark; the danger is producing a result that looks good on paper (high ROC-AUC on a random split) but doesn't generalize. The scaffold-split stretch goal is where this route gets interesting and difficult. Class imbalance is a secondary crux.
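One common way to handle class imbalance is to reweight classes inversely to their frequency. The sketch below computes those weights by hand using the formula scikit-learn documents for `class_weight="balanced"`, namely n_samples / (n_classes * count). The helper name `balanced_class_weights` is hypothetical, and whether reweighting is needed at all depends on the label balance actually observed in the data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

Passing `class_weight="balanced"` to `LogisticRegression` or `RandomForestClassifier` applies the same weighting without computing it manually.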
⚠️ Pre-Climb Checklist
- ✅ Dataset is publicly available and well-curated (MoleculeNet).
- ✅ Pipeline is well-scoped for CHEM 169.
- ⚠️ Random-split ROC-AUC will likely be high; don't stop there.
- ⚠️ If doing the scaffold split, use RDKit's MurckoScaffold decomposition and report the performance drop explicitly.
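A scaffold split along the lines of the last checklist item could be sketched as follows. The grouping strategy (largest scaffold groups fill the training set first, so rarer scaffolds end up in the test set) is one common convention and an assumption here, as are the 80% training fraction and the function names.

```python
from collections import defaultdict


def murcko_scaffold(smiles):
    """Return the Bemis-Murcko scaffold of a molecule as a SMILES string (needs RDKit)."""
    from rdkit.Chem.Scaffolds import MurckoScaffold

    return MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles, includeChirality=False)


def scaffold_split(scaffolds, frac_train=0.8):
    """Split molecule indices so that no scaffold appears in both train and test.

    `scaffolds` holds one scaffold string per molecule, in dataset order.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties are broken by first occurrence for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    cutoff = frac_train * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # A whole scaffold group goes to one side, never both.
        if len(train) + len(group) <= cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test
```

Usage would look like `train_idx, test_idx = scaffold_split([murcko_scaffold(s) for s in smiles_list])`, where `smiles_list` is a hypothetical list of the dataset's SMILES strings. Because whole groups are assigned at once, the realized training fraction can fall somewhat below `frac_train`.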
Guidance
- Scaffold-based split is the most interesting part; prioritize it
- Gap between random-split AUC and scaffold-split AUC = memorization vs generalization
- That gap is a finding worth reporting; lead with it in the writeup
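To make the gap concrete, here is a small dependency-free sketch. `roc_auc` computes ROC-AUC as the fraction of (active, inactive) pairs that the model ranks correctly, with ties counted as half; this is the same quantity `sklearn.metrics.roc_auc_score` reports for binary labels. Both function names are illustrative.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random active outscores a random inactive."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def generalization_gap(auc_random, auc_scaffold):
    """Positive values mean the model does better on random splits than on unseen scaffolds."""
    return auc_random - auc_scaffold
```

Reporting both AUCs next to the gap, ideally averaged over several random seeds, makes the drop easier to interpret than either number alone.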
Source proposal: Zhang_Eve_CHEM169_Final_Project_Proposal.pdf
CHEM 169/269 · Applied AI & Machine Learning for Biochemistry