🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Dataset exploration (molecule grid, logBB histogram, class counts)
- Descriptor distributions colored by BBB class + statistical tests
- 2D chemical space visualization with interpretation
- Model metrics (accuracy, ROC-AUC, ROC curve) with random split
- Model metrics with similarity-based split (leakage check)
- Reflection answers in markdown cells
Exercise 6: Reflection
Goal: Consolidate what you learned about molecular ML.
Answer in your notebook (2-3 sentences each):
- Which molecular descriptors seemed most predictive of BBB permeability? Why might that be, biologically?
- Remember your UMAP plot where BBB+ and BBB- overlapped? You might have thought "this isn't learnable." But then your classifier got decent performance. What's going on? Why can a model learn something that isn't obvious in a 2D projection?
- How did your model's performance change when you split by molecular similarity instead of randomly? What's the lesson here?
Optional stretch goals:
- Try combining both descriptors AND fingerprints as features. Does ROC-AUC improve beyond either alone?
- Swap in an MLPClassifier (a simple neural network). How does it compare to Random Forest?
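If you attempt the MLP stretch goal, here is a minimal sketch. It uses synthetic stand-in features from `make_classification`; swap in your own descriptor or fingerprint matrix and labels. The hidden layer size and iteration count are arbitrary starting points, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for your descriptor matrix X and 0/1 BBB labels y
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# MLPs are sensitive to feature scale, so standardize descriptors first
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0))
mlp.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1])
print(f"MLP ROC-AUC: {auc:.3f}")
```

Compare this AUC against your Random Forest on the same split before drawing any conclusions.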
Exercise 5: Check for Data Leakage
Goal: Apply what you learned in R019: The Leaky Pipeline to molecules.
In R019, you learned that structurally similar proteins in both train and test sets can inflate your accuracy — the model memorizes families instead of learning generalizable patterns. The exact same problem applies to small molecules.
Your task: Instead of a random train/test split, split by molecular similarity.
- Compute pairwise similarity between all molecules using your fingerprints (Tanimoto similarity works well here)
- Cluster the molecules by similarity
- Split by cluster: entire clusters go to either train OR test, not both
- Retrain your model and evaluate
A note on clustering: There's no single "right" way to cluster molecules. You might use agglomerative clustering with a distance threshold, DBSCAN, or something else entirely. And whatever method you choose will have parameters (e.g., what similarity threshold defines "too similar"?). That's okay — the point is to explore. Start simple, see what happens, then try different thresholds or methods if you're curious. In real research, you'd want to justify your choices, but for now, just get something working and observe the effect.
Questions to answer:
- Does your model's performance drop with the clustered split?
- If so, how much of your original accuracy was inflated by leakage?
- What does this tell you about how you should evaluate molecular ML models in real research?
You've done this before with proteins — now you should be able to do it with molecules. If you're stuck, revisit R019.
Exercise 4: Train a Classifier
Goal: Build a simple classifier to predict BBB+ (permeable) vs BBB- (non-permeable).
Step 1: Train with descriptors
Start with the 7 descriptors you computed (MW, LogP, TPSA, etc.). Split your data into 80% training / 20% test (stratified by BBB class), then train a RandomForestClassifier.
Hint: Use train_test_split from sklearn with stratify=y to keep class balance.
Step 2: Evaluate
Report accuracy and ROC-AUC. Plot the ROC curve.
Hint: Look into accuracy_score, roc_auc_score, and RocCurveDisplay from sklearn.metrics. You'll need both hard predictions (predict) and probability scores (predict_proba).
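Putting Steps 1 and 2 together, here is a minimal sketch with synthetic stand-in data; replace `X` and `y` with your descriptor matrix and BBB labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-in: swap in your 7-descriptor matrix X and 0/1 labels y
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)  # stratify keeps class balance

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))             # hard predictions
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])  # probability scores
print(f"accuracy={acc:.3f}  ROC-AUC={auc:.3f}")
```

For the ROC curve itself, `RocCurveDisplay.from_estimator(clf, X_test, y_test)` draws it in one call.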
You should get a ROC-AUC around 0.90-0.95. Wait — that's really good for just 7 simple features! Are you surprised? With just molecular weight, LogP, and a few other basic properties, you can predict BBB permeability pretty well.
Step 3: Feature importance
Check which features your model finds most important. For Random Forest, look at the .feature_importances_ attribute. Plot the top features as a bar chart.
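A sketch of the importance plot, again on a synthetic stand-in model; the names are the seven descriptors from Exercise 2, paired here with fake features purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

names = ["MolWt", "MolLogP", "TPSA", "NumHDonors",
         "NumHAcceptors", "NumRotatableBonds", "FractionCSP3"]

# Stand-in data; use your real descriptor matrix and fitted model instead
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Sort features from most to least important and plot as a bar chart
order = np.argsort(clf.feature_importances_)[::-1]
plt.bar([names[i] for i in order], clf.feature_importances_[order])
plt.ylabel("importance")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
```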
This is a big advantage of descriptors: interpretability. You can explain to a chemist exactly which properties matter. Can you make biological sense of which descriptors are most predictive?
Step 4: Now try fingerprints
You got great performance with simple, interpretable features. But what about fingerprints — those 1024-bit vectors encoding molecular substructures?
Train a new model using fingerprints instead of descriptors. Compare the ROC-AUC.
Questions to consider:
- Do fingerprints outperform descriptors? By how much? (You might see something like 0.95 vs 0.96)
- Is a 0.01 improvement in ROC-AUC actually meaningful? Is it worth losing the ability to explain which molecular properties matter?
- When would you choose interpretable features over marginally better black-box features? When might you make the opposite choice?
Success check:
- You have ROC-AUC for both descriptor-based and fingerprint-based models
- You can articulate the interpretability vs. performance tradeoff
- You identified the most important descriptors and can explain why they might matter biologically
Exercise 3: Visualize Chemical Space
Goal: See how molecules group together based on structural similarity.
Step 1: Compute molecular fingerprints
Generate Morgan fingerprints for each molecule. Use 1024 bits and radius 2.
Hint: Look into AllChem.GetMorganFingerprintAsBitVect. You'll need to convert each fingerprint to a numpy array and stack them into a matrix.
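A minimal sketch of this step, using three hardcoded SMILES in place of your DataFrame column:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]  # stand-in for df["SMILES"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

fps = []
for mol in mols:
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    fps.append(np.array(bv))  # convert RDKit bit vector to a numpy row
X_fp = np.vstack(fps)         # shape: (n_molecules, 1024)
```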
These binary vectors encode each molecule's substructure patterns — a digital fingerprint.
Step 2: Project into 2D
Use UMAP (or t-SNE) to reduce your 1024-dimensional fingerprints down to 2D coordinates.
Hint: You installed umap-learn earlier. Ask your chatbot how to use UMAP for dimensionality reduction.
This step may take a minute — that's normal.
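As one possible starting point, here is the t-SNE variant on a random stand-in fingerprint matrix; the perplexity value is an arbitrary choice you should tune to your dataset size:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_fp = rng.integers(0, 2, size=(80, 1024)).astype(float)  # stand-in fingerprints

# perplexity must be smaller than the number of samples
coords = TSNE(n_components=2, perplexity=20,
              random_state=0).fit_transform(X_fp)
```

With umap-learn the call is analogous, e.g. `umap.UMAP(n_components=2).fit_transform(X_fp)`; for binary fingerprints a Jaccard/Tanimoto metric is a natural alternative to the Euclidean default.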
Step 3: Plot and interpret
Create a scatter plot of your 2D coordinates, colored by BBB class.
Questions to answer:
- Do BBB+ and BBB- compounds cluster separately, or do they overlap?
- If they overlap a lot... does that mean we can't machine learn this? Is the UMAP the final word on whether a classification problem is solvable? Think about what UMAP is actually showing you (a 2D projection of structural similarity) versus what your classifier will be working with.
Don't despair if the UMAP looks messy — we'll revisit this question after you train your model.
Success check:
- You have a 2D scatter plot with visible structure
- You wrote a short interpretation of what you see (and your current hypothesis about whether this is learnable)
Exercise 2: Generate Molecular Descriptors
Goal: Convert molecules into numerical features that describe their physicochemical properties.
Step 1: Compute RDKit descriptors
Calculate the following for each molecule and add them as new columns:
| Descriptor | What it measures |
|---|---|
| MolWt | Molecular weight |
| MolLogP | Predicted lipophilicity |
| TPSA | Topological polar surface area |
| NumHDonors | Hydrogen-bond donors |
| NumHAcceptors | Hydrogen-bond acceptors |
| NumRotatableBonds | Molecular flexibility |
| FractionCSP3 | Fraction of sp3-hybridized carbons |
Hint: These are all in rdkit.Chem.Descriptors. Ask your chatbot how to compute RDKit descriptors for a DataFrame of molecules.
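One way to sketch this, assuming a DataFrame with a SMILES column (three hardcoded molecules here stand in for B3DB):

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

df = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]})
df["mol"] = df["SMILES"].apply(Chem.MolFromSmiles)

# Map each new column name to its RDKit descriptor function
descriptor_fns = {
    "MolWt": Descriptors.MolWt,
    "MolLogP": Descriptors.MolLogP,
    "TPSA": Descriptors.TPSA,
    "NumHDonors": Descriptors.NumHDonors,
    "NumHAcceptors": Descriptors.NumHAcceptors,
    "NumRotatableBonds": Descriptors.NumRotatableBonds,
    "FractionCSP3": Descriptors.FractionCSP3,
}
for name, fn in descriptor_fns.items():
    df[name] = df["mol"].apply(fn)
```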
Step 2: Explore distributions by class
Create histograms for ALL of the descriptors you computed, colored by BBB class. You want to see if the distributions differ between BBB+ and BBB- molecules.
Hint: Seaborn's histplot with the hue parameter is useful here.
Step 3: Test for statistical significance
Your histograms give you a visual sense of which descriptors differ between classes. Now quantify it: run a t-test (or similar) for each descriptor to see if the difference between BBB+ and BBB- is statistically significant.
Hint: scipy.stats.ttest_ind is your friend here.
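A sketch of one such test, with synthetic normal samples standing in for a descriptor split by class (e.g. `df.loc[df["BBB_class"] == 1, "MolLogP"]`):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Stand-ins for one descriptor's values in each class
logp_bbb_pos = rng.normal(2.5, 1.0, 300)
logp_bbb_neg = rng.normal(1.0, 1.0, 300)

# equal_var=False gives Welch's t-test, which doesn't assume equal variances
t_stat, p_value = ttest_ind(logp_bbb_pos, logp_bbb_neg, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.2e}")
```

Loop this over all seven descriptor columns and collect the p-values in one place.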
Questions to answer:
- Which descriptors show statistically significant differences (p < 0.05)?
- Do BBB+ compounds tend to be more lipophilic (higher LogP)? Less polar (lower TPSA)?
- Does your visual intuition from the histograms match the statistical test results, or do the two disagree anywhere?
Advanced hint: You're running multiple statistical tests (one per descriptor). If you want to be rigorous, you should correct for multiple hypothesis testing — otherwise you'll get false positives. Look into Bonferroni correction or ask your chatbot about it.
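Bonferroni correction itself is a one-liner: with n tests, compare each p-value to alpha / n (equivalently, multiply each p-value by n). The p-values below are made up for illustration:

```python
n_tests = 7       # one test per descriptor
alpha = 0.05
p_values = [0.001, 0.04, 0.20, 0.008, 0.03, 0.6, 0.0001]  # illustrative only

# Bonferroni: a result is significant only if p < alpha / n_tests
significant = [p < alpha / n_tests for p in p_values]
```

Notice how some descriptors that pass p < 0.05 no longer pass the corrected threshold of roughly 0.007.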
Write up your findings in a markdown cell.
Success check:
- Your DataFrame has new descriptor columns
- You have histograms showing class differences
- You noted which descriptors seem predictive
Exercise 1: Explore the Dataset
Goal: Understand the structure and content of B3DB before modeling.
Step 1: Inspect the data
Display the first few rows and understand the key columns:
- logBB — Brain-to-blood concentration ratio. Positive = higher brain concentration.
- BBB+/BBB- — Categorical label. BBB+ = permeable, BBB- = non-permeable.
- threshold — The cutoff used to define BBB+ vs BBB-. Our understanding: B3DB compiled data from many sources that used different thresholds, and this column records what each source used. But honestly, we don't fully understand why so many values are NaN here — can you figure out what's going on?
Key observation: All molecules have a BBB+/BBB- label, and the classes are reasonably balanced. But not all molecules have a logBB value — many are missing. Why is that? How can you have a BBB+/BBB- label without a logBB ratio? See if you can figure this out from the paper, or reason about how a dataset like this might have been compiled.
Step 2: Generate molecule objects
Use RDKit to convert the SMILES strings into molecule objects. Add a new column to your DataFrame with the molecule objects.
Hint: Look into Chem.MolFromSmiles. If you're stuck, ask your chatbot how to apply an RDKit function to a pandas column.
Heads up: Two of the molecules will fail to parse (you'll get None instead of a molecule object). Investigate: which ones failed? Why? Can you fix them or should you drop them? It's okay if you can't fix them — just don't let them break your downstream code.
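A small sketch of the parse-and-check pattern, with a deliberately broken SMILES standing in for the two failures you'll hit in B3DB:

```python
import pandas as pd
from rdkit import Chem

df = pd.DataFrame({"SMILES": ["CCO", "not_a_smiles", "c1ccccc1"]})
df["mol"] = df["SMILES"].apply(Chem.MolFromSmiles)  # failures come back as None

# Inspect the failures before deciding whether to fix or drop them
bad = df[df["mol"].isna()]
print(bad["SMILES"].tolist())

df = df.dropna(subset=["mol"])  # here we simply drop them
```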
Step 3: Visualize sample molecules
Pick 10 random molecules and display them as 2D structures in a grid.
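One way to do this is with RDKit's grid drawing, sketched here on four hardcoded molecules; in your notebook you'd pass the mol objects from a `df.sample(10)` instead:

```python
from rdkit import Chem
from rdkit.Chem import Draw

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O",
          "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# In a notebook the returned image renders inline; molsPerRow sets the grid width
img = Draw.MolsToGridImage(mols, molsPerRow=5, subImgSize=(200, 200),
                           legends=smiles)
```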
Step 4: Basic statistics
- Plot a histogram of logBB values (remember: not all molecules have this value — how does that affect your histogram?)
- Count how many compounds are BBB+ vs BBB-
- Compute the average molecular weight
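These counts are one-liners in pandas; the toy frame below stands in for B3DB, missing logBB values included:

```python
import numpy as np
import pandas as pd

# Toy stand-in for B3DB; note the missing logBB values
df = pd.DataFrame({
    "BBB+/BBB-": ["BBB+", "BBB+", "BBB-", "BBB-", "BBB+"],
    "logBB": [0.3, np.nan, -1.5, np.nan, 0.8],
})

class_counts = df["BBB+/BBB-"].value_counts()  # BBB+ vs BBB- counts
n_with_logbb = df["logBB"].notna().sum()       # rows that actually have logBB
```

Keep in mind that pandas plotting and `.mean()` skip NaN by default, so your logBB histogram and averages describe only the molecules that have values.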
Success check:
- You can display molecules as 2D structures
- You know the class balance (how many BBB+ vs BBB-)
- You have a sense of the logBB distribution
Exercise 0: Setup
Goal: Load the B3DB dataset and prepare your environment.
Step 1: Install dependencies
You'll need rdkit and umap-learn for this route. Figure out how to install them in your Colab environment.
Step 2: Load the dataset
Download and load the B3DB classification dataset:
import pandas as pd
url = "https://raw.githubusercontent.com/theochem/B3DB/main/B3DB/B3DB_classification.tsv"
df = pd.read_csv(url, sep='\t')
Step 3: Explore the dataset
Take a few minutes to poke around. What columns are in this dataset? What do they contain?
df.head()
df.columns
df.shape
These are just some suggestions to get you started — don't limit yourself to only these! Look at the column names and a few rows. What information do you have for each molecule? What's the target variable you'll be predicting?
About B3DB: This dataset comes from Meng et al., "A curated diverse molecular database of blood-brain barrier permeability with chemical descriptors." Sci Data 8, 289 (2021). https://doi.org/10.1038/s41597-021-01069-5
Take 5 minutes to learn more about where this data came from. Download the PDF and upload it to your favorite chatbot, then ask: "Summarize this paper. What molecules are in the dataset? How were they labeled? What makes this dataset useful for ML?"
Step 4: Create binary labels
For modeling, you'll need numeric labels (0 and 1) instead of strings. Run this:
df['BBB_class'] = (df['BBB+/BBB-'] == 'BBB+').astype(int)
Now explain in a markdown cell: what did that line just do? Break it down piece by piece — what does == 'BBB+' produce? What does .astype(int) do to that? Why do we need numeric labels for ML?
Success check:
- Dataset loads without errors
- You have a DataFrame with SMILES and BBB labels
- You know how many molecules are in the dataset
Why this route exists
The blood-brain barrier is one of the most critical filters in pharmacology — it controls which molecules can reach the central nervous system. Predicting BBB permeability early can help identify CNS-active drugs and flag compounds likely to fail.
In this route, you'll use cheminformatics and machine learning to:
- Represent molecules numerically
- Explore chemical space visually
- Train a model to classify BBB permeability
If you need a refresher on working with small molecules and RDKit, go climb some routes over at Wall W04: Small Molecule Representations.
What you'll be able to do after this route
By the end, you can:
- Represent small molecules as numerical features (descriptors and fingerprints)
- Visualize chemical space using dimensionality reduction
- Train a classifier to predict BBB permeability
- Interpret which molecular properties influence predictions
Key definitions
Blood-brain barrier (BBB) A selective barrier formed by endothelial cells that controls which molecules can pass from blood into the brain. Critical for CNS drug development.
logBB The logarithm of the brain-to-blood concentration ratio. Positive values indicate higher brain penetration. B3DB uses logBB > -1 as the threshold for BBB+.
Molecular descriptor A numerical value that encodes a physicochemical property of a molecule (e.g., molecular weight, lipophilicity, polar surface area).
Morgan fingerprint A circular fingerprint that encodes molecular substructures as a binary vector. Useful for measuring structural similarity between molecules.
Route 023: Predicting BBB Permeability
- RouteID: 023
- Wall: The Machine Learning Offwidth (W06)
- Grade: 5.10d
- Routesetters: Adrian & Abhiram
- Time: ~40 minutes
- Dataset: B3DB (Blood-Brain Barrier Database)
🧗 Base Camp
Start here and climb your way up!