🎉 Sent!
You made it to the top. Submit your work above!
Submission
Deliverables
Submit your completed notebook (.ipynb) with:
- Dataset exploration (molecule grid, logBB histogram, class counts)
- Descriptor distributions colored by BBB class + statistical tests
- 2D chemical space visualization with interpretation
- Model metrics (accuracy, ROC-AUC, ROC curve) with random split
- Model metrics with similarity-based split (leakage check)
- Reflection answers in markdown cells
Exercise 6: Reflection
Goal: Consolidate what you learned about molecular ML.
Answer in your notebook (2-3 sentences each):
- Which molecular descriptors seemed most predictive of BBB permeability? Why might that be, biologically?
- Remember your UMAP plot where BBB+ and BBB- overlapped? You might have thought "this isn't learnable." But then your classifier got decent performance. What's going on? Why can a model learn something that isn't obvious in a 2D projection?
- How did your model's performance change when you split by molecular similarity instead of randomly? What's the lesson here?
Optional stretch goals:
- Try combining both descriptors AND fingerprints as features. Does ROC-AUC improve beyond either alone?
- Swap in an MLPClassifier (a simple neural network). How does it compare to Random Forest?
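If you attempt the MLP stretch goal, here is a minimal sketch. It uses synthetic stand-in features from `make_classification`; swap in your own descriptor or fingerprint matrix and labels. The hidden layer size and iteration count are arbitrary starting points, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for your descriptor matrix X and 0/1 BBB labels y
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# MLPs are sensitive to feature scale, so standardize descriptors first
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0))
mlp.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1])
print(f"MLP ROC-AUC: {auc:.3f}")
```

Compare this AUC against your Random Forest on the same split before drawing any conclusions.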
Exercise 5: Check for Data Leakage
Goal: Apply what you learned in R019: The Leaky Pipeline to molecules.
In R019, you learned that structurally similar proteins in both train and test sets can inflate your accuracy — the model memorizes families instead of learning generalizable patterns. The exact same problem applies to small molecules.
Your task: Instead of a random train/test split, split by molecular similarity.
- Compute pairwise similarity between all molecules using your fingerprints (Tanimoto similarity works well here)
- Cluster the molecules by similarity
- Split by cluster: entire clusters go to either train OR test, not both
- Retrain your model and evaluate
A note on clustering: There's no single "right" way to cluster molecules. You might use agglomerative clustering with a distance threshold, DBSCAN, or something else entirely. And whatever method you choose will have parameters (e.g., what similarity threshold defines "too similar"?). That's okay — the point is to explore. Start simple, see what happens, then try different thresholds or methods if you're curious. In real research, you'd want to justify your choices, but for now, just get something working and observe the effect.
Questions to answer:
- Does your model's performance drop with the clustered split?
- If so, how much of your original accuracy was inflated by leakage?
- What does this tell you about how you should evaluate molecular ML models in real research?
You've done this before with proteins — now you should be able to do it with molecules. If you're stuck, revisit R019.
Exercise 4: Train a Classifier
Goal: Build a simple classifier to predict BBB+ (permeable) vs BBB- (non-permeable).
Step 1: Train with descriptors
Start with the 7 descriptors you computed (MW, LogP, TPSA, etc.). Split your data into 80% training / 20% test (stratified by BBB class), then train a RandomForestClassifier.
Hint: Use train_test_split from sklearn with stratify=y to keep class balance.
Step 2: Evaluate
Report accuracy and ROC-AUC. Plot the ROC curve.
Hint: Look into accuracy_score, roc_auc_score, and RocCurveDisplay from sklearn.metrics. You'll need both hard predictions (predict) and probability scores (predict_proba).
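Putting Steps 1 and 2 together, here is a minimal sketch with synthetic stand-in data; replace `X` and `y` with your descriptor matrix and BBB labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-in: swap in your 7-descriptor matrix X and 0/1 labels y
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)  # stratify keeps class balance

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))             # hard predictions
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])  # probability scores
print(f"accuracy={acc:.3f}  ROC-AUC={auc:.3f}")
```

For the ROC curve itself, `RocCurveDisplay.from_estimator(clf, X_test, y_test)` draws it in one call.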
You should get a ROC-AUC around 0.90-0.95. Wait — that's really good for just 7 simple features! Are you surprised? With just molecular weight, LogP, and a few other basic properties, you can predict BBB permeability pretty well.
Step 3: Feature importance
Check which features your model finds most important. For Random Forest, look at the .feature_importances_ attribute. Plot the top features as a bar chart.
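A sketch of the importance plot, again on a synthetic stand-in model; the names are the seven descriptors from Exercise 2, paired here with fake features purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

names = ["MolWt", "MolLogP", "TPSA", "NumHDonors",
         "NumHAcceptors", "NumRotatableBonds", "FractionCSP3"]

# Stand-in data; use your real descriptor matrix and fitted model instead
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Sort features from most to least important and plot as a bar chart
order = np.argsort(clf.feature_importances_)[::-1]
plt.bar([names[i] for i in order], clf.feature_importances_[order])
plt.ylabel("importance")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
```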
This is a big advantage of descriptors: interpretability. You can explain to a chemist exactly which properties matter. Can you make biological sense of which descriptors are most predictive?
Step 4: Now try fingerprints
You got great performance with simple, interpretable features. But what about fingerprints — those 1024-bit vectors encoding molecular substructures?
Train a new model using fingerprints instead of descriptors. Compare the ROC-AUC.
Questions to consider:
- Do fingerprints outperform descriptors? By how much? (You might see something like 0.95 vs 0.96)
- Is a 0.01 improvement in ROC-AUC actually meaningful? Is it worth losing the ability to explain which molecular properties matter?
- When would you choose interpretable features over marginally better black-box features? When might you make the opposite choice?
Success check:
- You have ROC-AUC for both descriptor-based and fingerprint-based models
- You can articulate the interpretability vs. performance tradeoff
- You identified the most important descriptors and can explain why they might matter biologically
Exercise 3: Visualize Chemical Space
Goal: See how molecules group together based on structural similarity.
Step 1: Compute molecular fingerprints
Generate Morgan fingerprints for each molecule. Use 1024 bits and radius 2.
Hint: Look into AllChem.GetMorganFingerprintAsBitVect. You'll need to convert each fingerprint to a numpy array and stack them into a matrix.
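A minimal sketch of this step, using three hardcoded SMILES in place of your DataFrame column:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]  # stand-in for df["SMILES"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

fps = []
for mol in mols:
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    fps.append(np.array(bv))  # convert RDKit bit vector to a numpy row
X_fp = np.vstack(fps)         # shape: (n_molecules, 1024)
```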
These binary vectors encode each molecule's substructure patterns — a digital fingerprint.
Step 2: Project into 2D
Use UMAP (or t-SNE) to reduce your 1024-dimensional fingerprints down to 2D coordinates.
Hint: You installed umap-learn earlier. Ask your chatbot how to use UMAP for dimensionality reduction.
This step may take a minute — that's normal.
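As one possible starting point, here is the t-SNE variant on a random stand-in fingerprint matrix; the perplexity value is an arbitrary choice you should tune to your dataset size:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_fp = rng.integers(0, 2, size=(80, 1024)).astype(float)  # stand-in fingerprints

# perplexity must be smaller than the number of samples
coords = TSNE(n_components=2, perplexity=20,
              random_state=0).fit_transform(X_fp)
```

With umap-learn the call is analogous, e.g. `umap.UMAP(n_components=2).fit_transform(X_fp)`; for binary fingerprints a Jaccard/Tanimoto metric is a natural alternative to the Euclidean default.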
Step 3: Plot and interpret
Create a scatter plot of your 2D coordinates, colored by BBB class.
Questions to answer:
- Do BBB+ and BBB- compounds cluster separately, or do they overlap?
- If they overlap a lot... does that mean we can't machine learn this? Is the UMAP the final word on whether a classification problem is solvable? Think about what UMAP is actually showing you (a 2D projection of structural similarity) versus what your classifier will be working with.
Don't despair if the UMAP looks messy — we'll revisit this question after you train your model.
Success check:
- You have a 2D scatter plot with visible structure
- You wrote a short interpretation of what you see (and your current hypothesis about whether this is learnable)
Exercise 2: Generate Molecular Descriptors
Goal: Convert molecules into numerical features that describe their physicochemical properties.
Step 1: Compute RDKit descriptors
Calculate the following for each molecule and add them as new columns:
| Descriptor | What it measures |
|---|---|
| MolWt | Molecular weight |
| MolLogP | Predicted lipophilicity |
| TPSA | Topological polar surface area |
| NumHDonors | Hydrogen-bond donors |
| NumHAcceptors | Hydrogen-bond acceptors |
| NumRotatableBonds | Molecular flexibility |
| FractionCSP3 | Fraction of sp3-hybridized carbons |
Hint: These are all in rdkit.Chem.Descriptors. Ask your chatbot how to compute RDKit descriptors for a DataFrame of molecules.
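One way to sketch this, assuming a DataFrame with a SMILES column (three hardcoded molecules here stand in for B3DB):

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

df = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]})
df["mol"] = df["SMILES"].apply(Chem.MolFromSmiles)

# Map each new column name to its RDKit descriptor function
descriptor_fns = {
    "MolWt": Descriptors.MolWt,
    "MolLogP": Descriptors.MolLogP,
    "TPSA": Descriptors.TPSA,
    "NumHDonors": Descriptors.NumHDonors,
    "NumHAcceptors": Descriptors.NumHAcceptors,
    "NumRotatableBonds": Descriptors.NumRotatableBonds,
    "FractionCSP3": Descriptors.FractionCSP3,
}
for name, fn in descriptor_fns.items():
    df[name] = df["mol"].apply(fn)
```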
Step 2: Explore distributions by class
Create histograms for ALL of the descriptors you computed, colored by BBB class. You want to see if the distributions differ between BBB+ and BBB- molecules.
Hint: Seaborn's histplot with the hue parameter is useful here.
Step 3: Test for statistical significance
Your histograms give you a visual sense of which descriptors differ between classes. Now quantify it: run a t-test (or similar) for each descriptor to see if the difference between BBB+ and BBB- is statistically significant.
Hint: scipy.stats.ttest_ind is your friend here.
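A sketch of one such test, with synthetic normal samples standing in for a descriptor split by class (e.g. `df.loc[df["BBB_class"] == 1, "MolLogP"]`):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Stand-ins for one descriptor's values in each class
logp_bbb_pos = rng.normal(2.5, 1.0, 300)
logp_bbb_neg = rng.normal(1.0, 1.0, 300)

# equal_var=False gives Welch's t-test, which doesn't assume equal variances
t_stat, p_value = ttest_ind(logp_bbb_pos, logp_bbb_neg, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.2e}")
```

Loop this over all seven descriptor columns and collect the p-values in one place.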
Questions to answer:
- Which descriptors show statistically significant differences (p < 0.05)?
- Do BBB+ compounds tend to be more lipophilic (higher LogP)? Less polar (lower TPSA)?
- Does your visual intuition from the histograms match the statistical test results, or do the two disagree anywhere?
Advanced hint: You're running multiple statistical tests (one per descriptor). If you want to be rigorous, you should correct for multiple hypothesis testing — otherwise you'll get false positives. Look into Bonferroni correction or ask your chatbot about it.
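Bonferroni correction itself is a one-liner: with n tests, compare each p-value to alpha / n (equivalently, multiply each p-value by n). The p-values below are made up for illustration:

```python
n_tests = 7       # one test per descriptor
alpha = 0.05
p_values = [0.001, 0.04, 0.20, 0.008, 0.03, 0.6, 0.0001]  # illustrative only

# Bonferroni: a result is significant only if p < alpha / n_tests
significant = [p < alpha / n_tests for p in p_values]
```

Notice how some descriptors that pass p < 0.05 no longer pass the corrected threshold of roughly 0.007.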
Write up your findings in a markdown cell.
Success check:
- Your DataFrame has new descriptor columns
- You have histograms showing class differences
- You noted which descriptors seem predictive
Exercise 1: Explore the Dataset
Goal: Understand the structure and content of B3DB before modeling.
Step 1: Inspect the data
Display the first few rows and understand the key columns:
- logBB — Brain-to-blood concentration ratio. Positive = higher brain concentration.
- BBB+/BBB- — Categorical label. BBB+ = permeable, BBB- = non-permeable.
- threshold — The cutoff used to define BBB+ vs BBB-. Our understanding: B3DB compiled data from many sources that used different thresholds, and this column records what each source used. But honestly, we don't fully understand why so many values are NaN here — can you figure out what's going on?
Key observation: All molecules have a BBB+/BBB- label, and the classes are reasonably balanced. But not all molecules have a logBB value — many are missing. Why is that? How can you have a BBB+/BBB- label without a logBB ratio? See if you can figure this out from the paper, or reason about how a dataset like this might have been compiled.
Step 2: Generate molecule objects
Use RDKit to convert the SMILES strings into molecule objects. Add a new column to your DataFrame with the molecule objects.
Hint: Look into Chem.MolFromSmiles. If you're stuck, ask your chatbot how to apply an RDKit function to a pandas column.
Heads up: Two of the molecules will fail to parse (you'll get None instead of a molecule object). Investigate: which ones failed? Why? Can you fix them or should you drop them? It's okay if you can't fix them — just don't let them break your downstream code.
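A small sketch of the parse-and-check pattern, with a deliberately broken SMILES standing in for the two failures you'll hit in B3DB:

```python
import pandas as pd
from rdkit import Chem

df = pd.DataFrame({"SMILES": ["CCO", "not_a_smiles", "c1ccccc1"]})
df["mol"] = df["SMILES"].apply(Chem.MolFromSmiles)  # failures come back as None

# Inspect the failures before deciding whether to fix or drop them
bad = df[df["mol"].isna()]
print(bad["SMILES"].tolist())

df = df.dropna(subset=["mol"])  # here we simply drop them
```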
Step 3: Visualize sample molecules
Pick 10 random molecules and display them as 2D structures in a grid.
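One way to do this is with RDKit's grid drawing, sketched here on four hardcoded molecules; in your notebook you'd pass the mol objects from a `df.sample(10)` instead:

```python
from rdkit import Chem
from rdkit.Chem import Draw

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O",
          "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# In a notebook the returned image renders inline; molsPerRow sets the grid width
img = Draw.MolsToGridImage(mols, molsPerRow=5, subImgSize=(200, 200),
                           legends=smiles)
```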
Step 4: Basic statistics
- Plot a histogram of logBB values (remember: not all molecules have this value — how does that affect your histogram?)
- Count how many compounds are BBB+ vs BBB-
- Compute the average molecular weight
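These counts are one-liners in pandas; the toy frame below stands in for B3DB, missing logBB values included:

```python
import numpy as np
import pandas as pd

# Toy stand-in for B3DB; note the missing logBB values
df = pd.DataFrame({
    "BBB+/BBB-": ["BBB+", "BBB+", "BBB-", "BBB-", "BBB+"],
    "logBB": [0.3, np.nan, -1.5, np.nan, 0.8],
})

class_counts = df["BBB+/BBB-"].value_counts()  # BBB+ vs BBB- counts
n_with_logbb = df["logBB"].notna().sum()       # rows that actually have logBB
```

Keep in mind that pandas plotting and `.mean()` skip NaN by default, so your logBB histogram and averages describe only the molecules that have values.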
Success check:
- You can display molecules as 2D structures
- You know the class balance (how many BBB+ vs BBB-)
- You have a sense of the logBB distribution
Exercise 0: Setup
Goal: Load the B3DB dataset and prepare your environment.
Step 1: Install dependencies
You'll need rdkit and umap-learn for this route. Figure out how to install them in your Colab environment.
Step 2: Load the dataset
Download and load the B3DB classification dataset:
import pandas as pd
url = "https://raw.githubusercontent.com/theochem/B3DB/main/B3DB/B3DB_classification.tsv"
df = pd.read_csv(url, sep='\t')
Step 3: Explore the dataset
Take a few minutes to poke around. What columns are in this dataset? What do they contain?
df.head()
df.columns
df.shape
These are just some suggestions to get you started — don't limit yourself to only these! Look at the column names and a few rows. What information do you have for each molecule? What's the target variable you'll be predicting?
About B3DB: This dataset comes from Meng et al., "A curated diverse molecular database of blood-brain barrier permeability with chemical descriptors." Sci Data 8, 289 (2021). https://doi.org/10.1038/s41597-021-01069-5
Take 5 minutes to learn more about where this data came from. Download the PDF and upload it to your favorite chatbot, then ask: "Summarize this paper. What molecules are in the dataset? How were they labeled? What makes this dataset useful for ML?"
Step 4: Create binary labels
For modeling, you'll need numeric labels (0 and 1) instead of strings. Run this:
df['BBB_class'] = (df['BBB+/BBB-'] == 'BBB+').astype(int)
Now explain in a markdown cell: what did that line just do? Break it down piece by piece — what does == 'BBB+' produce? What does .astype(int) do to that? Why do we need numeric labels for ML?
Success check:
- Dataset loads without errors
- You have a DataFrame with SMILES and BBB labels
- You know how many molecules are in the dataset
Why this route exists
The blood-brain barrier is one of the most critical filters in pharmacology — it controls which molecules can reach the central nervous system. Predicting BBB permeability early can help identify CNS-active drugs and flag compounds likely to fail.
In this route, you'll use cheminformatics and machine learning to:
- Represent molecules numerically
- Explore chemical space visually
- Train a model to classify BBB permeability
If you need a refresher on working with small molecules and RDKit, go climb some routes over at Wall W04: Small Molecule Representations.
What you'll be able to do after this route
By the end, you can:
- Represent small molecules as numerical features (descriptors and fingerprints)
- Visualize chemical space using dimensionality reduction
- Train a classifier to predict BBB permeability
- Interpret which molecular properties influence predictions
Key definitions
Blood-brain barrier (BBB) A selective barrier formed by endothelial cells that controls which molecules can pass from blood into the brain. Critical for CNS drug development.
logBB The logarithm of the brain-to-blood concentration ratio. Positive values indicate higher brain penetration. B3DB uses logBB > -1 as the threshold for BBB+.
Molecular descriptor A numerical value that encodes a physicochemical property of a molecule (e.g., molecular weight, lipophilicity, polar surface area).
Morgan fingerprint A circular fingerprint that encodes molecular substructures as a binary vector. Useful for measuring structural similarity between molecules.
Route 023: Predicting BBB Permeability
- RouteID: 023
- Wall: The Machine Learning Offwidth (W06)
- Grade: 5.10d
- Routesetters: Adrian & Abhiram
- Time: ~40 minutes
- Dataset: B3DB (Blood-Brain Barrier Database)
🧗 Base Camp
Start here and climb your way up!