Route 009: From SMILES to Molecular Properties

RouteID: 009
Wall: Small Molecule Representations
Grade: 5.10b
Routesetter: Ricardo + Sarah
Time: ~unsure at the moment
Dataset: FreeSolv (experimental hydration free energies, used in this route as the solubility measure)

Why this route exists

When working with molecular datasets, structural information is often converted into numerical descriptors.

In this route, you’ll take molecules written as SMILES strings, convert them into RDKit molecule objects, and compute a small set of physicochemical descriptors. You’ll then explore how those descriptors relate to experimental solubility.

What you’ll be able to do after this route

By the end, you will be able to:

Parse SMILES strings into RDKit Mol objects
Safely handle invalid or missing molecular data
Compute basic molecular descriptors
Generate fingerprints and fingerprint density
Visualize descriptor distributions
Reason about structure → property relationships

Key definitions

SMILES (Simplified Molecular Input Line Entry System): A compact text format for writing molecules using atoms, bonds, branching, and ring closures. Examples:
CCO → ethanol
c1ccccc1 → benzene

RDKit: An open-source Python toolkit for cheminformatics. It parses SMILES, builds molecular graphs, computes descriptors, and generates fingerprints.

Mol object (RDKit Mol): RDKit’s internal representation of a molecule. Once you have a Mol object, you can compute properties from it.

Descriptor: A numerical summary of a molecular property (e.g., molecular weight, LogP).

Fingerprint: A bit vector encoding which substructures are present in a molecule.

Fingerprint density: The number of “on” bits in a molecule’s fingerprint divided by a measure of its size (in this route, its molecular mass), so that molecules of different sizes can be compared.

Exercise 0: The Knot Check (Setup & Data)

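Goal: Get the FreeSolv dataset loaded and understand what you’re working with. FreeSolv contains experimental and calculated hydration free energies (solvation in water) for small molecules. This route uses the experimental values as its measure of solubility.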

Download the FreeSolv dataset.
Upload it into your Jupyter or Colab notebook.
Load it using pandas.read_csv (skip the first two rows).
Identify the column containing SMILES strings.
Print a few example SMILES.

Success check:

The dataframe loads without errors.
You can clearly identify valid SMILES strings.
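As a minimal sketch of the loading steps above, the snippet below reads a tiny stand-in file from memory so that it runs without a download. The three rows, the column names (compound_id, smiles, expt), the semicolon delimiter, and every numeric value are placeholders invented for illustration, not FreeSolv data. Check your downloaded file for its actual delimiter and column names and adjust accordingly.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: two leading lines to skip, then a header row.
# Every value below is a made-up placeholder, NOT real FreeSolv data.
RAW_TEXT = """placeholder line 1
placeholder line 2
compound_id;smiles;expt
ex_001;CCO;-1.0
ex_002;c1ccccc1;-2.0
ex_003;CC(=O)O;-3.0
"""


def load_dataset(source, delimiter=";"):
    """Read the table, skipping the first two rows as the exercise instructs."""
    df = pd.read_csv(source, sep=delimiter, skiprows=2)
    # Strip stray whitespace from column names so lookups by name are reliable.
    df.columns = [str(name).strip() for name in df.columns]
    return df


df = load_dataset(io.StringIO(RAW_TEXT))
print(df.columns.tolist())  # inspect the names to find the SMILES column
print(df["smiles"].head())  # print a few example SMILES
```

With a real file, pass its path (or the uploaded file object) to load_dataset instead of the io.StringIO object.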

Exercise 1: SMILES → Mol Objects

Goal: Convert SMILES strings into RDKit molecule objects.

Import RDKit (Chem).
Write a function that:
1. Takes a SMILES string
2. Converts it into an RDKit Mol object
3. Returns None if the SMILES is invalid
Apply this function to the SMILES column.

Tools:

Small RDKit tutorial
Chem.MolFromSmiles
Note: always assume some entries may be missing or malformed.

Success check:

Valid SMILES produce Mol objects.
Invalid entries do not crash your notebook.

QUESTION
- How would you explain the difference between a SMILES string and a Mol object to a classmate?

Common fall:

Assuming every SMILES converts to a valid molecule.
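One possible shape for the conversion function is sketched below. The function name smiles_to_mol and the example Series are hypothetical. The sketch relies on Chem.MolFromSmiles returning None for strings it cannot parse, and it adds an explicit check for non-string and empty inputs, which is how missing values typically appear in a pandas column.

```python
import pandas as pd
from rdkit import Chem


def smiles_to_mol(smiles):
    """Return an RDKit Mol for a SMILES string, or None if it cannot be parsed."""
    # Missing values in pandas usually show up as NaN (a float) or None.
    if not isinstance(smiles, str) or not smiles.strip():
        return None
    # Chem.MolFromSmiles returns None (and logs a message) for invalid SMILES
    # instead of raising an exception.
    return Chem.MolFromSmiles(smiles.strip())


# Hypothetical example column: two valid entries, one malformed, one missing.
examples = pd.Series(["CCO", "c1ccccc1", "not_a_smiles", None])
mols = examples.apply(smiles_to_mol)
print(mols.notna().sum(), "of", len(mols), "entries parsed")
```

Counting the None results after applying the function is a quick way to see how many rows need attention before you compute descriptors.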

Exercise 2: Molecular Mass (Size)

Goal: Compute molecular weight for each molecule.

Write a function that takes a Mol object.
Use RDKit to compute molecular weight.
Store the result in a new column called "MM".

Tools:

rdkit.Chem.Descriptors.MolWt
If the Mol is None, return 0.0

Success check:

Larger molecules have larger MM values.
No negative or obviously incorrect values appear.
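The steps above could be sketched as follows. The function name molecular_mass is hypothetical; Descriptors.MolWt is the RDKit call listed under Tools, and the 0.0 return for a missing Mol follows the exercise's instruction.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def molecular_mass(mol):
    """Return the average molecular weight of a Mol, or 0.0 if mol is None."""
    if mol is None:
        # Placeholder value requested by the exercise for unparseable entries.
        return 0.0
    return Descriptors.MolWt(mol)


ethanol = Chem.MolFromSmiles("CCO")
benzene = Chem.MolFromSmiles("c1ccccc1")
print(molecular_mass(ethanol))
print(molecular_mass(benzene))
print(molecular_mass(None))
```

On a dataframe, something like df["MM"] = df["mol"].apply(molecular_mass) would fill the new column, where "mol" is an assumed name for the column holding your Mol objects. Because 0.0 is not a physical mass, it may be worth filtering those rows out before plotting.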

Exercise 3: Fingerprints & Fingerprint Density

Goal: Quantify molecular structure complexity.

Generate Morgan fingerprints (radius = 1).
Count the number of “on” bits in each fingerprint.
Normalize by molecular size (here, the molecular mass from Exercise 2) to compute fingerprint density.
Store the result in a column called "FPM".

Tools:

AllChem.GetMorganFingerprintAsBitVect
fp.GetNumOnBits()

Success check:

FPM values are finite and comparable.
Molecules of similar size can still have different densities.

QUESTIONS
- Why does normalizing fingerprint counts by size matter?
- How could two molecules with similar mass still have very different fingerprint densities?

Common fall:

Using raw bit counts without normalization.
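A hedged sketch of the fingerprint-density calculation follows. It assumes that "molecular size" means molecular mass, that the fingerprint length is 2048 bits, and that a missing Mol gets the same 0.0 placeholder used for the "MM" column; none of these choices is fixed by the exercise. The function name fingerprint_density is hypothetical. Recent RDKit releases may print a deprecation warning for GetMorganFingerprintAsBitVect, but the call is the one listed under Tools.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors


def fingerprint_density(mol, radius=1, n_bits=2048):
    """Return the Morgan on-bit count divided by molecular weight (0.0 for None)."""
    if mol is None:
        return 0.0  # assumption: same placeholder convention as the "MM" column
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    on_bits = fp.GetNumOnBits()
    mass = Descriptors.MolWt(mol)
    # Assumption: "molecular size" is taken to be the molecular mass.
    return on_bits / mass if mass > 0 else 0.0


# Two molecules of similar mass: a straight chain and a ring.
hexane = Chem.MolFromSmiles("CCCCCC")
cyclohexane = Chem.MolFromSmiles("C1CCCCC1")
print(fingerprint_density(hexane))
print(fingerprint_density(cyclohexane))
print(fingerprint_density(None))
```

Comparing the two printed densities is one way to start on the question about molecules with similar mass.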

Exercise 4: LogP (Hydrophobicity)

Goal: Estimate molecular hydrophobicity.

Compute LogP using RDKit’s Crippen method.
Store the result in a column called "LogP".

Tools:

rdkit.Chem.Crippen.MolLogP

Success check:

Hydrophobic molecules have higher LogP values.
Polar molecules have lower LogP values.
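The LogP step might look like the sketch below. The function name crippen_logp is hypothetical. The exercise does not say what to return for a missing Mol; NaN is assumed here because 0.0 is a plausible LogP value and would be indistinguishable from a real result.

```python
from rdkit import Chem
from rdkit.Chem import Crippen


def crippen_logp(mol):
    """Return the Crippen LogP estimate for a Mol, or NaN if mol is None."""
    if mol is None:
        # Assumption: NaN marks "no value" without looking like a real LogP.
        return float("nan")
    return Crippen.MolLogP(mol)


hexane = Chem.MolFromSmiles("CCCCCC")  # nonpolar hydrocarbon
ethanol = Chem.MolFromSmiles("CCO")    # small polar alcohol
print(crippen_logp(hexane))
print(crippen_logp(ethanol))
```

Checking a known hydrophobic molecule against a known polar one, as in the two print calls, is a quick sanity test for the success check above.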

Exercise 5: Reading the Rock (Structure vs Solubility)

Goal: Build intuition about how molecular properties relate to solubility.

Plot experimental solubility vs:
1. Molecular mass
2. Fingerprint density
3. LogP
Plot histograms of:
1. MM
2. FPM
3. LogP
4. Experimental solubility
In your Jupyter notebook, answer:
1. Which descriptor appears most related to solubility?
2. Which relationships are noisy or weak?
3. What molecular features might matter that we didn’t encode?

There is no single correct answer. Clear reasoning matters more than conclusions.

Success check:

You can describe at least one visible trend.
You can explain why that trend makes chemical sense.
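The plots requested in this exercise can be laid out in a single figure, as in the sketch below. The function name plot_descriptor_overview, the target column name "expt", and the toy dataframe are hypothetical; the toy numbers are placeholders invented so the example runs, not measurements.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_descriptor_overview(df, target, descriptors):
    """Top row: target vs each descriptor. Bottom row: one histogram per column."""
    columns = list(descriptors) + [target]
    fig, axes = plt.subplots(
        2, len(columns), figsize=(4 * len(columns), 7), squeeze=False
    )
    for i, name in enumerate(columns):
        if name != target:
            axes[0, i].scatter(df[name], df[target], s=12, alpha=0.6)
            axes[0, i].set_xlabel(name)
            axes[0, i].set_ylabel(target)
        else:
            # No scatter of the target against itself; leave this panel blank.
            axes[0, i].axis("off")
        axes[1, i].hist(df[name].dropna(), bins=20)
        axes[1, i].set_xlabel(name)
        axes[1, i].set_ylabel("count")
    fig.tight_layout()
    return fig, axes


# Made-up placeholder values, NOT real data; replace with your own dataframe.
toy = pd.DataFrame(
    {
        "MM": [46.0, 78.0, 60.0, 86.0],
        "FPM": [0.13, 0.04, 0.12, 0.07],
        "LogP": [0.0, 1.7, 0.1, 2.6],
        "expt": [-5.0, -1.0, -6.5, 2.5],
    }
)
fig, axes = plot_descriptor_overview(toy, target="expt", descriptors=["MM", "FPM", "LogP"])
```

Keeping the scatter plots and histograms in one grid makes it easier to compare descriptors side by side when you write up which relationships look strong and which look noisy.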

Deliverables

Please submit the following two items:

A completed Jupyter notebook (.ipynb)
1. The notebook should run top-to-bottom without errors.
2. It should include your code and any brief comments you added while working.
File naming convention → lastname_firstname_RID_009_code.ipynb
A short logbook entry (plain text, ~5–10 sentences)
1. Briefly describe:
  1. What was tricky or confusing
  2. What helped you get unstuck
  3. One thing you learned about working with real data
File naming convention → lastname_firstname_RID_009_logbook.txt

Submission

Submit your files by uploading them through the submission link: SUBMIT LINK

Please upload both:

your .ipynb notebook
your logbook file

Make sure filenames follow the naming conventions above. Good job completing this route!