Route 009: From SMILES to Molecular Properties
- RouteID: 009
- Wall: Small Molecule Representations
- Grade: 5.10b
- Routesetter: Ricardo + Sarah
- Time: to be determined
- Dataset: FreeSolv (experimental hydration free energies, used here as a measure of solubility)
Why this route exists
When working with molecular datasets, structural information is often converted into numerical descriptors.
In this route, you’ll take molecules written as SMILES strings, convert them into RDKit molecule objects, and compute a small set of physicochemical descriptors. You’ll then explore how those descriptors relate to experimental solubility.
What you’ll be able to do after this route
By the end, you will be able to:
- Parse SMILES strings into RDKit Mol objects
- Safely handle invalid or missing molecular data
- Compute basic molecular descriptors
- Generate fingerprints and compute fingerprint density
- Visualize descriptor distributions
- Reason about structure → property relationships
Key definitions
SMILES (Simplified Molecular Input Line Entry System): A compact text format for writing molecules using atoms, bonds, branching, and ring closures. Examples: CCO → ethanol; c1ccccc1 → benzene.
RDKit: An open-source Python toolkit for cheminformatics. It parses SMILES, builds molecular graphs, computes descriptors, and generates fingerprints.
Mol object (RDKit Mol): RDKit’s internal representation of a molecule. Once you have a Mol object, you can compute properties from it.
Descriptor: A numerical summary of a molecular property (e.g., molecular weight, LogP).
Fingerprint: A bit vector encoding which substructures are present in a molecule.
Fingerprint density: A normalized measure of how many fingerprint features a molecule has relative to its size.
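To make the SMILES and Mol definitions concrete, here is a minimal sketch using the two example molecules above. The variable names are illustrative only. It shows that two different strings can describe the same molecule, and that a Mol object exposes atoms in a way a raw string does not.

```python
from rdkit import Chem

# Two different SMILES strings that describe the same molecule (ethanol).
mol_a = Chem.MolFromSmiles("CCO")
mol_b = Chem.MolFromSmiles("OCC")

# The input strings differ, but the canonical SMILES written back from
# each Mol object are identical.
canonical_a = Chem.MolToSmiles(mol_a)
canonical_b = Chem.MolToSmiles(mol_b)
print(canonical_a == canonical_b)

# A Mol object is a graph: you can ask it about atoms and bonds.
benzene = Chem.MolFromSmiles("c1ccccc1")
print(benzene.GetNumAtoms())  # counts heavy atoms; hydrogens are implicit
print(all(atom.GetIsAromatic() for atom in benzene.GetAtoms()))
```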
Exercise 0: The Knot Check (Setup & Data)
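Goal: Get the FreeSolv dataset loaded and understand what you’re working with. FreeSolv contains experimental and calculated hydration free energies of small molecules in water; this route uses the experimental values as its measure of solubility.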
- Download the FreeSolv dataset.
- Upload it into your Jupyter or Colab notebook.
- Load it using pandas.read_csv (skip the first two rows).
- Identify the column containing SMILES strings.
- Print a few example SMILES.
Success check:
- The dataframe loads without errors.
- You can clearly identify valid SMILES strings.
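The loading steps above might look like the sketch below. It assumes a semicolon-delimited text file with two comment rows before the header; check your downloaded file, since the delimiter and column names may differ. The inline SAMPLE text is a stand-in so the snippet runs on its own, and its numbers are placeholders, not real FreeSolv values. The helper name load_table is hypothetical.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: two comment rows, then a header row.
# All values are placeholders for illustration, NOT real FreeSolv data.
SAMPLE = """# placeholder comment line 1
# placeholder comment line 2
compound id; SMILES; iupac name; experimental value (kcal/mol)
mol_001; CCO; ethanol; -1.0
mol_002; c1ccccc1; benzene; -2.0
mol_003; not_a_smiles; broken entry; -3.0
"""


def load_table(source):
    """Read the table, skipping the first two rows (assumed to be comments)."""
    df = pd.read_csv(source, sep=";", skiprows=2, skipinitialspace=True)
    # Remove stray whitespace from the column names so lookups are predictable.
    df.columns = df.columns.str.strip()
    return df


df = load_table(io.StringIO(SAMPLE))
print(df.columns.tolist())
print(df["SMILES"].head())
```

In your notebook, pass the path of the uploaded file to load_table instead of the StringIO object.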
Exercise 1: SMILES → Mol Objects
Goal: Convert SMILES strings into RDKit molecule objects.
- Import RDKit (Chem).
- Write a function that:
- Takes a SMILES string
- Converts it into an RDKit Mol object
- Returns None if the SMILES is invalid
- Apply this function to the SMILES column.
Tools:
- Small RDKit tutorial
- Chem.MolFromSmiles
- Always assume some entries may be missing or malformed
Success check:
- Valid SMILES produce Mol objects.
- Invalid entries do not crash your notebook.
Question:
- How would you explain the difference between a SMILES string and a Mol object to a classmate?
Common fall:
- Assuming every SMILES converts to a valid molecule.
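One possible shape for the conversion function is sketched below. The function name smiles_to_mol and the "mol" column name are illustrative choices, not part of the route. Chem.MolFromSmiles already returns None for strings it cannot parse, so the extra guard only covers missing values and empty strings.

```python
import pandas as pd
from rdkit import Chem


def smiles_to_mol(smiles):
    """Return an RDKit Mol for a SMILES string, or None if it cannot be parsed."""
    # Missing values (None, NaN) and blank strings are treated as invalid.
    if not isinstance(smiles, str) or not smiles.strip():
        return None
    # MolFromSmiles returns None (and logs a message) on a parse failure.
    return Chem.MolFromSmiles(smiles.strip())


# Small demo table with one malformed and one missing entry.
demo = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "not_a_smiles", None]})
demo["mol"] = demo["SMILES"].apply(smiles_to_mol)
print(demo["mol"].notna().sum(), "of", len(demo), "entries parsed")
```

RDKit prints a parse error for the malformed entry, but the notebook keeps running and that row simply holds None.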
Exercise 2: Molecular Mass (Size)
Goal: Compute molecular weight for each molecule.
- Write a function that takes a Mol object.
- Use RDKit to compute molecular weight.
- Store the result in a new column called "MM".
Tools:
- rdkit.Chem.Descriptors.MolWt
- If the Mol is None, return 0.0
Success check:
- Larger molecules have larger MM values.
- No negative or obviously incorrect values appear.
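A minimal sketch of the molecular mass step, following the 0.0 fallback suggested in the tools list. The function name molecular_mass is an illustrative choice.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def molecular_mass(mol):
    """Return the average molecular weight of a Mol, or 0.0 if mol is None."""
    if mol is None:
        return 0.0
    return Descriptors.MolWt(mol)


ethanol = Chem.MolFromSmiles("CCO")
benzene = Chem.MolFromSmiles("c1ccccc1")
print(round(molecular_mass(ethanol), 2), round(molecular_mass(benzene), 2))
```

Assuming your Mol objects live in a column named "mol" (a hypothetical name), the new column could be filled with df["MM"] = df["mol"].apply(molecular_mass). Rows that come back as 0.0 mark molecules that failed to parse, so consider filtering them out before plotting.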
Exercise 3: Fingerprints & Fingerprint Density
Goal: Quantify molecular structure complexity.
- Generate Morgan fingerprints (radius = 1).
- Count the number of “on” bits in each fingerprint.
- Normalize by molecular size to compute fingerprint density.
- Store the result in a column called "FPM".
Tools:
- AllChem.GetMorganFingerprintAsBitVect
- fp.GetNumOnBits()
Success check:
- FPM values are finite and comparable.
- Molecules of similar size can still have different densities.
Questions:
- Why does normalizing fingerprint counts by size matter?
- How could two molecules with similar mass still have very different fingerprint densities?
Common fall:
- Using raw bit counts without normalization.
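The fingerprint density steps can be sketched as follows. Two things here are assumptions rather than part of the route: the fingerprint length (n_bits=2048) and the choice of molecular weight as the measure of "size". Heavy-atom count would be another reasonable normalizer. The function name is illustrative. Newer RDKit releases mark GetMorganFingerprintAsBitVect as deprecated in favor of the rdFingerprintGenerator module, so you may see a warning.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors


def fingerprint_density(mol, radius=1, n_bits=2048):
    """On-bit count of a Morgan fingerprint divided by molecular weight.

    Assumption: "size" is taken to be molecular weight. Returns 0.0 when
    mol is None, mirroring the fallback used for molecular mass.
    """
    if mol is None:
        return 0.0
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    mass = Descriptors.MolWt(mol)
    if mass == 0:
        return 0.0
    return fp.GetNumOnBits() / mass


# Two molecules of similar mass: a plain chain and a branched ketone.
hexane = Chem.MolFromSmiles("CCCCCC")
ketone = Chem.MolFromSmiles("CC(C)C(C)=O")
for name, mol in [("hexane", hexane), ("3-methyl-2-butanone", ketone)]:
    print(name, round(Descriptors.MolWt(mol), 1), fingerprint_density(mol))
```

Comparing the two printed densities is one way to start on the second question: the chain repeats the same few atom environments, while the ketone packs more distinct ones into a similar mass.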
Exercise 4: LogP (Hydrophobicity)
Goal: Estimate molecular hydrophobicity.
- Compute LogP using RDKit’s Crippen method.
- Store the result in a column called "LogP".
Tools:
- rdkit.Chem.Crippen.MolLogP
Success check:
- Hydrophobic molecules have higher LogP values.
- Polar molecules have lower LogP values.
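A minimal sketch of the LogP step. The route does not say what to return for a missing molecule, so the NaN fallback here is an assumption. It is used instead of 0.0 because 0.0 is a plausible LogP value and would be hard to tell apart from a real result. The function name is illustrative.

```python
import math

from rdkit import Chem
from rdkit.Chem import Crippen


def crippen_logp(mol):
    """Return the Crippen LogP estimate, or NaN if mol is None (assumed fallback)."""
    if mol is None:
        return float("nan")
    return Crippen.MolLogP(mol)


ethanol = Chem.MolFromSmiles("CCO")
hexane = Chem.MolFromSmiles("CCCCCC")
print("ethanol:", crippen_logp(ethanol))
print("hexane:", crippen_logp(hexane))
print("missing:", math.isnan(crippen_logp(None)))
```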
Exercise 5: Reading the Rock (Structure vs Solubility)
Goal: Build intuition about how molecular properties relate to solubility.
- Plot experimental solubility vs:
- Molecular mass
- Fingerprint density
- LogP
- Plot histograms of:
- MM
- FPM
- LogP
- Experimental solubility
- In your Jupyter notebook, answer:
- Which descriptor appears most related to solubility?
- Which relationships are noisy or weak?
- What molecular features might matter that we didn’t encode?
There is no single correct answer. Clear reasoning matters more than conclusions.
Success check:
- You can describe at least one visible trend.
- You can explain why that trend makes chemical sense.
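One way to lay out the plots is sketched below: scatter plots on the top row and histograms on the bottom row. The helper name plot_descriptors, the target column name "expt", and the demo numbers are all placeholders for illustration. Use your own dataframe and the name of its experimental-value column.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_descriptors(df, target, descriptors=("MM", "FPM", "LogP")):
    """Scatter each descriptor against the target, plus a histogram of each column."""
    n = len(descriptors)
    fig, axes = plt.subplots(2, n + 1, figsize=(4 * (n + 1), 7))
    for i, name in enumerate(descriptors):
        # Top row: descriptor on x, experimental value on y.
        axes[0, i].scatter(df[name], df[target], alpha=0.6)
        axes[0, i].set_xlabel(name)
        axes[0, i].set_ylabel(target)
        # Bottom row: distribution of the descriptor itself.
        axes[1, i].hist(df[name].dropna(), bins=30)
        axes[1, i].set_xlabel(name)
    # The last column only needs a histogram of the target.
    axes[0, n].axis("off")
    axes[1, n].hist(df[target].dropna(), bins=30)
    axes[1, n].set_xlabel(target)
    fig.tight_layout()
    return fig, axes


# Placeholder numbers so the sketch runs on its own; NOT real measurements.
demo = pd.DataFrame({
    "MM": [46.1, 78.1, 86.2, 100.2],
    "FPM": [0.13, 0.04, 0.06, 0.09],
    "LogP": [0.0, 1.7, 2.6, 1.2],
    "expt": [-1.0, -2.0, -3.0, -4.0],
})
fig, axes = plot_descriptors(demo, target="expt")
```

In a Jupyter or Colab notebook the figure is displayed when the cell finishes. Keeping all panels in one figure makes it easier to compare the strength of each trend side by side.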
Deliverables
Please submit the following two items:
A completed Jupyter notebook (.ipynb)
- The notebook should run top-to-bottom without errors.
- It should include your code and any brief comments you added while working.
File naming convention → lastname_firstname_RID_009_code.ipynb
A short logbook entry (plain text, ~5–10 sentences)
- Briefly describe:
- What was tricky or confusing
- What helped you get unstuck
- One thing you learned about working with real data
File naming convention → lastname_firstname_RID_009_logbook.txt
Submission
Submit your files by uploading them at this submission link: SUBMIT LINK
Please upload both:
- your .ipynb notebook
- your logbook file
Make sure filenames follow the naming conventions above. Good job completing this route!
🎉 Route Complete!
Great work!