First Ascent by
Ethan Truong
⚠️ Pending written proposal
The LVMOF Crystal Face
Materials Science / Metal-Organic Frameworks
The Proposed Route
A deep feature-engineering and ML pipeline predicting crystallinity of metal-organic frameworks (MOFs) from a massive feature matrix: Morgan fingerprints, RAC descriptors, ChemBERTa embeddings, TEP calculations, and more. An ordinal XGBoost classifier with SHAP interpretability identifies the molecular drivers of MOF crystalline quality.
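The source does not say how the ordinal classifier is built. One common construction, shown here purely as an assumption, is the Frank and Hall decomposition: train K-1 binary models for P(y > k) and difference them into class probabilities. In this minimal sketch, `FrankHallOrdinal` is a hypothetical helper, the data are synthetic, and scikit-learn's `LogisticRegression` stands in for XGBoost so the example stays self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class FrankHallOrdinal:
    """Hypothetical helper: ordinal classification via K-1 binary 'y > k' models."""

    def __init__(self, make_base):
        self.make_base = make_base  # zero-argument factory returning a fresh classifier

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.models_ = []
        for c in self.classes_[:-1]:
            model = self.make_base()
            model.fit(X, (y > c).astype(int))  # binary target: is the label above c?
            self.models_.append(model)
        return self

    def predict_proba(self, X):
        # Column j holds the estimate of P(y > classes_[j]).
        gt = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        n = gt.shape[0]
        upper = np.hstack([np.ones((n, 1)), gt])
        lower = np.hstack([gt, np.zeros((n, 1))])
        # Differences can go slightly negative when the binary models disagree,
        # so clip at zero and renormalise each row.
        p = np.clip(upper - lower, 0.0, None)
        return p / p.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# Synthetic ordinal labels (0 < 1 < 2) derived from a noisy latent score.
rng = np.random.default_rng(0)
X_ord = rng.normal(size=(200, 5))
latent = 2.0 * X_ord[:, 0] + 0.3 * rng.normal(size=200)
y_ord = np.digitize(latent, [-0.5, 0.5])

ordinal = FrankHallOrdinal(lambda: LogisticRegression(max_iter=1000)).fit(X_ord, y_ord)
proba_ord = ordinal.predict_proba(X_ord)
train_acc = float((ordinal.predict(X_ord) == y_ord).mean())
print(f"training accuracy: {train_acc:.2f}")
```

Swapping the factory for one that returns an XGBoost classifier would leave the rest unchanged, provided that classifier exposes `fit` and `predict_proba`.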
The Crux
64,739 features for ~756 samples (roughly 86 features per sample) is an extreme high-dimensional setting. Feature selection (VarianceThreshold + SelectKBest) is doing a lot of work, and its validity should be interrogated. The feature-name mismatch bug already encountered in the notebook (a 608-feature offset) suggests the pipeline is complex enough to hide other subtle errors.
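A name offset like the one described above usually comes from indexing the original name list with positions taken after a selection step. One way to guard against it, sketched here on synthetic data, is to carry the names through each selector's `get_support()` mask. The helper `selected_feature_names`, the `feat_i` names, and the toy matrix are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline


def selected_feature_names(selector_pipeline, feature_names):
    """Apply each fitted selector's boolean mask to the names, in order."""
    names = np.asarray(feature_names, dtype=object)
    for _, step in selector_pipeline.steps:
        names = names[step.get_support()]
    return names


# Toy matrix: 12 columns, two of them constant, one made informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))
X[:, 3] = 1.0  # constant column, dropped by VarianceThreshold
X[:, 7] = 1.0  # constant column, dropped by VarianceThreshold
y = rng.integers(0, 2, size=60)
X[:, 5] += 3.0 * y  # shift one column by class so SelectKBest keeps it
names = [f"feat_{i}" for i in range(X.shape[1])]

selector = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),
    ("kbest", SelectKBest(score_func=f_classif, k=4)),
])
X_sel = selector.fit_transform(X, y)
kept = selected_feature_names(selector, names)

# Guard: every surviving column must equal the original column it is named after.
for j, name in enumerate(kept):
    assert np.array_equal(X_sel[:, j], X[:, names.index(name)])
print(list(kept))
```

The final loop is the useful part: a column-by-column equality check fails loudly as soon as names and data drift apart, before any misaligned name reaches the SHAP plots.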
⚠️ Pre-Climb Checklist
- ⚠️ Written proposal not yet submitted (required).
- ✅ Notebook is clearly advanced and well-structured.
- ⚠️ Validate that the feature selection pipeline is not inadvertently leaking test set information.
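One quick way to see what selection leakage looks like is to run both workflows on pure noise, where any accuracy above chance has to come from the procedure rather than the data. The sketch below assumes synthetic Gaussian features, balanced binary labels, and a `LogisticRegression` stand-in for the real classifier. None of this comes from the notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_features, k = 100, 5000, 20
X_noise = rng.normal(size=(n_samples, n_features))  # no real signal anywhere
y_noise = np.repeat([0, 1], n_samples // 2)         # balanced labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are chosen using every label, including those of the
# samples that later serve as held-out folds.
X_leaky = SelectKBest(f_classif, k=k).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=cv
).mean()

# Honest: selection sits inside the pipeline, so it is refit on each
# training fold and never sees the held-out labels.
honest = make_pipeline(SelectKBest(f_classif, k=k), LogisticRegression(max_iter=1000))
honest_acc = cross_val_score(honest, X_noise, y_noise, cv=cv).mean()

print(f"leaky CV accuracy:  {leaky_acc:.2f}")
print(f"honest CV accuracy: {honest_acc:.2f}")
```

Running the same comparison on the real feature matrix with shuffled labels gives a comparable check: if the notebook's workflow scores well above chance on shuffled labels, selection is seeing the test data somewhere.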
Guidance
- Submit written proposal for formal approval
- With 64K features / 756 samples: overfitting and data leakage are real risks
- SelectKBest must be fit only on training data; report test set performance separately
- SHAP analysis is the key deliverable; top features must make chemical sense for MOF crystallinity
- Report AUROC and AUPRC (area under the ROC and precision-recall curves) to evaluate model performance
- Consider contrastive learning to separate positives/negatives before the classification head; see R030 for an example
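The guidance on fitting selection only on training data and on reporting AUROC and AUPRC can be sketched together. In this minimal example both selectors live inside a scikit-learn `Pipeline`, so each cross-validation fold refits them on its own training split. The metrics are then computed from out-of-fold probabilities. The synthetic three-class data, `k=50`, and the `LogisticRegression` stand-in for the XGBoost classifier are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import label_binarize

# Synthetic stand-in for the MOF feature matrix: 3 classes, mostly noise features.
X, y = make_classification(
    n_samples=300, n_features=500, n_informative=10,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

model = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),
    ("kbest", SelectKBest(score_func=f_classif, k=50)),
    ("clf", LogisticRegression(max_iter=2000)),  # assumption: swap in the real classifier
])

# Out-of-fold probabilities: every row is scored by a model that never saw it,
# and the selectors are refit on each training split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")

classes = np.unique(y)
# Macro-averaged one-vs-rest AUROC across the classes.
auroc = roc_auc_score(y, proba, multi_class="ovr", average="macro", labels=classes)
# Macro-averaged AUPRC (average precision) on one-hot encoded labels.
auprc = average_precision_score(label_binarize(y, classes=classes), proba, average="macro")

print(f"macro OvR AUROC: {auroc:.3f}")
print(f"macro AUPRC:     {auprc:.3f}")
```

Macro one-vs-rest averaging treats the classes as unordered. If the ordinal structure matters for the report, per-threshold curves (for example "at least class k" versus "below class k") may be more informative.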
Source proposal: Ethan_Truong_LVMOF_Surrogate.ipynb
CHEM 169/269 · Applied AI & Machine Learning for Biochemistry