CHEM 169/269 Climbing Gym

Proposal Template

1) Scientific Problem / Question
- [1-2 paragraphs on the exact question and domain area]

2) Why I Chose This Problem
- [Short motivation: research interest, personal curiosity, or AI/chatbot discovery]

3) Exact Dataset(s)
- Dataset name:
- Source link:
- Number of samples / structures / molecules:
- Key fields used:
- Why this dataset is feasible for the final timeline:

4) Computational Approach
- Core method(s):
- Baseline or comparison:
- Ambitious stretch goal (e.g., contrastive learning or advanced model extension):
- Expected outputs:

5) Final Route + Solution Plan
- What my custom route asks me to do in my own first ascent:
- What my own full solution will include:

6) Risks and Backup Plan
- Main risk:
- Backup dataset/method if needed:

Long Detailed Proposal Example

1) Scientific Problem / Question

I want to predict blood-brain barrier (BBB) permeability from small-molecule representations and compare simple baseline models against embedding-based models. This sits at the intersection of biochemistry and medicinal chemistry because BBB permeability is central to central nervous system (CNS) drug design.

2) Why I Chose This Problem

I am interested in drug discovery workflows and want to understand how molecular representation choices affect model behavior. I also explored this direction in earlier routes and want to develop it into a project I can reuse in my research portfolio.

3) Exact Dataset(s)

Dataset name: BBB permeability dataset (binary label: permeable/non-permeable)
Source link (example): MoleculeNet BBBP (DeepChem docs)
Size: ~2,000 compounds (exact count depends on cleaning/filtering)
Key fields: SMILES, permeability label
Feasibility: small enough to run in Colab within the exam-time window

4) Computational Approach

Baseline: logistic regression on Morgan fingerprints
Comparison: simple embedding-based classifier (or tree-based model)
Ambitious stretch goal: test a contrastive-learning-style setup (or another stronger representation-learning extension) and compare it against baseline performance
Evaluation: ROC-AUC, PR-AUC, confusion matrix
Expected outputs: model comparison table + one or two key plots
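
To make the evaluation line concrete, here is a minimal sketch of ROC-AUC and a confusion matrix computed by hand in plain Python. The labels and scores are made-up toy values, not BBBP results, and the function names are hypothetical. In a real notebook you would more likely call a library routine such as scikit-learn's roc_auc_score; the hand-rolled version is only meant as a sanity check on what the metric measures.

```python
from collections import Counter


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def confusion_matrix(labels, scores, threshold=0.5):
    """Return counts of TP, FP, TN, FN at a fixed decision threshold."""
    counts = Counter()
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return {key: counts[key] for key in ("tp", "fp", "tn", "fn")}


# Toy labels (1 = permeable) and made-up model scores, for illustration only.
toy_labels = [1, 1, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.7]

auc = roc_auc(toy_labels, toy_scores)
cm = confusion_matrix(toy_labels, toy_scores)
print(f"ROC-AUC: {auc:.3f}")
print(f"Confusion matrix counts: {cm}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a dataset of a few thousand compounds but is one reason to switch to a library implementation for anything larger.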

5) Final Route + Solution Plan

Route steps: data loading/cleaning, feature generation, baseline model, comparison model, interpretation
Final solution: complete notebook with figures, metrics, and short interpretation of failure modes

6) Risks and Backup Plan

Main risk: noisy labels or strong class imbalance
Backup: simplify to one robust baseline model and focus on interpretation + error analysis
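
A quick way to gauge the class-imbalance risk before modeling is to count the labels and note how well a trivial majority-class predictor would score. The sketch below uses made-up labels and a hypothetical helper name; it is an illustration of the check, not a statement about the actual dataset.

```python
from collections import Counter


def class_balance(labels):
    """Summarize label counts and the accuracy of always predicting the majority class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "majority_label": majority_label,
        "majority_baseline_accuracy": majority_count / len(labels),
    }


# Made-up labels for illustration (1 = permeable, 0 = non-permeable).
toy_labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
summary = class_balance(toy_labels)
print(summary)
```

If the majority-class accuracy is already high, plain accuracy is a weak metric for the project, which is one reason to report ROC-AUC and PR-AUC as well.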

Quick Example Ideas (Brief)

Use these as starting points. Keep your scope tight and dataset choices explicit.

1) NAD vs NADP Cofactor Preference Classifier

Build a protein-level classifier that predicts whether an enzyme prefers NAD or NADP from sequence-derived features or embeddings.

Stretch idea: evaluate generalization with a stricter split (for example, by sequence similarity cluster) and compare against a simple baseline.
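
One way to sketch the stricter split: assume each sequence already has a cluster label (for example, from a sequence-similarity clustering tool run beforehand) and keep whole clusters together so that no cluster appears in both train and test. The function name, the toy cluster labels, and the greedy fill strategy are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split item indices so that no cluster appears in both train and test.

    cluster_ids[i] is the precomputed cluster label of item i.
    Whole clusters are moved to the test set, in shuffled order, until the
    test set holds at least test_fraction of the items.
    """
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    clusters = sorted(members)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(clusters)

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cluster in clusters:
        if len(test) < target:
            test.extend(members[cluster])
        else:
            train.extend(members[cluster])
    return sorted(train), sorted(test)


# Hypothetical cluster labels for ten sequences.
toy_clusters = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5", "c5", "c6"]
train_idx, test_idx = cluster_split(toy_clusters, test_fraction=0.3)
print("train indices:", train_idx)
print("test indices:", test_idx)
```

Because clusters are assigned whole, the realized test fraction can overshoot the target when a large cluster is picked last; it is worth reporting the actual split sizes alongside the metrics.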

2) Protein Function Labeling from PLM Embeddings

Use pretrained protein embeddings to classify a functional label (for example, enzyme class or localization) and compare with non-embedding features.

Stretch idea: add contrastive-style representation tuning or a harder class-imbalance analysis.

3) Materials/MD Property Prediction Mini-Route

Predict a property from simulation or materials descriptors using one baseline and one stronger model, then interpret failure cases.

Stretch idea: include uncertainty estimates or stability-focused error analysis.
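
As one simple take on uncertainty, the sketch below puts a percentile bootstrap interval around a mean absolute error. This quantifies uncertainty in the reported metric rather than per-prediction uncertainty, which is a narrower reading of the stretch idea. The values, function name, and defaults are made up for illustration.

```python
import random


def bootstrap_mae_interval(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Return (MAE, lower, upper) using a percentile bootstrap over absolute errors."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    rng = random.Random(seed)
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    point = sum(errors) / n

    # Resample the errors with replacement and record the mean of each resample.
    resampled = []
    for _ in range(n_resamples):
        sample = [errors[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()

    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Made-up property values and predictions, for illustration only.
toy_true = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_pred = [1.5, 1.8, 4.0, 3.7, 5.5]
mae, low, high = bootstrap_mae_interval(toy_true, toy_pred)
print(f"MAE = {mae:.3f}, 95% bootstrap interval = [{low:.3f}, {high:.3f}]")
```

With only a handful of test points the interval will be wide, which is itself useful information when comparing a baseline against a stronger model.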

Keep proposals concrete and scoped. The most important part is naming the exact dataset(s) and showing the plan is feasible.