Proposal Template
1) Scientific Problem / Question
- [1-2 paragraphs on the exact question and domain area]
2) Why I Chose This Problem
- [Short motivation: research interest, personal curiosity, or AI/chatbot discovery]
3) Exact Dataset(s)
- Dataset name:
- Source link:
- Number of samples / structures / molecules:
- Key fields used:
- Why this dataset is feasible for the final timeline:
4) Computational Approach
- Core method(s):
- Baseline or comparison:
- Ambitious stretch goal (e.g., contrastive learning or advanced model extension):
- Expected outputs:
5) Final Route + Solution Plan
- What your custom route asks you to do in your own first ascent:
- What my own full solution will include:
6) Risks and Backup Plan
- Main risk:
- Backup dataset/method if needed:
Long Detailed Proposal Example
1) Scientific Problem / Question
I want to predict blood-brain barrier (BBB) permeability from small-molecule representations and compare simple baseline models against embedding-based models. This sits at the intersection of biochemistry and medicinal chemistry because BBB permeability is central to central nervous system (CNS) drug design.
2) Why I Chose This Problem
I am interested in drug discovery workflows and want to understand how molecular representation choices affect model behavior. I also explored this direction in earlier routes and want to push it into a project I can reuse in my research portfolio.
3) Exact Dataset(s)
- Dataset name: BBB permeability dataset (binary label: permeable/non-permeable)
- Source link (example): MoleculeNet BBBP (DeepChem docs)
- Size: ~2,000 compounds (exact count depends on cleaning/filtering)
- Key fields: SMILES, permeability label
- Feasibility: small enough to run in Colab within the exam-time window
4) Computational Approach
- Baseline: logistic regression on Morgan fingerprints
- Comparison: simple embedding-based classifier (or tree-based model)
- Ambitious stretch goal: test a contrastive-learning-style setup (or another stronger representation-learning extension) and compare it against the baseline
- Evaluation: ROC-AUC, PR-AUC, confusion matrix
- Expected outputs: model comparison table + one or two key plots
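To make the evaluation step above concrete, here is a minimal sketch of two of the listed metrics (ROC-AUC and the confusion matrix) written in plain Python. The labels and scores are made-up toy values, not BBBP results, and the function names are illustrative. In a real notebook you would more likely call scikit-learn's roc_auc_score and confusion_matrix; this version only shows what those numbers mean.

```python
# Minimal sketch: ROC-AUC and a confusion matrix computed by hand.
# ASSUMPTION: toy labels/scores below are invented for illustration only.


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score. Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_counts(labels, scores, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding the scores."""
    tn = fp = fn = tp = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy data: 1 = permeable, 0 = non-permeable; scores are predicted probabilities.
toy_labels = [1, 1, 1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.2, 0.3, 0.55]

auc = roc_auc(toy_labels, toy_scores)            # 13 of 15 pairs ranked correctly
counts = confusion_counts(toy_labels, toy_scores)  # (tn, fp, fn, tp) = (2, 1, 1, 4)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a dataset of roughly 2,000 compounds but is one reason to prefer a library implementation in the final notebook.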
5) Final Route + Solution Plan
- Route steps: data loading/cleaning, feature generation, baseline model, comparison model, interpretation
- Final solution: complete notebook with figures, metrics, and short interpretation of failure modes
6) Risks and Backup Plan
- Main risk: noisy labels or strong class imbalance
- Backup: simplify to one robust baseline model and focus on interpretation + error analysis
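A quick way to size up the class-imbalance risk named above is to compute the label fractions and the accuracy of a "always predict the majority class" model before training anything. The sketch below uses invented toy labels, and the helper name class_balance is hypothetical.

```python
# Minimal sketch: quantify class imbalance before modeling.
# ASSUMPTION: toy labels are invented; real labels come from your dataset.
from collections import Counter


def class_balance(labels):
    """Return (fraction per class, accuracy of always guessing the majority class)."""
    total = len(labels)
    if total == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    fractions = {label: count / total for label, count in counts.items()}
    majority_accuracy = max(counts.values()) / total
    return fractions, majority_accuracy


toy_labels = [1] * 15 + [0] * 5  # 15 positives, 5 negatives
fractions, majority_accuracy = class_balance(toy_labels)
```

If the majority-class accuracy is already high, plain accuracy is a weak metric for the project, which is one reason the example proposal reports ROC-AUC and PR-AUC.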
Quick Example Ideas (Brief)
Use these as starting points. Keep your scope tight and dataset choices explicit.
1) NAD vs NADP Cofactor Preference Classifier
Build a protein-level classifier that predicts whether an enzyme prefers NAD or NADP from sequence-derived features or embeddings.
Stretch idea: evaluate generalization with a stricter split (for example, by sequence similarity cluster) and compare against a simple baseline.
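The stricter split mentioned in the stretch idea can be sketched as follows: whole clusters go to either train or test, so no cluster contributes to both sides. This sketch assumes each sequence already has a cluster ID (for example from a clustering tool such as MMseqs2 or CD-HIT; that choice is an assumption, not part of the proposal). The function name and toy IDs are hypothetical. scikit-learn's GroupShuffleSplit offers the same idea as a library call.

```python
# Minimal sketch: split by cluster so similar sequences never straddle train/test.
# ASSUMPTION: cluster_ids were computed elsewhere; the values below are toy IDs.
import random


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with every cluster entirely on one side."""
    members = {}
    for idx, cid in enumerate(cluster_ids):
        members.setdefault(cid, []).append(idx)

    clusters = sorted(members)          # fixed order so the seed is reproducible
    random.Random(seed).shuffle(clusters)

    target = test_fraction * len(cluster_ids)
    train_idx, test_idx = [], []
    for cid in clusters:
        # Fill the test set with whole clusters until it reaches the target size.
        if len(test_idx) < target:
            test_idx.extend(members[cid])
        else:
            train_idx.extend(members[cid])
    return sorted(train_idx), sorted(test_idx)


toy_clusters = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5", "c5", "c6"]
train_idx, test_idx = cluster_split(toy_clusters, test_fraction=0.2, seed=0)
```

Because clusters are assigned whole, the test set can end up somewhat larger than the requested fraction; report the actual split sizes alongside the metrics.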
2) Protein Function Labeling from PLM Embeddings
Use pretrained protein embeddings to classify a functional label (for example, enzyme class or localization) and compare with non-embedding features.
Stretch idea: add contrastive-style representation tuning or a harder class-imbalance analysis.
3) Materials/MD Property Prediction Mini-Route
Predict a property from simulation or materials descriptors using one baseline and one stronger model, then interpret failure cases.
Stretch idea: include uncertainty estimates or stability-focused error analysis.
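One simple way to get the uncertainty estimates mentioned in the stretch idea is a bootstrap ensemble: refit the model on resampled data many times and report the spread of the predictions. The sketch below uses a one-descriptor least-squares line as a stand-in model and invented toy data. The function names are hypothetical, and a real project would swap in its own model and descriptors.

```python
# Minimal sketch: bootstrap-ensemble uncertainty for a property prediction.
# ASSUMPTION: a 1-D least-squares line stands in for the real model;
# the data are toy values (roughly y = 2x + 1 plus small noise).
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_predict(xs, ys, x_new, n_models=200, seed=0):
    """Return (mean, standard deviation) of predictions from bootstrap refits."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    while len(preds) < n_models:
        sample = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx = [xs[i] for i in sample]
        by = [ys[i] for i in sample]
        if len(set(bx)) < 2:
            continue  # degenerate resample: a line cannot be fit
        slope, intercept = fit_line(bx, by)
        preds.append(slope * x_new + intercept)
    return statistics.fmean(preds), statistics.stdev(preds)


noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1]
toy_x = [float(i) for i in range(10)]
toy_y = [2.0 * x + 1.0 + e for x, e in zip(toy_x, noise)]

mean_near, std_near = bootstrap_predict(toy_x, toy_y, x_new=4.5)   # inside the data range
mean_far, std_far = bootstrap_predict(toy_x, toy_y, x_new=20.0)    # extrapolation
```

Comparing the spread for an in-range input against an extrapolated one gives a quick check that the uncertainty grows where the model has less support, which ties directly into the failure-case interpretation.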
Keep proposals concrete and scoped. The most important part is naming the exact dataset(s) and showing the plan is feasible.