
DataFrames and Cleaning

Route ID: R001 • Wall: W01 • Released: Jan 1, 2026


Route: DataFrames & Cleaning

  • RouteID: 001
  • Wall: Getting Comfortable with Python
  • Routesetter: Sidd + Adrian
  • Date: 01/07/2026
  • Grade: 5.7

Introduction

This route introduces the core data-cleaning operations you will use in nearly every real ML or computational biology project.

Why this route exists: Raw biological datasets are messy. Before modeling, visualization, or interpretation, you must learn to inspect, clean, and reason about real data. A solid grasp of the core commands and functionality of the pandas library will serve you in nearly every project.

In this route you will:

  • load a CSV into a DataFrame
  • inspect the structure and detect issues
  • filter rows using conditions
  • select & rename columns
  • deduplicate rows
  • explore and handle missing values
  • ask and answer simple questions using cleaning techniques

These skills form the backbone of real-world data preparation.

Dataset

You will work with a curated subset of protein annotations from UniProt for Mycobacterium tuberculosis: the deadliest bacterium on Earth. Tuberculosis (TB) kills on the order of 1 million people a year and causes an enormous amount of suffering.

The best introductory book on TB is John Green’s (yes, the John Green of the novel/movie The Fault in Our Stars) latest: Everything Is Tuberculosis.

Each row corresponds to a protein and includes basic annotation fields such as:

  • protein name
  • gene name
  • organism
  • protein length
  • functional and subcellular annotations

Important: This dataset is intentionally messy. It contains duplicate rows and missing values by design, so you can practice real-world data cleaning operations. You are not expected to understand the biology in detail — treat annotation fields as structured text for now.

Exercises

Exercise 0. Dataset access (for today)

The dataset file is named: mtb_uniprot_subset.csv

Download it here → DOWNLOAD LINK

How to upload it in Google Colab:

  1. Download the CSV using the link above.
  2. Open your Colab notebook.
  3. Click the folder icon on the left sidebar.
  4. Click Upload and select mtb_uniprot_subset.csv.
  5. Load it in Python (next exercise).

Exercise 1. Load the dataset

  • Use pandas.read_csv() to load the CSV into a DataFrame.
  • Display the first 5 rows with .head().
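A minimal sketch of the load-and-peek step. In Colab you would read the uploaded mtb_uniprot_subset.csv directly; here a tiny inline sample (with made-up values) stands in for the file so the snippet runs anywhere:

```python
import io
import pandas as pd

# In Colab, with the file uploaded, you would simply do:
#     df = pd.read_csv("mtb_uniprot_subset.csv")
# A small inline stand-in (column names are illustrative):
csv_text = """Entry,Gene Names,Length
P9WQP3,katG,740
P9WQP3,katG,740
P9WNK5,esxA,95
"""
df = pd.read_csv(io.StringIO(csv_text))

# Show the first 5 rows (5 is .head()'s default)
print(df.head())
```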

Exercise 2. Inspect the DataFrame

  • Call .info() to check types and missing values.
  • Use .describe() (with include='all') to summarize.
  • Count unique values in selected columns.

Question: Do any column(s) contain missing values?
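A sketch of these inspection calls on a small stand-in table (the real dataset’s column names may differ):

```python
import pandas as pd

# Tiny stand-in DataFrame with one deliberately missing value
df = pd.DataFrame({
    "Entry": ["P9WQP3", "P9WNK5", "P9WIE5"],
    "Gene Names": ["katG", "esxA", None],   # one missing gene name
    "Length": [740, 95, 160],
})

df.info()                            # dtypes and non-null counts per column
print(df.describe(include="all"))    # summary stats for every column, not just numeric
print(df["Gene Names"].nunique())    # distinct gene names (NaN is excluded)
print(df.isna().sum())               # missing-value count per column
```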

Exercise 3. Filter rows

Examples:

  • filter rows where the protein length (number of amino acids) exceeds some threshold.
  • filter by category: for example, proteins in the reviewed category.
  • combine conditions → both conditions must hold at the same time.

Bonus: Use .query() for the same filters.
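The three filter styles might look like this on a toy table (the Length and Reviewed column names are assumptions about the real file):

```python
import pandas as pd

df = pd.DataFrame({
    "Entry": ["A", "B", "C", "D"],
    "Length": [740, 95, 160, 528],
    "Reviewed": ["reviewed", "unreviewed", "reviewed", "reviewed"],
})

# Boolean-mask filters
long_proteins = df[df["Length"] > 200]
reviewed_only = df[df["Reviewed"] == "reviewed"]
both = df[(df["Length"] > 200) & (df["Reviewed"] == "reviewed")]

# Bonus: the same combined filter with .query()
both_q = df.query("Length > 200 and Reviewed == 'reviewed'")
print(both)
```

Note that combined masks need `&` (not `and`) and parentheses around each condition.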

Exercise 4. Select and rename columns

  • Subset columns: df[["Entry", "Length"]]
  • Rename with .rename() → for example, to “UniProt ID” and “number-AAs”
  • Create new derived columns (optional)
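A sketch of selecting, renaming, and deriving columns. The kilodalton conversion is just one hypothetical derived column (roughly 110 Da per amino acid):

```python
import pandas as pd

df = pd.DataFrame({
    "Entry": ["P9WQP3", "P9WNK5"],
    "Length": [740, 95],
    "Organism": ["M. tuberculosis"] * 2,
})

# Subset to two columns (double brackets return a DataFrame)
subset = df[["Entry", "Length"]]

# Rename columns (new names are the examples from the exercise)
renamed = subset.rename(columns={"Entry": "UniProt ID", "Length": "number-AAs"})

# Optional derived column: approximate mass in kilodaltons
df["approx_kDa"] = df["Length"] * 110 / 1000
print(renamed.columns.tolist())
```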

Exercise 5. Deduplicate rows

  • Remove duplicate full rows
  • Remove duplicates based on specific columns (e.g., id)
  • Count duplicates before and after

Question: After deduplication, how many rows remain?
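A sketch of both deduplication modes on a toy table (Entry stands in for the id column):

```python
import pandas as pd

df = pd.DataFrame({
    "Entry": ["P9WQP3", "P9WQP3", "P9WNK5", "P9WNK5"],
    "Length": [740, 740, 95, 96],   # last two share an ID but differ in Length
})

print(df.duplicated().sum())                     # full-row duplicates before cleaning

full_dedup = df.drop_duplicates()                # drop exact duplicate rows only
id_dedup = df.drop_duplicates(subset=["Entry"])  # keep the first row per Entry
print(len(full_dedup), len(id_dedup))
```

Notice the two modes can give different row counts: deduplicating on Entry alone also collapses rows that disagree in other columns.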

Exercise 6. Missing values

  • Check missing value counts: .isna().sum()
  • Fill null values using:
    • .fillna()
    • median, mean, mode
    • forward-fill/backward-fill
  • Optionally, drop rows: .dropna()

Bonus: Compare different cleaning strategies.
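The main strategies side by side on a toy Length column with one gap:

```python
import pandas as pd

df = pd.DataFrame({
    "Entry": ["A", "B", "C", "D"],
    "Length": [740.0, None, 160.0, 528.0],
})

print(df["Length"].isna().sum())     # how many lengths are missing

# Three cleaning strategies — each returns a new object, df is unchanged
filled_median = df["Length"].fillna(df["Length"].median())  # impute the median
filled_ffill = df["Length"].ffill()                         # carry previous value forward
dropped = df.dropna(subset=["Length"])                      # or drop the gappy rows

print(filled_median.tolist())
```

Which strategy is right depends on the question: imputation keeps every row but invents values, while dropping keeps only real measurements.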

Exercise 7. Ask & answer your own question

Examples:

  • What is the distribution or histogram of lengths? What’s the average protein length? Standard deviation?
  • What’s the name of the biggest protein in the dataset?
  • What are the annotation scores? (You’ll need to read up on them in UniProt.) What is their distribution?

Students provide code + a brief reflection.
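As one possible starting point, the length questions reduce to a few one-liners (protein names and lengths below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Protein names": ["Catalase-peroxidase", "ESAT-6", "Ag85A"],
    "Length": [740, 95, 338],
})

print(df["Length"].mean())    # average protein length
print(df["Length"].std())     # standard deviation

# Name of the biggest protein: idxmax gives the row label of the max Length
print(df.loc[df["Length"].idxmax(), "Protein names"])
```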

Congrats, you sent the DataFrames and Cleaning route!!

Deliverables

Please submit the following two items:

1. A completed Jupyter notebook (.ipynb)

  • The notebook should run top-to-bottom without errors.

  • It should include your code and any brief comments you added while working.

  • Please follow this file naming convention → lastname_firstname_RID_001_code.ipynb

    • The RID stands for “Route ID”. This would be route #001.

    How to download from Google Colab:

  • In Colab, click File → Download → Download .ipynb

  • This will save the notebook to your computer.

2. A short logbook entry (plain text, ~5–10 sentences):

  • Briefly describe:
    • what was tricky or confusing
    • what helped you get unstuck
    • one thing you learned about working with real data
  • File naming convention → lastname_firstname_RID_001_logbook.txt
  • Focus on clarity and completeness.

Submission

For now, submit your files by uploading them to this Google Form:

[LINK HERE]

Please upload both:

  • your .ipynb notebook
  • your logbook file

Make sure filenames follow the naming conventions above.

We will fine-tune our submission system as the course moves along. Thank you for your patience as a valued member of the CHEM 169/269 Climbing Gym. ✌️✌️

🎉 Route Complete!

Great work!