
DataFrames and Cleaning

Route ID: R001 • Wall: W01 • Released: Jan 1, 2026


Route: DataFrames & Cleaning

  • RouteID: 001
  • Wall: Getting Comfortable with Python
  • Routesetter: Sidd + Adrian
  • Date: 01/07/2026
  • Grade: 5.7

Introduction

This route introduces the core data-cleaning operations you will use in nearly every real ML or computational biology project.

Why this route exists: Raw biological datasets are messy. Before modeling, visualization, or interpretation, you must learn to inspect, clean, and reason about real data. A solid grasp of the core commands and functionality of the pandas library will serve you in nearly every project.

In this route you will:

  • load a CSV into a DataFrame
  • inspect the structure and detect issues
  • filter rows using conditions
  • select & rename columns
  • deduplicate rows
  • explore and handle missing values
  • ask and answer simple questions using cleaning techniques

These skills form the backbone of real-world data preparation.

Dataset

You will work with a curated subset of protein annotations from UniProt for Mycobacterium tuberculosis: the deadliest bacterium on Earth. Tuberculosis (TB) kills on the order of 1 million people a year and causes an enormous amount of suffering.

The best introductory book on TB is John Green’s (yes, the John Green of the novel/movie The Fault in Our Stars) latest: Everything Is Tuberculosis.

Each row corresponds to a protein and includes basic annotation fields such as:

  • protein name
  • gene name
  • organism
  • protein length
  • functional and subcellular annotations

Important: This dataset is intentionally messy. It contains duplicate rows and missing values by design, so you can practice real-world data cleaning operations. You are not expected to understand the biology in detail — treat annotation fields as structured text for now.

Exercises

Exercise 0. Dataset access (for today)

The dataset file is named: mtb_uniprot_subset.csv

Download it here → DOWNLOAD LINK

How to upload it in Google Colab:

  1. Download the CSV using the link above.
  2. Open your Colab notebook.
  3. Click the folder icon on the left sidebar.
  4. Click Upload and select mtb_uniprot_subset.csv.
  5. Load it in Python (next exercise).

Exercise 1. Load the dataset

  • Use pandas.read_csv() to load the CSV into a DataFrame.
  • Display the first 5 rows with .head().
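A minimal sketch of the load-and-peek step. In Colab you would read the uploaded mtb_uniprot_subset.csv directly; here a tiny inline sample (with made-up values) stands in for the file so the snippet runs anywhere:

```python
import io
import pandas as pd

# In Colab, with the file uploaded, you would simply do:
#     df = pd.read_csv("mtb_uniprot_subset.csv")
# A small inline stand-in (column names are illustrative):
csv_text = """Entry,Gene Names,Length
P9WQP3,katG,740
P9WQP3,katG,740
P9WNK5,esxA,95
"""
df = pd.read_csv(io.StringIO(csv_text))

# Show the first 5 rows (5 is .head()'s default)
print(df.head())
```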

Exercise 2. Inspect the DataFrame

  • Call .info() to check types and missing values.
  • Use .describe() (with include='all') to summarize.
  • Count unique values in selected columns.

Question: Do any column(s) contain missing values?
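A sketch of these inspection calls on a small stand-in table (the real dataset’s column names may differ):

```python
import pandas as pd

# Tiny stand-in DataFrame with one deliberately missing value
df = pd.DataFrame({
    "Entry": ["P9WQP3", "P9WNK5", "P9WIE5"],
    "Gene Names": ["katG", "esxA", None],   # one missing gene name
    "Length": [740, 95, 160],
})

df.info()                            # dtypes and non-null counts per column
print(df.describe(include="all"))    # summary stats for every column, not just numeric
print(df["Gene Names"].nunique())    # distinct gene names (NaN is excluded)
print(df.isna().sum())               # missing-value count per column
```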

Exercise 3. Filter rows

Examples:

  • filter rows where the protein length (number of amino acids) exceeds some threshold.
  • filter by category: for example, proteins in the reviewed category.
  • combine conditions → both conditions must hold at the same time.

Bonus: Use .query() for the same filters.
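The three filter styles might look like this on a toy table (the Length and Reviewed column names are assumptions about the real file):

```python
import pandas as pd

df = pd.DataFrame({
    "Entry": ["A", "B", "C", "D"],
    "Length": [740, 95, 160, 528],
    "Reviewed": ["reviewed", "unreviewed", "reviewed", "reviewed"],
})

# Boolean-mask filters
long_proteins = df[df["Length"] > 200]
reviewed_only = df[df["Reviewed"] == "reviewed"]
both = df[(df["Length"] > 200) & (df["Reviewed"] == "reviewed")]

# Bonus: the same combined filter with .query()
both_q = df.query("Length > 200 and Reviewed == 'reviewed'")
print(both)
```

Note that combined masks need `&` (not `and`) and parentheses around each condition.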

Exercise 4. Select and rename columns

  • Subset columns: df[["Entry", "Length"]]
  • Rename with .rename() → for example, to “UniProt ID” and “number-AAs”
  • Create new derived columns (optional)
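A sketch of selecting, renaming, and deriving columns. The kilodalton conversion is just one hypothetical derived column (roughly 110 Da per amino acid):

```python
import pandas as pd

df = pd.DataFrame({
    "Entry": ["P9WQP3", "P9WNK5"],
    "Length": [740, 95],
    "Organism": ["M. tuberculosis"] * 2,
})

# Subset to two columns (double brackets return a DataFrame)
subset = df[["Entry", "Length"]]

# Rename columns (new names are the examples from the exercise)
renamed = subset.rename(columns={"Entry": "UniProt ID", "Length": "number-AAs"})

# Optional derived column: approximate mass in kilodaltons
df["approx_kDa"] = df["Length"] * 110 / 1000
print(renamed.columns.tolist())
```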

Exercise 5. Deduplicate rows

  • Remove duplicate full rows
  • Remove duplicates based on specific columns (e.g., id)
  • Count duplicates before and after

Question: After deduplication, how many rows remain?
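A sketch of both deduplication modes on a toy table (Entry stands in for the id column):

```python
import pandas as pd

df = pd.DataFrame({
    "Entry": ["P9WQP3", "P9WQP3", "P9WNK5", "P9WNK5"],
    "Length": [740, 740, 95, 96],   # last two share an ID but differ in Length
})

print(df.duplicated().sum())                     # full-row duplicates before cleaning

full_dedup = df.drop_duplicates()                # drop exact duplicate rows only
id_dedup = df.drop_duplicates(subset=["Entry"])  # keep the first row per Entry
print(len(full_dedup), len(id_dedup))
```

Notice the two modes can give different row counts: deduplicating on Entry alone also collapses rows that disagree in other columns.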

Exercise 6. Missing values

  • Check missing value counts: .isna().sum()
  • Fill null values using:
    • .fillna()
    • median, mean, mode
    • forward-fill/backward-fill
  • Optionally, drop rows: .dropna()

Bonus: Compare different cleaning strategies.
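The main strategies side by side on a toy Length column with one gap:

```python
import pandas as pd

df = pd.DataFrame({
    "Entry": ["A", "B", "C", "D"],
    "Length": [740.0, None, 160.0, 528.0],
})

print(df["Length"].isna().sum())     # how many lengths are missing

# Three cleaning strategies — each returns a new object, df is unchanged
filled_median = df["Length"].fillna(df["Length"].median())  # impute the median
filled_ffill = df["Length"].ffill()                         # carry previous value forward
dropped = df.dropna(subset=["Length"])                      # or drop the gappy rows

print(filled_median.tolist())
```

Which strategy is right depends on the question: imputation keeps every row but invents values, while dropping keeps only real measurements.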

Exercise 7. Ask & answer your own question

Examples:

  • What is the distribution or histogram of lengths? What’s the average protein length? Standard deviation?
  • What’s the name of the biggest protein in the dataset?
  • What are the annotation scores? (You’ll need to read up on them in UniProt.) What is their distribution?

Students provide code + a brief reflection.
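As one possible starting point, the length questions reduce to a few one-liners (protein names and lengths below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Protein names": ["Catalase-peroxidase", "ESAT-6", "Ag85A"],
    "Length": [740, 95, 338],
})

print(df["Length"].mean())    # average protein length
print(df["Length"].std())     # standard deviation

# Name of the biggest protein: idxmax gives the row label of the max Length
print(df.loc[df["Length"].idxmax(), "Protein names"])
```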

Congrats, you sent the DataFrames and Cleaning route!!

Deliverables

Please submit the following two items:

1. A completed Jupyter notebook (.ipynb)

  • The notebook should run top-to-bottom without errors.

  • It should include your code and any brief comments you added while working.

  • Please follow this file naming convention → lastname_firstname_RID_001_code.ipynb

    • The RID stands for “Route ID”. This would be route #001.

    How to download from Google Colab:

  • In Colab, click File → Download → Download .ipynb

  • This will save the notebook to your computer.

2. A short logbook entry (plain text, ~5–10 sentences):

  • Briefly describe:
    • what was tricky or confusing
    • what helped you get unstuck
    • one thing you learned about working with real data
  • File naming convention → lastname_firstname_RID_001_logbook.txt
  • Focus on clarity and completeness.

Submission

For now, submit your files by uploading them to this Google Form:

[LINK HERE]

Please upload both:

  • your .ipynb notebook
  • your logbook file

Make sure filenames follow the naming conventions above.

We will fine-tune our submission system as the course moves along. Thank you for your patience as a valued member of the CHEM 169/269 Climbing Gym. ✌️✌️

🎉 Route Complete!

Great work!