Function Junction

RouteID: 004
Wall: Getting Comfortable with Python
Grade: 5.8
Route-setter: Adrian
Time: ~45-60 minutes
Dataset: mtb_uniprot_FULL.csv

Why this route exists

In previous routes, you walked through data step-by-step. Now we level up.

In Applied AI, models often speak math, not English. A neural network typically can't work with the string "Cytoplasm" directly, but it can work with the number 1 (for "inside cell"). Likewise, your model often doesn't compute over a raw DNA sequence, but it can take the molecular weight as input.

In this route's exercises, we'll learn and practice writing functions. Here they act as our translators: custom tools you build to turn raw biological text into numerical features for your model. Functions aren't limited to that job, though; you can build one to transform data in any way you need.

What you'll build

You will write three custom functions and apply them to the M. tuberculosis proteome to create new columns:

The Estimator: Calculates protein mass.
The Classifier: Converts text to True/False.
The Parser: Extracts clean Gene IDs.

Exercise 0: The Knot Check (Setup & Syntax)

Goal: Before we climb, let's get our gear ready and practice the basic motion.

A. Clip In (Get Data)

Download mtb_uniprot_FULL.csv from the Google Drive link. → LINK
Upload it to your notebook (Colab: folder icon → upload icon).
Run this setup cell:

import pandas as pd
# Load the proteome
df = pd.read_csv("mtb_uniprot_FULL.csv")
print("Dataset loaded. Rows:", len(df))

B. The Anatomy of a Function

A function is a reusable machine. It needs three things:

def: The definition (the keyword def, then the function's name, its inputs in parentheses, and a colon).
Indentation: The code inside the function must be indented.
return: The result that comes out. (Printing is NOT returning!)

The Example (Read this):

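def my_function(condition):
    # This is indented
    if condition:
        # This is double indented (because of the if statement... more on that in other routes)
        print("Hello")
    # Back to single indentation: return hands the result out of the function
    return condition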
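To see why printing is not returning, compare the two hypothetical functions below (show_sum and give_sum are names made up for this sketch, not part of the route). One only prints; the other hands its result back so you can store it in a variable.

```python
def show_sum(a, b):
    # Prints the result to the screen, but hands nothing back
    print(a + b)

def give_sum(a, b):
    # Hands the result back to whoever called the function
    return a + b

printed = show_sum(2, 3)    # the sum appears on screen
returned = give_sum(2, 3)   # nothing appears on screen

print("Stored from show_sum:", printed)
print("Stored from give_sum:", returned)
```

A function with no return statement gives back None, so printed holds None while returned holds 5. The Belay Checks use assert on returned values, which is why your functions need return.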

Your Turn (Do this):

Write a function called double_it that takes a number x and returns x * 2.
The Belay Check:

assert double_it(10) == 20
print("Knot Check Passed! Ready to climb.")

Exercise 1: The Estimator (Simple Math)

Goal: Write a function estimate_mass(seq) that returns the approximate mass in kDa.

Step 1: Scout the Route (Look at the Data)

Before writing code, let's look at a real protein to know what we are dealing with.

# Look at the first protein in our dataframe
first_seq = df.loc[0, "Sequence"]
print("First sequence length:", len(first_seq))

Note: The first protein (Mycolipanoate synthase) is 2085 amino acids long.
Napkin Math: Average AA mass ≈ 110 Da. So, 2085 × 110 = 229,350 Da.
Convert to kDa: Divide by 1000 → ~229.35 kDa.
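The same napkin math can be written out in Python as a sanity check. This is only a sketch with the numbers typed in by hand; the variable names are chosen for illustration.

```python
n_residues = 2085        # length of the first protein, from the note above
avg_residue_mass = 110   # approximate average amino acid mass in Da

mass_da = n_residues * avg_residue_mass   # total mass in Daltons
mass_kda = mass_da / 1000                 # convert Da to kDa

print(mass_da, "Da")
print(mass_kda, "kDa")
```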

Step 2: Write the Function

Now, write a tool that does that math for you.

The Beta (Tools):

len(s) gives the length of string s.
pd.isna(x) is True when x is NaN (empty data). Check it before calling len() so missing values are handled safely.
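Here is a quick sketch of both tools on toy values (toy_seq and missing are made-up examples, not rows from the dataset). It assumes missing entries show up as NaN, which is how pandas usually represents empty cells after read_csv.

```python
import pandas as pd

toy_seq = "MKT"           # a made-up three-residue sequence
missing = float("nan")    # stands in for an empty cell

print(len(toy_seq))       # number of characters in the string
print(pd.isna(toy_seq))   # False: this is real data
print(pd.isna(missing))   # True: this is missing data
```

Calling len(missing) would raise a TypeError, because NaN is a float and not a string. That is why the safety check comes first.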


def estimate_mass(seq_string):
    """
    Input: seq_string (str) - e.g., "MVArg..."
    Output: float - mass in kDa
    """
    # Safety check: if data is missing, return 0.0
    if pd.isna(seq_string):
        return 0.0

    # Your code here:
    # 1. Get length of seq_string
    # 2. Multiply by 110 (Daltons)
    # 3. Divide by 1000 (to get kDa)
    # 4. RETURN the result

Step 3: The Belay Check (Verify with Real Data)

Does your function match our napkin math for the first protein?


# Test on the real first sequence we pulled earlier
mass = estimate_mass(first_seq)
print(f"Calculated mass: {mass} kDa")

# It should be 229.35. Let's assert it's close.
assert abs(mass - 229.35) < 0.1
print("Exercise 1 Safe!")

Exercise 2: The Classifier (Making Decisions)

Goal: Write a function is_membrane_protein(text_input) that checks a value from the "Subcellular location [CC]" column. It should return True if the text describes a membrane protein, False otherwise.

Step 1: Scout the Data

Check the location of the first protein:


print(df.loc[0, "Subcellular location [CC]"])

Output: "Cell membrane; Lipid-anchor..."
Result: This should be labeled True.

Step 2: Write the Function

The Beta (Tools):

Checking contents: Use if "word" in text:.
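Case sensitivity: "Membrane" is not the same as "membrane". Use .lower() to be safe.
Missing data: A NaN has no .lower() method, so check pd.isna(text_input) first, as in Exercise 1, and return False.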
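A small sketch of these tools on a made-up string (the text and the search word are chosen for illustration, so the exercise itself is still yours to solve):

```python
location = "Secreted; Cell Wall"   # made-up example text

exact_match = "wall" in location          # the text has a capital W
safe_match = "wall" in location.lower()   # compare in lowercase instead

print(exact_match)
print(safe_match)
```

The first check fails only because of the capital letter; lowercasing the text before searching removes that trap.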

Step 3: The Belay Check


# Test on the first protein (should be True)
assert is_membrane_protein(df.loc[0, "Subcellular location [CC]"]) == True

# Test on a known negative
assert is_membrane_protein("Cytoplasm") == False
print("Exercise 2 Safe!")

Exercise 3: The Parser (The Crux)

Goal: The Gene Names column is messy: "msl3 pks3 pks4 Rv1180/Rv1181". We need just the ID that starts with "Rv".

Step 1: Scout the Data

print(df.loc[0, "Gene Names"])

Output: "msl3 pks3 pks4 Rv1180/Rv1181"

Target: We want to extract "Rv1180/Rv1181", that is, the first word that starts with "Rv".

Step 2: Write the Function

The Beta (Tools):

.split(): "a b c".split() returns ['a', 'b', 'c']
.startswith(): "Rv123".startswith("Rv") returns True
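Both tools can be tried on a toy string before touching the real column. The gene string below is hypothetical and not taken from the dataset.

```python
toy_genes = "abcD xyz1 Rv9999"   # hypothetical gene string, not from the dataset

words = toy_genes.split()        # split on whitespace into a list of words
print(words)

flags = []
for w in words:
    # True only for words that begin with exactly "Rv"
    flags.append(w.startswith("Rv"))
    print(w, w.startswith("Rv"))

# startswith is case sensitive, so a lowercase "rv" does not match
lowercase_check = "rv9999".startswith("Rv")
print(lowercase_check)
```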

The Strategy:

First split the string into words, then loop through the list, check whether each word starts with "Rv", and return the first one that does.

def extract_rv_id(gene_string):
    if pd.isna(gene_string):
        return ""

    # Step 1: Split string into words
    words = gene_string.split()

    # Step 2: Loop through words
    for w in words:
        # Step 3: Check if starts with "Rv"
        if w.startswith("Rv"):
            return w  # Found it! Return immediately.

    return "" # Return empty if no Rv ID found

Step 3: The Belay Check

# Test on the first protein
first_genes = df.loc[0, "Gene Names"]
assert extract_rv_id(first_genes).startswith("Rv")
print("Exercise 3 Safe!")

Exercise 4: The Send (Applying to Data)

Goal: Use your 3 tools on the full dataset (4,000+ proteins).

The Beta:

Use .apply(function_name) on a pandas column. Note: No parentheses () after the function name! You are handing pandas the function itself, and pandas calls it once for each value in the column. Do:


# 1. Mass Feature
df["mass_kda"] = df["Sequence"].apply(estimate_mass)

# 2. Membrane Label
df["is_membrane"] = df["Subcellular location [CC]"].apply(is_membrane_protein)


# 3. Clean ID
df["rv_id"] = df["Gene Names"].apply(extract_rv_id)

# INSPECT YOUR WORK
print(df[["rv_id", "mass_kda", "is_membrane"]].head())
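To see what .apply does on a small scale, here is a sketch with a toy DataFrame. The helper seq_length and the three toy rows are made up for illustration; they are not part of the route's dataset.

```python
import pandas as pd

def seq_length(seq):
    # Same safety pattern as in the exercises: handle missing data first
    if pd.isna(seq):
        return 0
    return len(seq)

# Toy column with two made-up sequences and one missing value
toy = pd.DataFrame({"Sequence": ["MKT", "MA", None]})

# Pass the function itself (no parentheses); pandas calls it on each value
toy["length"] = toy["Sequence"].apply(seq_length)
print(toy)
```

Writing .apply(seq_length()) instead would try to call the function right away with no input and raise a TypeError.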

Exercise 5: Anchor Challenge (Optional)

Grade: 5.10a
Goal: Create a "Cysteine Score" feature. Cysteine (C) is a special amino acid that helps stabilize proteins by forming disulphide bonds.

Write a function get_cys_count(seq) that counts "C"s in a string.
- Hint: Strings have a .count("X") method.
Apply it to make a cys_count column.
Sort (df.sort_values("cys_count", ascending=False)) to find the protein with the most Cysteines.
Logbook Question: What is that protein's name? Does it make sense biologically that it has so many cysteines?
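The two tools from the hints can be tried on toy data first. The word, the column names, and the scores below are made up for this sketch.

```python
import pandas as pd

# .count on a string: how many times does a character appear?
n_a = "BANANA".count("A")
print(n_a)

# sort_values on a toy table: largest score first
toy = pd.DataFrame({"name": ["p1", "p2", "p3"], "score": [2, 7, 4]})
ranked = toy.sort_values("score", ascending=False)
print(ranked.head(1))   # the top row after sorting
```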

Deliverables

Please submit the following two items:

1. A completed Jupyter notebook (.ipynb)

The notebook should run top-to-bottom without errors.
It should include your code and any brief comments you added while working.
Please follow this file naming convention → lastname_firstname_RID_004_code.ipynb
- The RID stands for "Route ID". This is route #004.

How to download from Google Colab:

In Colab, click File → Download → Download .ipynb
This will save the notebook to your computer.

2. A short logbook entry (plain text, ~5-10 sentences):

Briefly describe:
- what was tricky or confusing
- what helped you get unstuck
- one thing you learned about working with real data
File naming convention → lastname_firstname_RID_004_logbook.txt
Focus on clarity and completeness.

Submission

Submit your files by uploading them to this Google Form: SUBMIT LINK

Please upload both:

your .ipynb notebook
your logbook file

Make sure filenames follow the naming conventions above. We will fine-tune our submission system as the course moves along. Thank you for your patience as a valued member of the CHEM 169/269 Climbing Gym.