
Reading: Explore Codebases and Extract Data

Route ID: R038B • Wall: W11 • Released: Mar 9, 2026




Deliverable

Submit your exploration answers and extracted data. Your submission might include additional analysis, screenshots, or experiments — that's all great.

At minimum, your submission must have:

  1. Answers to the 5 codebase exploration questions (Exercise 2)
  2. The merged CSV from your data extraction challenge (Exercise 4)
  3. The prompts you used for both tasks

Mission checklist

  • Explored a real codebase with Claude Code
  • Answered exploration questions
  • Extracted and merged data from multiple files
  • Documented your prompts

Exercise 4: Data Extraction Challenge

Now apply everything to extract real data from messy files.

Option A: Practice dataset (standalone)

Download the practice files: sample_experiment_data.zip

This zip contains 10 JSON files with experiment results in nested format. Your task:

> I have experiment results in /path/to/extracted/folder.
> Each JSON file has: sample_id, metadata.date, results.measurement, results.error
> Please:
> 1. Find all JSON files
> 2. Extract sample_id, date, measurement, and error from each
> 3. Output a merged CSV sorted by date
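A prompt like this typically leads the agent to write and run a short script. A minimal sketch of what that script might look like, assuming the field names from the prompt above (the folder path is whatever you pass in):

```python
import csv
import glob
import json
import os

FIELDS = ["sample_id", "date", "measurement", "error"]

def extract_record(data):
    """Pull the four fields of interest out of one parsed JSON file."""
    return {
        "sample_id": data["sample_id"],
        "date": data["metadata"]["date"],
        "measurement": data["results"]["measurement"],
        "error": data["results"]["error"],
    }

def merge_to_csv(folder, out_path="merged.csv"):
    """Find all JSON files, flatten each, and write one CSV sorted by date."""
    rows = []
    for path in glob.glob(os.path.join(folder, "*.json")):
        with open(path) as f:
            rows.append(extract_record(json.load(f)))
    rows.sort(key=lambda r: r["date"])  # ISO-style dates sort correctly as strings
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Seeing the shape of the generated code makes it much easier to review what the agent actually did.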

Option B: AlphaFold3 outputs (if you've done Routes 37A/37B)

If you have AF3 results from the protein-protein interactions routes:

> I have AF3 results in /path/to/extracted/folder.
> Each job folder contains a summary_confidences_0.json file (ignore _1, _2, _3, _4).
> Please:
> 1. Find all the model 0 summary files
> 2. Extract iptm, ptm, and chain_pair_pae_min from each
> 3. Parse the pair ID from the folder name (look for pos_XX or dec_XX)
> 4. For pae_off_diag_avg, average the off-diagonal values [0][1] and [1][0]
> 5. Output a merged CSV with: pair_id, type, iptm, ptm, pae_off_diag_avg
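For reference, the agent's solution to the AF3 prompt might look roughly like this. This is a sketch under assumptions: that `chain_pair_pae_min` is a 2×2 matrix for a two-chain job, and that `pos`/`dec` in the folder name map to a "positive"/"decoy" type column (check against your own Route 37 naming):

```python
import csv
import glob
import json
import os
import re

def parse_pair_id(folder_name):
    """Pull e.g. 'pos_03' or 'dec_12' out of a job folder name (assumed naming scheme)."""
    m = re.search(r"(pos|dec)_(\d+)", folder_name)
    if m is None:
        return None, None
    return m.group(0), {"pos": "positive", "dec": "decoy"}[m.group(1)]

def summarize(data, pair_id, pair_type):
    """Flatten one summary_confidences_0.json into a CSV row."""
    pae = data["chain_pair_pae_min"]  # assumed 2x2 for a two-chain prediction
    return {
        "pair_id": pair_id,
        "type": pair_type,
        "iptm": data["iptm"],
        "ptm": data["ptm"],
        "pae_off_diag_avg": (pae[0][1] + pae[1][0]) / 2,
    }

def merge_af3(root, out_path="af3_merged.csv"):
    """Collect all model-0 summaries under root into one CSV."""
    rows = []
    for path in glob.glob(os.path.join(root, "*", "summary_confidences_0.json")):
        pair_id, pair_type = parse_pair_id(os.path.basename(os.path.dirname(path)))
        with open(path) as f:
            rows.append(summarize(json.load(f), pair_id, pair_type))
    with open(out_path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["pair_id", "type", "iptm", "ptm", "pae_off_diag_avg"])
        w.writeheader()
        w.writerows(rows)
    return rows
```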

Why this matters

What would take 30+ minutes of manual work becomes a 30-second conversation. The agent:

  • Finds the right files automatically
  • Handles nested JSON structures
  • Parses metadata from filenames
  • Outputs clean, analysis-ready data

Trust but verify: Spot-check a few rows manually to confirm the extraction worked correctly.
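One quick way to do that spot-check: reload a source file and compare it against its row in the merged CSV. A sketch assuming the Option A field layout (paths are placeholders for your own files):

```python
import csv
import json

def row_matches(csv_row, src):
    """Compare one CSV row (all strings) against its source JSON record."""
    return (
        csv_row["sample_id"] == src["sample_id"]
        and csv_row["date"] == src["metadata"]["date"]
        and float(csv_row["measurement"]) == src["results"]["measurement"]
        and float(csv_row["error"]) == src["results"]["error"]
    )

def spot_check(csv_path, json_path):
    """Confirm one source file's values made it into the merged CSV intact."""
    with open(json_path) as f:
        src = json.load(f)
    with open(csv_path, newline="") as f:
        rows = {row["sample_id"]: row for row in csv.DictReader(f)}
    return row_matches(rows[src["sample_id"]], src)
```

Note the `float(...)` conversions: CSV values always come back as strings, which is a common source of false alarms when spot-checking.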


Exercise 3: Data Extraction Patterns

Beyond navigation, agents excel at extracting data from messy file structures.

Pattern 1: Extract from multiple files

> I have JSON files in /path/to/data/. Each file has "sample_id",
> "measurement", and "timestamp". Extract these from all files
> and create a single CSV.
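The script behind this pattern can be very short. A minimal sketch assuming the flat field names in the prompt above:

```python
import csv
import glob
import json

FIELDS = ["sample_id", "measurement", "timestamp"]

def pick(data):
    """Keep only the fields we want from one parsed JSON object."""
    return {k: data[k] for k in FIELDS}

def collect(folder, out_path="combined.csv"):
    """Read every JSON file in folder and write one combined CSV."""
    rows = [pick(json.load(open(path))) for path in glob.glob(f"{folder}/*.json")]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```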

Pattern 2: Parse filenames for metadata

Often the filename itself contains important info:

> The filenames are like "exp_2024_03_15_sample_A_trial_1.json".
> Parse the date, sample name, and trial number from each filename
> and include them as columns in the output.
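Under the hood this is a regular expression with named groups. A sketch for the example filename format above (the exact pattern depends on how consistent your filenames really are):

```python
import re

# Matches names like "exp_2024_03_15_sample_A_trial_1.json"
FILENAME_RE = re.compile(
    r"exp_(?P<date>\d{4}_\d{2}_\d{2})_sample_(?P<sample>\w+?)_trial_(?P<trial>\d+)\.json"
)

def parse_filename(name):
    """Extract date, sample name, and trial number from a filename."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    return {
        "date": m.group("date").replace("_", "-"),  # normalize to 2024-03-15
        "sample": m.group("sample"),
        "trial": int(m.group("trial")),
    }
```

Raising on unexpected names (rather than silently skipping) is a deliberate choice: it surfaces files that don't follow the convention instead of quietly dropping data.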

Pattern 3: Navigate nested structures

> The JSON files have nested structure: results.scores.accuracy
> and results.scores.precision. Extract both values from each file.
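When some files are missing a nested key, hard-coded `data["results"]["scores"]["accuracy"]` chains raise errors. A small defensive helper (a common pattern, not a standard-library function) handles this:

```python
def dig(data, path, default=None):
    """Follow a dotted path like 'results.scores.accuracy' through nested dicts."""
    for key in path.split("."):
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data
```

For example, `dig(record, "results.scores.accuracy", default="NA")` returns `"NA"` instead of crashing when a file lacks that branch.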

Pattern 4: Filter while extracting

> Extract all rows where status is "completed" and score > 0.8.
> Skip any files that don't have a valid score field.
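The filtering logic the agent writes for a prompt like this usually boils down to one predicate per row. A sketch, assuming rows are dicts with the `status` and `score` fields named above:

```python
def keep(row):
    """Filter rule from the prompt: completed runs with score above 0.8."""
    score = row.get("score")
    if not isinstance(score, (int, float)):
        return False  # skip rows without a valid numeric score
    return row.get("status") == "completed" and score > 0.8
```

Applied with a comprehension such as `[r for r in rows if keep(r)]`, this also quietly satisfies the "skip invalid" requirement, since missing or non-numeric scores fail the type check.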

When to use agents for data work

Good for:

  • Extracting from many files
  • Parsing messy formats
  • Format conversions
  • Quick exploratory analysis

⚠️ Be careful with:

  • Complex statistical analysis (verify the approach)
  • Production pipelines (review the code)
  • Anything where subtle bugs would be hard to notice

Exercise 2: Exploration Challenge

Use Claude Code to answer these questions about your repository:

  1. Structure: "What are the main directories and what does each contain?"
  2. Entry point: "Where does the code start executing? What's the main file?"
  3. Dependencies: "What external libraries does this project use?"
  4. Specific search: "Find all functions that handle [topic relevant to the repo]"
  5. Understanding: "Explain what [specific file] does in simple terms"

Tips for better exploration

Be specific about what you're looking for:

> Find all functions that make HTTP requests
> Which files import the database module?
> Show me the error handling patterns in this codebase

Ask for explanations at the right level:

> Explain this like I'm new to Python
> Give me a technical deep-dive on how the caching works
> Summarize in 2-3 sentences

Record your prompts and answers — you'll submit these.


Exercise 1: Understanding Project Structure

When you open a new codebase, the first question is: "What's going on here?"

Start with the big picture

> Give me an overview of this project. What does it do and how is it organized?

The agent will:

  1. Read the README
  2. Look at the directory structure
  3. Scan key files
  4. Give you a coherent summary

Drill down

> What's in the src/ directory?
> Explain the purpose of each file in src/
> What are the main entry points?

Search for specific things

> Find all Python files that contain the word "database"
> Where is the configuration file?
> Show me all test files

Understand code

> Read src/utils.py and explain what each function does
> How does the authentication flow work?
> What design patterns does this codebase use?

This is dramatically faster than manually opening files and trying to understand them.


Exercise 0: Setup

Prerequisites

  • Claude Code installed and working — complete R038A first
  • Git — check with git --version

Clone a test repository

Pick something to explore:

# Option A: A popular Python library
git clone https://github.com/psf/requests
cd requests

# Option B: A web framework
git clone https://github.com/pallets/flask
cd flask

# Option C: Any project you're curious about
git clone https://github.com/[owner]/[repo]
cd [repo]

Start Claude Code

claude

You're now ready to explore.

What you'll learn

This route covers two core "reading" skills:

  1. Codebase navigation — understanding project structure, finding files, searching code
  2. Data extraction — pulling structured data from messy file collections

Both use the same underlying ability: directing the agent to read, interpret, and summarize information.


Intro

Coding agents are exceptionally good at reading — understanding codebases, finding information, and extracting data from messy files.

Instead of manually opening files, grepping for patterns, and piecing together how things work, you can just ask:

"What does this project do and how is it organized?"

"Find all the JSON files and extract the scores into a CSV."

The agent reads files, understands structure, and gives you coherent answers or clean data.

Real-world application: In Routes 37A/37B, you download messy AlphaFold3 outputs with dozens of files per job. Instead of manually hunting for the right JSON files, you can ask the agent: "Find all summary_confidences_0.json files and extract the ipTM scores." What takes 30 minutes manually becomes a 30-second conversation.

Note: These skills work with any coding agent (Cursor, Copilot, Aider) — not just Claude Code.


Route 038B: Reading — Explore and Extract

  • RouteID: 038B
  • Wall: AI-Assisted Coding (W11)
  • Grade: 5.9
  • Routesetter: Adrian
  • Time: ~40-50 minutes
  • You'll need: Claude Code installed (R038A), a repository to explore, practice data files

🧗 Base Camp

Start here and climb your way up!