🧗 Start Here
Scroll down to complete this route
Route: The UniProt Topo Guide
- RouteID: R013
- Wall: Protein Representations
- Grade: 5.7
- Routesetter: Sarah
- Time: ~25 minutes
- Dataset: UniProt Reference Proteomes
Why this route exists
When working with proteins at scale, you rarely start from a single sequence. Instead, you work with thousands of proteins drawn from an organism, a pathway, or an entire proteome.
UniProt is the primary map for this space. It organizes protein sequences, annotations, and metadata into a consistent, searchable framework that underlies most modern bioinformatics and applied ML workflows.
In this route, you'll learn how to navigate UniProt confidently: exploring reference proteomes, interpreting protein tables, inspecting individual entries, and downloading curated datasets for analysis.
What you'll be able to do after this route
By the end, you can:
- Navigate UniProtKB and the Reference Proteomes portal
- Explain the difference between reviewed and unreviewed entries
- Explore and customize UniProt protein tables
- Inspect individual protein entries and interpret annotations
- Download FASTA and TSV datasets from UniProt
- Load UniProt data into Python for basic exploration
Student background assumptions
You are expected to:
- Be familiar with protein sequences at a basic level
- Have no prior experience with UniProt or bioinformatics databases
- Be comfortable exploring websites and downloading files
- Have no prior knowledge of proteomes, annotation pipelines, or identifiers
This route is designed for beginners.
Key definitions (read once, then explore)
UniProtKB: The UniProt Knowledgebase. Contains protein sequences and annotations.
Reviewed (Swiss-Prot): Entries that have been manually curated by experts.
Unreviewed (TrEMBL): Automatically annotated entries that have not yet been manually reviewed.
Reference proteome: A representative, well-annotated proteome for a species, selected to serve as a standard reference.
Annotation score: A summary measure of how well-annotated a protein entry is.
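The reviewed/unreviewed distinction also shows up in the files you will download later. In UniProt FASTA files, each header line starts with a short database code: `sp` for Swiss-Prot (reviewed) and `tr` for TrEMBL (unreviewed). Here is a minimal Python sketch of how you might read that code. The function name `review_status` and both example headers are made up for illustration; they are not real UniProt entries.

```python
def review_status(fasta_header: str) -> str:
    """Classify a UniProt FASTA header line as reviewed or unreviewed."""
    # Headers look like ">db|ACCESSION|ENTRY_NAME description ...".
    db = fasta_header.lstrip(">").split("|", 1)[0]
    if db == "sp":
        return "reviewed (Swiss-Prot)"
    if db == "tr":
        return "unreviewed (TrEMBL)"
    return "unknown"


# Hypothetical header lines (placeholder accessions, not real entries).
examples = [
    ">sp|X00001|EXA1_ECOLI Example protein one OS=Escherichia coli (strain K12) OX=83333 GN=exaA PE=1 SV=1",
    ">tr|X00002|X00002_ECOLI Uncharacterized protein OS=Escherichia coli (strain K12) OX=83333 PE=4 SV=1",
]
for header in examples:
    accession = header.split("|")[1]
    print(accession, "->", review_status(header))
```

You do not need to run this yet. It is only meant to show that "reviewed" and "unreviewed" are labels you can read directly from the data.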
Exercise 0: The Knot Check (Orientation)
Goal: Get oriented within UniProt and understand what kind of resource it is.
- Open the UniProt website.
- Create a text file (Word, Docs, text, etc.)
- In this file you will write your answers to the questions posed in each exercise.
- PLEASE! Write the answers YOURSELF using YOUR own words!! Feel free to format your answers however you want!
- Title it "lastname_firstname_RID_013_text_submission"
QUESTIONS
- What is UniProt? Give examples of what is stored in the database.
Exercise 1: Explore the Reference Proteomes Portal
Goal: Understand what a reference proteome represents.
- Locate UniProtKB and the Reference Proteomes portal. (Red box - Species, Proteomes)
- Choose a well-known organism (e.g., E. coli K-12 or M. tuberculosis H37Rv) and click on its proteome ID.
- Record basic information:
- Proteome ID
- Proteome size
- Taxonomic lineage
- Any available quality metrics (e.g., BUSCO scores)
QUESTIONS
- Why do you think UniProt defines "reference" proteomes instead of treating all proteomes equally?
- What percentage of reference proteomes belong to Eukaryota, and how many belong to Bacteria? Is this what you expected? Why or why not?
Success check:
- You can explain what makes a proteome a "reference" proteome.
Exercise 2: View and Customize Protein Tables
Goal: Learn how to explore proteins at scale.
- Open the protein table for your chosen proteome. (Click on the blue number showing the entries in UniProtKB.)
- Use "Customize Columns" to include:
- Entry name
- Protein name
- Gene name
- Length
- Annotation score
- Subcellular location
- EC number
- Choose two more metrics.
- Sort and filter the table.
QUESTIONS
- What is the annotation score, and how does it vary across proteins in your proteome?
- Do all proteins have fully populated annotation tables? Give specific examples of information that is missing or incomplete.
- Which two metrics did you choose to add? What does each metric tell you about the protein?
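The sorting and filtering you just did in the browser can also be done in Python once the table is exported as a tab-separated file. The sketch below uses only the standard library and a tiny made-up table in place of a real export. The column headers (`Entry`, `Length`, `Annotation`, `EC number`) are assumptions, so check them against the header row of your own file. All rows are hypothetical.

```python
import csv
import io

# Hypothetical stand-in for an exported table.
# For a real file, use open(path, newline="") instead of io.StringIO.
SAMPLE_TSV = (
    "Entry\tProtein names\tLength\tAnnotation\tEC number\n"
    "X00001\tExample kinase\t310\t5.0\t2.7.1.1\n"
    "X00002\tUncharacterized protein\t95\t1.0\t\n"
    "X00003\tExample transporter\t452\t3.0\t\n"
)


def load_rows(handle):
    """Read a tab-separated table into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = load_rows(io.StringIO(SAMPLE_TSV))

# Sort: longest proteins first (cell values arrive as strings, so convert).
by_length = sorted(rows, key=lambda r: int(r["Length"]), reverse=True)

# Filter: entries with an annotation score of at least 3.
well_annotated = [r for r in rows if float(r["Annotation"]) >= 3.0]

# Missing data: entries whose EC number cell is empty.
missing_ec = [r["Entry"] for r in rows if not r["EC number"].strip()]

print([r["Entry"] for r in by_length])
print([r["Entry"] for r in well_annotated])
print(missing_ec)
```

Empty cells come through as empty strings, which is one way to spot the incomplete annotations asked about above.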
Exercise 3: Inspect Individual Protein Entries
Goal: Learn how to read a single UniProt entry.
- Choose 2 proteins from your table.
- Open their UniProtKB entry pages.
- Create a table in your text file and record key information:
- Entry Name
- Protein Name
- Function summary
- 3D structure (yes or no; if available, note the source, e.g., AlphaFold, cryo-EM)
- Sequence
QUESTIONS
- What information is easy to find on a UniProt entry page? What information takes more effort to locate?
- How does knowing a protein's Entry Name help you navigate the entry page more efficiently?
- How would you decide whether a protein is "well understood" based on its UniProt entry? Using this criterion, are the proteins you examined well understood?
Exercise 4: Download Proteome Data
Goal: Move from web exploration to data analysis.
- Move back to the protein table for your chosen proteome.
- Download the proteome FASTA file (uncompressed).
- Export a TSV table containing your selected fields (uncompressed).
- If your chosen organism was not E. coli, click here.
- Download the embeddings for the proteins in the E. coli proteome.
- Verify you downloaded "UP000000625_83333/per-protein.h5"
- Later routes will delve deeper into the wonderful world of protein embeddings, so no worries if you don't know what that means yet!
QUESTIONS
- What is located inside the proteome FASTA?
- What types of analyses become possible once the proteome data is loaded into Python?
- How confident do you feel navigating UniProt now (5 = very confident, 0 = not confident)?
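As a starting point for loading your download into Python, here is a minimal FASTA reader written with the standard library only. It is a sketch, not the only way to do this (libraries such as Biopython offer their own parsers). The sample records are hypothetical, and the helper name `read_fasta` is ours. To use it on your own file, replace `io.StringIO(SAMPLE_FASTA)` with `open("your_file.fasta")`.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequences may wrap over several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record file (placeholder accessions and sequences).
SAMPLE_FASTA = """>sp|X00001|EXA1_ECOLI Example protein one OS=Escherichia coli (strain K12) OX=83333 GN=exaA PE=1 SV=1
MKTAYIAKQR
QISFVKSHFS
>tr|X00002|X00002_ECOLI Uncharacterized protein OS=Escherichia coli (strain K12) OX=83333 PE=4 SV=1
MSDNE
"""

records = list(read_fasta(io.StringIO(SAMPLE_FASTA)))

# Map each accession (second "|"-separated field) to its sequence length.
lengths = {header.split("|")[1]: len(seq) for header, seq in records}

print(len(records), "records")
print(lengths)
```

From here, simple questions such as "how many proteins are there?" or "what is the length distribution?" become a few lines of code.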
Submission
Upload your text file and reflection via Google Forms:
We will continue building on these skills in later routes. Thank you for being a valued member of the CHEM 169/269 Climbing Gym.
🎉 Route Complete!
Great work!