🧗 Start Here
Scroll down to complete this route
Route 014: ESM Toy Sequences
- RouteID: 014
- Wall: Protein Representations (W05)
- Grade: 5.6
- Routesetter: Sarah
- Time: 25 minutes
- Dataset: A toy protein sequence (WT) and several single-point mutants, with precomputed ESM-2 outputs
Why this route exists
In this wall, you have begun working with proteins as sequences of letters and explored how biological information can be represented in structured, computational forms. At the most basic level, protein sequences can be treated as ordered lists of amino acids, where each position is handled independently.
Modern protein language models take a different approach. Instead of treating amino acids in isolation, they treat protein sequences as biological "sentences," where the meaning of each amino acid depends on its surrounding context. These models learn patterns across entire sequences, including long-range relationships that simple representations cannot capture.
ESM-2 is one such model. It uses a transformer architecture to convert a protein sequence into a set of numerical vectors called embeddings. These embeddings capture contextual information learned from large collections of protein sequences.
In this route, you will walk step-by-step through how a short protein sequence is transformed into ESM-2 embeddings. Rather than focusing on model training or performance, this route focuses on understanding the process by which a protein becomes numbers.
What you'll be able to do after this route
By the end of this route, you will be able to:
- Explain what an embedding is and why it is useful
- Describe how ESM-2 processes a protein sequence
- Understand why embeddings are produced per amino acid
- Explain how context changes amino acid representations
- Combine residue-level embeddings into a protein-level vector
Student background assumptions
You are expected to:
- Know what a protein sequence is
- Be comfortable working with short sequences in a Python notebook
- Have no prior experience with protein language models
- Have no prior knowledge of transformers or deep learning
This route emphasizes intuition over technical detail.
Key definitions (read once, then explore)
Embedding: A numerical vector that represents information learned from data.
Protein language model: A model trained on large collections of protein sequences to learn patterns in sequence organization.
Transformer: A neural network architecture that processes entire sequences at once and learns relationships between positions.
Self-attention: A mechanism that allows each amino acid to incorporate information from other amino acids in the sequence.
ESM-2: A transformer-based protein language model that produces contextual embeddings for protein sequences.
Background: How ESM-2 Is Trained (Masked Language Modeling)
ESM-2 is trained using masked language modeling, a self-supervised learning task. During training, the model is shown hundreds of millions of protein sequences. In each sequence, a subset of amino acids is randomly replaced with a special mask token.
The model is trained to predict the original amino acid at each masked position. To do this well, it has to use information from the rest of the sequence, including both nearby amino acids and residues that are far away along the chain.
By repeating this task across hundreds of millions of protein sequences, ESM-2 learns common patterns in protein sequences. These include which amino acids tend to appear together, which positions are strongly constrained, and which sequence patterns are typical of real, folded proteins.
Importantly, the model is not explicitly trained to predict protein structure or function. It is never given labels such as "this protein is an enzyme" or "this region forms an alpha helix." Instead, information about structure and function emerges as a byproduct of the training task. Accurately predicting missing amino acids requires the model to learn the underlying rules that govern protein sequences.
Once training is complete, the model can be used to generate embeddings for new protein sequences. These embeddings are high-dimensional vectors, where each dimension captures some aspect of the sequence patterns the model has learned. While no single dimension corresponds to a specific biological feature, together they encode information related to sequence context, structure, and function. In this sense, the ability of ESM-2 embeddings to reflect protein structure and function is a useful side effect of how the model was trained.
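To make the masking step concrete, here is a minimal sketch of what a masked-language-modeling input looks like. The sequence, mask fraction, and `<mask>` token are illustrative assumptions (ESM-2's actual tokenizer and masking scheme differ in detail); the point is simply that random positions are hidden and the model must recover them from context.

```python
import random

def mask_sequence(seq, frac=0.15, seed=0):
    """Randomly replace a fraction of residues with a mask token,
    mimicking the input the model sees during masked-language-model training."""
    rng = random.Random(seed)
    tokens = list(seq)
    n_mask = max(1, int(frac * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = "<mask>"
    return tokens

# A toy 16-residue sequence; the model's training objective is to
# predict the original amino acid at each "<mask>" position.
print(mask_sequence("MKTAYIAKQRQISFVK"))
```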
Exercise 0: The Knot Check (Inputs and Outputs)
Goal: Understand what goes into ESM and what comes out.
In this route, you will work with a short toy protein sequence (WT), several single-point mutants, and precomputed embeddings generated using ESM-2.
In the notebook:
- Load the FASTA file containing the sequences.
- Print the sequences and their lengths.
- Load the provided pickle file containing the ESM-2 embeddings.
- Inspect the keys stored in the pickle file.
- Match each sequence to its corresponding embeddings.
Hint:
A pickle file stores Python objects (such as dictionaries). You can load it using pickle.load.
Success check: You can load the FASTA and pickle files, identify each sequence and its length, and confirm that the pickle file contains two embeddings for each sequence.
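The loading steps above can be sketched as follows. The FASTA parser is a plain-Python version (Biopython's `SeqIO` would also work), and the sequences and filename are placeholders, not the actual route data:

```python
import pickle
from io import StringIO

def read_fasta(handle):
    """Parse FASTA records into a dict of {sequence_id: sequence}."""
    records, seq_id, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                records[seq_id] = "".join(chunks)
            seq_id, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        records[seq_id] = "".join(chunks)
    return records

# Hypothetical records standing in for the provided FASTA file.
toy_fasta = StringIO(">WT\nMKTAYIAK\n>M1_1\nAKTAYIAK\n")
seqs = read_fasta(toy_fasta)
for sid, seq in seqs.items():
    print(sid, len(seq))

# Loading the provided embeddings would then look like
# (filename is an assumption):
# with open("esm_embeddings.pkl", "rb") as f:
#     embeddings = pickle.load(f)
# print(embeddings.keys())
```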
Exercise 1: Residue-Level vs Protein-Level Embeddings
Goal: Understand the two levels at which ESM represents a protein.
ESM-2 produces an embedding for each amino acid in a sequence. These residue-level embeddings can then be combined (pooled) to produce a single embedding that represents the entire protein.
In the notebook:
- From the pickle file, extract the data associated with the WT sequence.
- Inspect the residue-level embedding array and print its shape.
- Inspect the protein-level embedding stored for the same sequence and print its shape.
- Answer the following questions:
- What are the dimensions of the residue-level and protein-level ESM embeddings?
- How does the residue-level embedding shape relate to the length of the protein sequence?
Hint:
Residue-level embeddings have shape (L, D), where L is the sequence length.
Protein-level embeddings have shape (D,).
Success check: You can explain the difference between residue-level and protein-level embeddings.
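A quick way to internalize the two shapes is to build stand-in arrays with NumPy. Here L=8 and D=320 are assumptions for illustration (320 matches the smallest ESM-2 checkpoint; the D in your pickle file depends on which checkpoint produced it):

```python
import numpy as np

rng = np.random.default_rng(0)
residue_emb = rng.normal(size=(8, 320))   # residue-level: (L, D), one row per amino acid
protein_emb = rng.normal(size=(320,))     # protein-level: (D,), one vector per protein

print(residue_emb.shape)  # (8, 320)
print(protein_emb.shape)  # (320,)
```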
Exercise 2: Mean Pooling as a Design Choice
Goal: Learn how protein-level embeddings are constructed from residue-level embeddings.
Protein-level embeddings are often created by averaging (mean pooling) residue-level embeddings across the sequence.
In the notebook:
- Take the residue-level embeddings for the WT sequence.
- Compute the mean across the sequence length to generate a protein-level embedding.
- Compare your result to the protein-level embedding stored in the pickle file.
- Answer the following questions:
- Why does ESM produce one embedding per amino acid instead of one per protein?
- Why might it be useful to also have a single embedding for the entire protein?
Hint:
Mean pooling averages over the first axis of the (L, D) array.
Success check: You can explain how a protein-level embedding is derived from residue-level embeddings.
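The pooling step is a one-liner. A minimal sketch with a toy array (dimensions assumed, as before); the commented `np.allclose` call shows how you might check your pooled vector against the one stored in the pickle file:

```python
import numpy as np

# Toy residue-level embeddings: L=8 residues, D=320 dimensions.
residue_emb = np.random.default_rng(0).normal(size=(8, 320))

# Mean pooling: average over axis 0 (the sequence-length axis)
# collapses (L, D) down to a single (D,) protein-level vector.
pooled = residue_emb.mean(axis=0)
print(pooled.shape)  # (320,)

# Comparing against the stored protein-level embedding:
# np.allclose(pooled, stored_protein_emb, atol=1e-5)
```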
Exercise 3: Introducing Mutations
Goal: Understand how small sequence changes affect embeddings.
In addition to the WT sequence, the pickle file contains embeddings for several mutated versions of the sequence. Each mutant differs from the WT by a single amino acid change at a different position. Each mutated sequence has a unique ID (e.g., M1_1, M2_2, M3_3) that matches the IDs used in the FASTA file.
In the notebook:
- List all sequence IDs stored in the pickle file.
- Separate the WT entry from the mutant entries.
- Confirm that each mutant differs from the WT at exactly one position.
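The confirmation step above can be done with a small helper. The WT and mutant sequences here are hypothetical placeholders for the ones in your FASTA file:

```python
def count_differences(a, b):
    """Count positions where two equal-length sequences differ."""
    assert len(a) == len(b), "sequences must be the same length"
    return sum(x != y for x, y in zip(a, b))

# Hypothetical WT and single-point mutants for illustration.
wt = "MKTAYIAK"
mutants = {"M1_1": "AKTAYIAK", "M2_2": "MQTAYIAK"}

for mut_id, seq in mutants.items():
    n = count_differences(wt, seq)
    print(mut_id, n)  # a true single-point mutant gives exactly 1
```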
Exercise 4: Interpreting Mutations in Embedding Space
Goal: Connect specific amino acid changes to changes in embeddings.
So far, you have compared protein embeddings numerically. In this exercise, you will connect those numerical differences back to the actual mutations in the sequences.
To compare embeddings, you will use a similarity score called cosine similarity. This score measures how similar two vectors are in direction, rather than their absolute size.
- A value close to 1 means two embeddings are very similar.
- Lower values mean the embeddings are more different.
In this setting, cosine similarity tells you how much a mutation changes the protein's embedding compared to the WT.
In the notebook:
- For each mutant sequence, identify:
- The position of the mutation
- The original amino acid
- The substituted amino acid
- Compute the cosine similarity between the WT embedding and each mutant embedding.
- Organize the results in a small table that includes:
- Sequence ID
- Mutation position
- Amino acid change
- Cosine similarity to WT
- Answer the following questions:
- Do mutations near the beginning, middle, or end of the sequence appear to affect the embedding differently?
- Do you observe any clear patterns, or do the differences appear small or noisy?
Important note: These sequences are intentionally short and artificial. It is possible that some mutations produce only very small changes in embedding space. That observation itself is meaningful.
Success check: You can relate at least one observed embedding difference (or lack of difference) to a specific amino acid mutation.
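Both steps, locating the mutation and scoring the embedding change, can be sketched as below. The sequences and embeddings are synthetic stand-ins (the mutant embedding is just the WT vector plus a small perturbation), so the similarity value is illustrative only:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity: the dot product of the two vectors
    divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def find_mutation(wt, mut):
    """Return (1-based position, WT residue, mutant residue)
    for a single-point mutant."""
    for i, (a, b) in enumerate(zip(wt, mut), start=1):
        if a != b:
            return i, a, b
    return None

# Hypothetical data standing in for the provided sequences/embeddings.
wt_seq, mut_seq = "MKTAYIAK", "MKTAYRAK"
rng = np.random.default_rng(0)
wt_emb = rng.normal(size=320)
mut_emb = wt_emb + 0.05 * rng.normal(size=320)  # small perturbation

pos, a, b = find_mutation(wt_seq, mut_seq)
sim = cosine_similarity(wt_emb, mut_emb)
print(f"{a}{pos}{b}: cosine similarity to WT = {sim:.4f}")
```

Collecting one such row per mutant (e.g. with `pandas.DataFrame`) gives the table described above.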
Exercise 5: Visualizing Embeddings in Two Dimensions
Goal: Build intuition for how embeddings relate to each other visually.
Protein embeddings are high-dimensional vectors, which makes them hard to visualize directly. A common approach is to project these vectors into two dimensions and plot them.
In this exercise, you will create a simple 2D visualization of the protein-level embeddings using PCA (Principal Component Analysis). You do not need to understand how PCA works yet. Your goal is to use it as a tool and interpret the resulting plot.
This plot is a simplified view of a much higher-dimensional space. Apparent distances and clusters should be interpreted qualitatively, not quantitatively.
In the notebook:
- Import PCA from scikit-learn: from sklearn.decomposition import PCA
- Collect the protein-level embeddings into a single array.
- Use PCA to project the embeddings into two dimensions.
- Create a scatter plot showing all the embeddings in two dimensions.
- Label each point by its sequence ID.
Hint: Each point in the plot represents one protein embedding. Points that are closer together correspond to more similar embeddings.
Success check: You can describe at least one feature of the plot and relate it back to specific mutations in the sequences.
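The steps above can be sketched as follows. The stacked embeddings are random stand-ins for the real ones, and the sequence IDs are the example IDs from Exercise 3, so the plot layout here is not meaningful; only the workflow is:

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Hypothetical stack of protein-level embeddings: 4 sequences x 320 dims.
ids = ["WT", "M1_1", "M2_2", "M3_3"]
X = np.random.default_rng(0).normal(size=(4, 320))

# Project to 2D: the two principal components are the directions
# of greatest variance among the embeddings.
coords = PCA(n_components=2).fit_transform(X)
print(coords.shape)  # (4, 2)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for (x, y), sid in zip(coords, ids):
    ax.annotate(sid, (x, y))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
```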
Deliverables
Please submit the following:
- A working Jupyter notebook (.ipynb)
- A brief written reflection (3-5 sentences) (.txt)
🎉 Route Complete!
Great work!