Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences

Overview
Questions:
  • How to load and configure a pre-trained language model for DNA sequence analysis?

  • What is the process for tokenizing DNA sequences to prepare them for model training?

  • How to split and organize DNA sequence dataset for effective model training and evaluation?

  • What are the key hyperparameters to consider when pretraining a language model on DNA sequences, and how to configure them?

  • How to use a trained language model to generate and interpret embeddings for DNA sequences?

Objectives:
  • Identify and load a pre-trained language model (LLM) suitable for DNA sequence analysis.

  • Explain the role of a tokenizer in converting DNA sequences into numerical tokens for model processing.

  • Prepare and tokenize DNA sequence datasets for model training and evaluation.

  • Configure and implement data collation to organize tokenized data into batches for efficient training.

  • Define and configure hyperparameters for pretraining a model, such as learning rate and batch size.

  • Monitor and evaluate the model’s performance during training to ensure effective learning.

  • Use the trained model to generate embeddings for DNA sequences and interpret these embeddings for downstream bioinformatics applications.

  • Develop a complete workflow for training a language model on DNA sequences, from data preparation to model evaluation, and apply it to real-world bioinformatics tasks.

Requirements:
Time estimation: 3 hours
Level: Intermediate
Supporting Materials:
Published: Apr 17, 2025
Last modification: Apr 17, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00520
Revision: 1
Best viewed in a Jupyter Notebook

This tutorial is best viewed in a Jupyter notebook! You can load this notebook in one of the following ways:

Run on the GTN with JupyterLite (in-browser computations)

  1. Click to Launch JupyterLite

Launching the notebook in Jupyter in Galaxy

  1. Instructions to Launch JupyterLab
  2. Open a Terminal in JupyterLab with File -> New -> Terminal
  3. Run wget https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-pretraining/statistics-genomic-llm-pretraining.ipynb
  4. Select the notebook that appears in the list of files on the left.

Downloading the notebook

  1. Right click one of these links: Jupyter Notebook (With Solutions), Jupyter Notebook (Without Solutions)
  2. Save Link As..

Generative Artificial Intelligence (AI) represents a cutting-edge domain within machine learning, focused on creating new, synthetic yet realistic data. This includes generating text, images, music, and even biological sequences. At the heart of many generative AI applications are Large Language Models (LLMs), which have revolutionized natural language processing and beyond.

LLMs are sophisticated neural networks trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on Transformers, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery.

Transformers are a type of neural network model designed to handle sequential data, such as text, by using self-attention mechanisms to weigh the importance of input elements relative to each other, enabling the model to understand and generate coherent and contextually relevant outputs.

In this tutorial, we will explore the intersection of generative AI and genomics by pretraining an LLM from scratch on DNA sequences. This process will equip the model with a foundational understanding of the “grammar” of DNA, enabling it to generate and analyze genetic data with remarkable accuracy.

Mistral AI, a French artificial intelligence (AI) startup, recently launched large language models (LLMs) showing performance superior to Llama 2. In particular, Mixtral-8x7B implements:

  • Grouped-Query Attention: Efficiently computes attention by grouping queries, reducing computational load and memory usage.
  • Sliding-Window Attention: Focuses on a fixed-size window of tokens, sliding over the sequence to manage long texts efficiently.
  • Byte-fallback BPE Tokenizer: Tokenizes text into subword units, falling back to byte-level tokenization for unknown words, ensuring robust handling of diverse text inputs.

These techniques collectively enhance the performance and efficiency of large language models, enabling them to process and generate text more effectively.

In this tutorial, we will use a simplified Mistral model architecture with fewer layers and hidden units to reduce computational requirements. The model will be trained to predict the next base in the sequence. For instance, for a sequence like ATTTGTTGGT, the model will be trained to predict the suffix TTGGT given the prefix ATTTG. This process is called causal language modeling.
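
To make the causal objective more concrete, here is a toy, pure-Python illustration of the prefix/next-symbol idea. It works letter by letter for readability; the actual model predicts the next token, which covers several bases:

# Toy illustration of causal language modeling on a single DNA sequence.
# The real model works on tokens (subword units), not single bases.
sequence = "ATTTGTTGGT"
for i in range(5, len(sequence)):
    prefix = sequence[:i]
    next_base = sequence[i]
    print(f"prefix: {prefix:<10} -> next base: {next_base}")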

To pretrain the model, we will use a file containing 100,000 non-overlapping DNA sequences of 200 bases, corresponding to around 1% of the human genome (hg38 assembly). This involves training the model to predict the end of a DNA sequence.

By the end of this tutorial, we will obtain a Mistral-DNA model with an internal representation of DNA sequence grammar. This pretrained model can then be used for various applications, such as fine-tuning for classification tasks or predicting mutational effects.

Agenda

In this tutorial, we will cover:

  1. Prepare resources
    1. Install dependencies
    2. Import Python libraries
    3. Check and configure available resources
  2. Prepare the model
    1. Load the model
    2. Choose the LLM architecture
  3. Prepare the tokenizer
  4. Prepare data
    1. Load data
    2. Tokenize data
    3. Split data
    4. Data Collation
  5. Train the model
    1. Define parameters for pretraining
    2. Pretrain the model
  6. Compute the embedding of a DNA sequence
  7. Conclusion

Prepare resources

To pretrain the model, let’s open a Notebook or a Python script.

Install dependencies

The first step is to install the required dependencies:

!pip install accelerate
!pip install datasets==3.0.1
!pip install transformers
!pip install torch
Question

What are the required dependencies doing?

  • accelerate: A library by Hugging Face – a platform that provides tools and resources for building, training, and deploying machine learning models – designed to simplify the process of training and deploying machine learning models across different hardware environments. It provides tools to optimize performance on GPUs, TPUs, and other accelerators, making it easier to scale models efficiently.

  • datasets: A library by Hugging Face for managing and processing datasets. It provides tools to load, manipulate, and share datasets in a standardized format, making it easier to work with machine learning data.

  • numpy: A fundamental package for scientific computing in Python.

  • torch: Also known as PyTorch, it is an open-source machine learning library developed by Facebook’s AI Research lab. It provides a flexible platform for building and training neural networks, with a focus on tensor computations and automatic differentiation.

  • transformers: A library by Hugging Face that provides implementations of state-of-the-art transformer models for natural language processing (NLP). It includes pre-trained models and tools for fine-tuning, making it easier to apply transformers to various NLP tasks.

These libraries are widely used in the machine learning and data science communities for their efficiency, flexibility, and extensive functionality.

Import Python libraries

Let’s now import them.

import os

import accelerate
import flash_attn
import torch
import transformers
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)
  • datasets:
    • load_dataset: function to load datasets from the Hugging Face Hub or local files.
  • transformers:
    • AutoConfig: Automatically loads the configuration for a pre-trained model. It defines the architecture and hyperparameters of the model.
    • AutoModelForCausalLM: Loads a pre-trained causal language model for tasks like text generation, where the model predicts the next token in a sequence.
    • AutoTokenizer: Loads the tokenizer associated with a pre-trained model. It converts text into tokens that the model can process.
    • DataCollatorForLanguageModeling: A data collator specifically designed for language modeling tasks. It prepares batches of data for training by handling padding and masking.
    • EarlyStoppingCallback: A callback used during training to stop the process early if the model’s performance on the validation set stops improving, saving time and resources.
    • Trainer: A high-level API for training and evaluating transformer models. It simplifies the training loop and handles tasks like gradient accumulation and evaluation.
    • TrainingArguments: A class to define the training configuration, including hyperparameters like learning rate, batch size, and number of epochs. It is used to configure the Trainer.

These components work together to streamline the process of training and fine-tuning transformer models for various NLP tasks.

Comment: Versions

This tutorial has been tested with following versions:

  • accelerate > 0.32.1
  • flash_attn > 2.6.0.post1 and 2.7.0.post2
  • transformers > 4.47.1

You can check the versions with:

accelerate.__version__
flash_attn.__version__
transformers.__version__

Check and configure available resources

To pretrain the model, we need specific resources:

  • Graphics Processing Unit (GPU): a specialized processor designed to handle complex graphical computations, often used for rendering images, videos, and accelerating machine learning tasks
  • Video Random Access Memory (VRAM): dedicated memory used by a GPU to store and process graphical data, enabling smooth rendering of images and videos

Let’s check the resources:

!nvidia-smi

The command nvidia-smi (NVIDIA System Management Interface) is used to monitor and manage NVIDIA GPU devices. It provides information about the GPU’s utilization, memory usage, temperature, and running processes. This tool is essential for developers and researchers to track the performance and health of GPUs, especially when running computationally intensive tasks like machine learning training.
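
As a cross-check from Python, we can also ask PyTorch directly what it sees (this uses only standard torch.cuda calls):

# Check whether a CUDA-capable GPU is visible to PyTorch
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    vram_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Total VRAM: {vram_gib:.1f} GiB")
else:
    print("No CUDA GPU detected; training would fall back to CPU (much slower).")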

Question

How do you interpret the following output?

Tue Mar 25 13:49:35 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • Driver Version: The version of the NVIDIA driver installed on the system (550.54.15).
  • CUDA Version: The version of CUDA installed, which is a parallel computing platform and API model created by NVIDIA (12.4).
  • GPU Name: The model of the GPU, in this case, a Tesla T4.
  • Persistence-M: Indicates whether Persistence Mode is enabled (Off in this case), which can improve performance for certain applications.
  • Bus-Id: The PCI bus ID of the GPU (00000000:00:04.0).
  • Fan: The speed of the GPU fan (N/A means not available or not reporting).
  • Temp: The current temperature of the GPU (40°C).
  • Perf: The performance state of the GPU (P8 indicates a low-power state).
  • Pwr:Usage/Cap: The current power usage (9W) and the power cap (70W).
  • Memory-Usage: The amount of GPU memory currently in use (2MiB) out of the total available (15360MiB).
  • GPU-Util: The percentage of GPU utilization (0% indicates the GPU is idle).
  • Compute M.: The compute mode of the GPU (Default).
  • Processes: Lists any processes currently using the GPU. In this case, there are no running processes.

Let’s configure PyTorch and the CUDA environment – software and hardware ecosystem provided by NVIDIA to enable parallel computing on GPU – to optimize GPU memory usage and performance:

  1. Enables CuDNN benchmarking in PyTorch:

     torch.backends.cudnn.benchmark=True
    
    Question
    1. What is CuDNN?
    2. Why enabling benchmarking?
    1. CuDNN is a GPU-accelerated library for deep neural networks.
    2. Enabling benchmarking allows CuDNN to select the fastest algorithms for the specific GPU and input size. This can improve the performance of the model, especially for fixed-size inputs.
  2. Set an environment variable that configures how PyTorch manages CUDA memory allocations

     os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
    
    Question

    What is this command doing?

    It sets the maximum split size for memory allocations to 32 megabytes. This can help reduce memory fragmentation and improve memory utilization, which is particularly useful when working with large models or limited GPU memory.

Prepare the model

Load the model

Let’s load now the model, Mistral-DNA. The Mixtral model (Mixtral-8x7B-v0.1) – a pretrained generative Sparse Mixture of Experts outperforming Llama 2 70B – was modified to significantly reduce the number of parameters mostly by removing layers, such that it could be trained on a GPU such as an RTX3090.

We will get the model from GitHub:

!git clone https://github.com/raphaelmourad/Mistral-DNA.git

Let’s check if we have the model now:

!ls

We should get two folders: Mistral-DNA and sample_data. Let’s change the current working directory to Mistral-DNA/:

os.chdir("Mistral-DNA/")

Choose the LLM architecture

Let’s look at the original architecture of Mixtral-8x7B-v0.1, which is stored in the data/models/Mixtral-8x7B-v0.1 folder (GitHub).

Question
  1. Which file is essential for configuring the language model?
  2. What are the key parameters of the simplified architecture used here?
  1. The config.json file is essential for configuring the language model as a Mistral model. It specifies the architecture for causal language modeling (MixtralForCausalLM) and details the size of the neural network components. The original Mistral model has a larger hidden size, but it is reduced here to make pre-training feasible.
  2. The key parameters are:
    • Intermediate Size (intermediate_size): Size of the intermediate (or hidden) layers within the model. It determines the number of neurons in these layers, influencing the model’s capacity to capture complex patterns in the data. A larger intermediate size can capture more nuanced details but also requires more computational resources. Set to 256, which is relatively small compared to the original model.
    • Number of Attention Heads (num_attention_heads): Number of attention heads in the multi-head attention mechanism. Each head allows the model to focus on different parts of the input sequence simultaneously, capturing diverse aspects of the data. More attention heads can provide a richer representation but also increase computational complexity. Reduced to 8 for efficiency.
    • Number of Experts per token (num_experts_per_tok): Specific to models that use a Mixture of Experts (MoE) architecture. It indicates the number of expert networks that are activated for each token in the input sequence. Experts are specialized sub-networks that handle different parts of the data, improving efficiency and performance, especially for large models. Set to 1 expert per token.
    • Number of Local Experts (num_local_experts): Number of local experts available in the model. Local experts are a subset of the total experts and are used to process specific parts of the input data. This localization can help in managing computational resources more effectively, especially when dealing with large-scale data. Set to 64.
    • Vocabulary Size (vocab_size): Specifically designed for DNA sequences, with a size of \(4,096 = 4^6\), as DNA consists of four possible letters (A, T, C, and G) and the words are 6-mers (sequences of six nucleotides). By modeling DNA using 6-mers, we capture meaningful patterns within the genetic sequence, enabling the model to understand and generate DNA data effectively.
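
As a quick check of the arithmetic behind this vocabulary size, the small sketch below enumerates all 6-mers over the four bases (this only illustrates the counting; the actual vocabulary is the one shipped with the tokenizer we load in the next section):

# Four bases combined into words of six letters give 4^6 = 4,096 possible 6-mers
from itertools import product

kmers = ["".join(p) for p in product("ACGT", repeat=6)]
print(len(kmers))   # 4096
print(kmers[:3])    # ['AAAAAA', 'AAAAAC', 'AAAAAG']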

Let’s load the configuration of the pre-trained model:

config = AutoConfig.from_pretrained("data/models/Mixtral-8x7B-v0.1")

By loading the configuration, we can inspect or modify the model’s architecture without loading the actual model weights. Let’s now initialize a causal language model from the loaded configuration object, with a specific attention implementation:

model = AutoModelForCausalLM.from_config(config, attn_implementation="eager")
Question

What does attn_implementation="eager" do?

attn_implementation="eager" specifies the attention implementation to use. Setting it to “eager” means that the attention mechanism will be executed eagerly, which can be useful for debugging or when working with dynamic computation graphs. Eager execution runs operations immediately as they are called in Python, rather than adding them to a graph for later execution.

What does the model look like?

model
MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(4096, 256)
    (layers): ModuleList(
      (0-7): 8 x MixtralDecoderLayer(
        (self_attn): MixtralAttention(
          (q_proj): Linear(in_features=256, out_features=256, bias=False)
          (k_proj): Linear(in_features=256, out_features=256, bias=False)
          (v_proj): Linear(in_features=256, out_features=256, bias=False)
          (o_proj): Linear(in_features=256, out_features=256, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear(in_features=256, out_features=64, bias=False)
          (experts): ModuleList(
            (0-63): 64 x MixtralBlockSparseTop2MLP(
              (w1): Linear(in_features=256, out_features=256, bias=False)
              (w2): Linear(in_features=256, out_features=256, bias=False)
              (w3): Linear(in_features=256, out_features=256, bias=False)
              (act_fn): SiLU()
            )
          )
        )
        (input_layernorm): MixtralRMSNorm((256,), eps=1e-05)
        (post_attention_layernorm): MixtralRMSNorm((256,), eps=1e-05)
      )
    )
    (norm): MixtralRMSNorm((256,), eps=1e-05)
  )
  (lm_head): Linear(in_features=256, out_features=4096, bias=False)
)

As expected, the model is a MixtralForCausalLM model with several key components:

  1. Embedding Layer (embed_tokens): Converts input DNA sequences into dense vectors of fixed size. It maps each of the 4,096 (\(4^{6}\)) possible DNA tokens (representing 6-mers) to a 256-dimensional vector space. This embedding layer is crucial for transforming discrete DNA sequences into a format suitable for neural network processing.

  2. Decoder Layers (layers): Consists of eight MixtralDecoderLayer modules, each containing several sub-components:
    • Self-Attention Mechanism (self_attn)

      Question
      1. What are the components?
      2. What is their purpose?
      1. The components are linear projections (q_proj, k_proj, v_proj, o_proj) for queries, keys, values, and outputs, along with a rotary embedding (rotary_emb) to incorporate positional information.
      2. They allow the model to weigh the importance of different tokens in the sequence relative to each other, capturing dependencies and context.
    • Sparse Mixture of Experts (block_sparse_moe):

      Question
      1. What are the components?
      2. What is their purpose?
      1. The components are gating mechanism (gate) and list of 64 expert networks (experts), each with multiple linear layers (w1, w2, w3) and an activation function (act_fn).
      2. This efficiently processes input data by activating only a subset of expert networks, reducing computational load while maintaining model capacity.
    • Layer Normalization (input_layernorm, post_attention_layernorm): Stabilizes and accelerates the training process by normalizing the inputs and outputs of the attention mechanism.

  3. Final Layer Normalization (norm): Applies normalization to the output of the final decoder layer, ensuring stable and consistent outputs.

  4. Language Model Head (lm_head): Projects the 256-dimensional output of the final decoder layer back into the 4,096-dimensional vocabulary space of DNA tokens. This linear layer (Linear) maps the hidden states to the original token space, enabling the model to predict the next DNA token accurately.

This architecture ensures that the model can capture complex patterns in DNA sequences while maintaining computational efficiency, making it suitable for tasks like DNA sequence generation and analysis. The model’s design culminates in the output of 4,096 tokens, aligning with the input dimension. This consistency is crucial for accurately predicting the next token in a given DNA sequence, ensuring that the model’s predictions are coherent and reliable.

Question

How many parameters does this model have?

pytorch_total_params = sum(p.numel() for p in model.parameters())
print(f"Model size: {pytorch_total_params/1000**2:.1f}M parameters")

There are 105 million parameters. It is a big model.
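
Most of these parameters sit in the 64 expert feed-forward networks repeated across the 8 decoder layers. To see where the parameters live, a small sketch like the following breaks the count down by top-level module (module names follow the architecture printed above):

# Break the parameter count down by top-level module
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params/1000**2:.1f}M parameters")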

Prepare the tokenizer

A tokenizer is a crucial component in natural language processing (NLP) that transforms raw text into a format that can be processed by machine learning models. In this section, we will load and configure the Byte-Pair Encoding (BPE) letter tokenizer. The BPE tokenizer efficiently handles rare and unknown words by breaking them down into frequent subword units, ensuring that the model can generalize better to unseen data. This process involves initializing the tokenizer with a predefined vocabulary and settings, enabling it to convert text into a format suitable for neural network processing. By doing so, we prepare the tokenizer to effectively manage DNA sequences, facilitating accurate and reliable model predictions.

Let’s load a pre-trained tokenizer from the Hugging Face Model Hub. The tokenizer is associated with the model DNABERT-2-117M, which is designed for processing DNA sequences.

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
Question

What does the above command do?

  • AutoTokenizer.from_pretrained automatically identifies and loads the appropriate tokenizer for the specified model.
  • trust_remote_code=True allows the loading of custom tokenizers that may include remote code execution. It is necessary when the tokenizer requires additional custom code to function correctly.

Let’s look at the created tokenizer now:

print(tokenizer)
PreTrainedTokenizerFast(name_or_path='zhihan1996/DNABERT-2-117M', vocab_size=4096, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

The PreTrainedTokenizerFast is a fast and efficient tokenizer used to process text data for the DNABERT-2-117M model. Here’s a breakdown of its configuration:

  • name_or_path='zhihan1996/DNABERT-2-117M': Specifies the name or path of the pre-trained tokenizer, indicating that it is associated with the DNABERT-2-117M model, which is designed for processing DNA sequences.

  • vocab_size=4096: Defines the size of the tokenizer’s vocabulary.

    Question

    Why is the size of the tokenizer’s vocabulary set to 4,096?

    It corresponds to the number of unique tokens (6-mers) that the model can recognize in DNA sequences.

  • special_tokens: Defines a set of special tokens used by the tokenizer:

    • unk_token: '[UNK]' - Represents unknown or out-of-vocabulary tokens.
    • sep_token: '[SEP]' - Used to separate segments within a sequence.
    • pad_token: '[PAD]' - Used for padding sequences to a uniform length.
    • cls_token: '[CLS]' - Typically used as the first token in a sequence to represent the classification token.
    • mask_token: '[MASK]' - Used in masked language modeling to hide tokens that the model must predict.
Question

What do the other configuration parameters mean?

  1. model_max_length=1000000000000000019884624838656
  2. is_fast=True
  3. padding_side='right'
  4. truncation_side='right'
  5. clean_up_tokenization_spaces=False
  6. added_tokens_decoder
  1. model_max_length=1000000000000000019884624838656: Represents the maximum length of sequences that the model can handle.

    This extremely large value suggests that the model is designed to process very long sequences, although in practice, the actual limit will be constrained by available computational resources.

  2. is_fast=True: Indicates that this tokenizer is optimized for speed, leveraging Rust-based implementations to accelerate tokenization processes.
  3. padding_side='right': Configures the tokenizer to pad sequences on the right side, ensuring that all sequences in a batch have the same length by adding padding tokens to the end of shorter sequences.
  4. truncation_side='right': Specifies that sequences will be truncated from the right side if they exceed the maximum length, preserving the beginning of the sequence.
  5. clean_up_tokenization_spaces=False: Indicates that the tokenizer will not remove spaces after tokenization, preserving the original spacing in the text.
  6. added_tokens_decoder: Maps token IDs to their corresponding AddedToken objects, which include metadata such as whether the token is a special token and how it should be processed (e.g., stripping whitespace).

This configuration ensures that the tokenizer is tailored to efficiently process DNA sequences, handling both the tokenization and padding/truncation of sequences in a manner that aligns with the model’s requirements.

By default, tokenizers may pad sequences on the right side (padding_side='right'). Let’s switch the padding direction to the left:

tokenizer.padding_side = "left"

When tokenizing a batch of sequences, shorter sequences will be padded with special tokens on the left to match the length of the longest sequence in the batch. This can be useful for ensuring consistent input sizes, especially in models that expect fixed-size inputs.
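
A quick way to see the effect is to tokenize a small batch of sequences of different lengths (the two sequences below are arbitrary examples):

# With padding_side="left", padding tokens (and the 0s in the attention mask)
# appear at the start of the shorter sequence
batch = tokenizer(["ATTGTGGGTCCCCGTAGATG", "ATT"], padding="longest", return_tensors="pt")
print(batch["input_ids"])
print(batch["attention_mask"])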

Let’s look at how some DNA sequences are encoded by the tokenizer. We start with a simple sequence “ATT”:

encoding = tokenizer("ATT", padding="longest", return_tensors="pt")
print(encoding)

The code tokenizes the DNA sequence “ATT”, pads it to the longest sequence in the batch (padding="longest"), and returns the result as PyTorch tensors (return_tensors="pt").

{'input_ids': tensor([[   1, 2061,    2]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}

Here’s a breakdown of each output component:

  • input_ids: A tensor containing the token IDs for the sequence. Each number corresponds to a specific token in the tokenizer’s vocabulary. In this case, [1, 2061, 2] represents the tokens for the sequence:
    • 1: the beginning of the sequence ([CLS])
    • 2061: the sequence itself (ATT)
    • 2: the end of the sequence, a separator between sequences ([SEP]).
  • token_type_ids: A tensor indicating the type of each token, often used in models that process multiple segments (e.g., question-answering). Here, all tokens are of type 0, suggesting a single segment.

  • attention_mask: A tensor that specifies which tokens should be attended to by the model (1 for real tokens, 0 for padding). In this case, all tokens are valid, so the mask is [1, 1, 1].

This encoded format is ready for input into a transformer model, ensuring that the sequence is correctly processed and understood by the model.
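
If you want to double-check which vocabulary entries these IDs correspond to, the tokenizer can map them back to strings:

# Map the token IDs back to their string form
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist()))
# expected output: something like ['[CLS]', 'ATT', '[SEP]']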

Question

What is the encoding for “ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC”? Specify that the tokenized sequence should have a maximum length of 5 tokens and ensure that the sequence is padded to the specified max_length of 5 tokens.

  • To specify that the tokenized sequence should have a maximum length of 5 tokens, you need to put max_length=5 – if the sequence is longer, it will be truncated –
  • To ensure that the sequence is padded to the specified max_length of 5 tokens, you need to add padding='max_length' – if the sequence is shorter, padding tokens will be added
encoding = tokenizer("ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC", max_length=5, padding='max_length', truncation=True, return_tensors="pt")
print(encoding)
{'input_ids': tensor([[   1, 2061,  281,  485,    2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In this case, [1, 2061, 281, 485, 2] represents the tokens for the sequence, likely including special tokens like [CLS] and [SEP]. As before, all tokens are of type 0, suggesting a single segment, and are valid, so the mask is [1, 1, 1, 1, 1].

Prepare data

We will now prepare the data.

Load data

First, we load the data. We will not use the whole human genome here because it comprises too many sequences. Instead, we use a small subset of the data, less than 1% of the sequences from the human genome.

Comment: Pre-trained model on the whole human genome

A compact DNA model with approximately 1 million parameters that has been trained on the entire human genome can be found on Hugging Face.

We use the load_dataset function from the datasets library. This function is commonly used for loading data for Hugging Face Transformers.

dataset_text = load_dataset("csv", data_files="data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz")
Question
  1. How is dataset_text structured?
  2. What are the first 5 sequences in the train dataset?
  3. How long are the sequences?
  1. dataset_text is a DatasetDict with a train Dataset containing 1 feature ('text') and 99,999 rows (shown by printing dataset_text)
  2. To get the first 5 sequences in the train dataset:

    dataset_text['train']['text'][0:5]
    
    ['TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAA',
    'CCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCC',
    'TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCT',
    'GAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGA',
    'CACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC']
    
  3. The sequences are 200 base pairs long:

    len(dataset_text['train']['text'][0])
    
    200
    

Tokenize data

Let’s tokenize the data. First, we create a function that tokenizes a text using the BPE letter tokenizer:

def tokenize_function(examples):
    return tokenizer(examples['text'], padding="longest", truncation=True, return_tensors="pt")
Question

What do the following parameters do?

  1. padding="longest"
  2. truncation=True
  3. return_tensors="pt"
  1. padding="longest" ensures that all sequences in the batch are padded to the length of the longest sequence, adding padding tokens as needed.
  2. truncation=True specifies that sequences exceeding the model’s maximum length will be truncated to fit.
  3. return_tensors="pt" indicates that the output should be in the form of PyTorch tensors, suitable for use with PyTorch-based models.

We can now apply this function to the loaded dataset:

dataset = dataset_text.map(tokenize_function, batched=True)

It is quite fast for the almost 100,000 sequences of length 200 bp.

Question
  1. How is dataset structured?
  2. What is in the first tokenized sequence of train Dataset?
  1. dataset is
    DatasetDict({
        train: Dataset({
            features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
            num_rows: 99999
        })
    })
    

    dataset is a DatasetDict with 1 train Dataset made of 99,999 rows and 4 features:

    • text: The original text data before tokenization.
    • input_ids: The tokenized input data, represented as numerical IDs.
    • token_type_ids: Indicates the type of each token, useful for models that handle multiple segments.
    • attention_mask: Specifies which tokens should be attended to by the model (1 for real tokens, 0 for padding).
  2. The first tokenized sequence of the train Dataset (dataset["train"][1]) is a dictionary with:
    • text: the 200 base pair sequence
    • input_ids: a list of 49 numerical values, the token IDs.
    • token_type_ids: a list of 49 zeros
    • attention_mask: a list of 7 zeros (padding) and 42 ones (real tokens)

Split data

We will now split data between training and validation sets randomly. This is a crucial step in machine learning to ensure the model can generalize to unseen data.

For that, 80% of the entire data will be used for the training set and the remaining 20% will go into the validation set. We first compute the size of training and validation sets:

train_size = int(0.8 * len(dataset["train"]))
val_size = len(dataset["train"]) - train_size
Question

How big are training and validation sets?

Training set has 79,999 sequences and the validation set 20,000.

To perform the actual splitting of the training dataset into two subsets, we use the torch.utils.data.random_split function from the PyTorch library that randomly splits a dataset into subsets.

train_set, val_set = torch.utils.data.random_split(dataset["train"], [train_size, val_size])
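
Because the split is random, two runs of the notebook will not place the same sequences in each subset. If reproducibility matters, a generator with a fixed seed can be passed (a minimal sketch; the seed value 42 is arbitrary):

# Optional: make the train/validation split reproducible across runs
generator = torch.Generator().manual_seed(42)
train_set, val_set = torch.utils.data.random_split(
    dataset["train"], [train_size, val_size], generator=generator
)
print(len(train_set), len(val_set))  # 79999 20000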

Data Collation

The DataCollatorForLanguageModeling is a utility class, designed to prepare and format batches of data for language modeling tasks. It handles the dynamic padding and masking of input sequences, ensuring that each batch fed into the model is correctly formatted and optimized for training.

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
Question

What are the different parameters?

  • tokenizer=tokenizer specifies the tokenizer to be used for processing the input data. The tokenizer converts raw text into numerical tokens that the model can understand.
  • mlm=False: Indicates that the data collator is set up for causal language modeling (CLM) rather than masked language modeling (MLM).

This will:

  1. Automatically pad sequences within a batch to ensure they are of equal length, which is necessary for efficient batch processing in neural networks.
  2. Generate attention masks that indicate which tokens should be attended to by the model, ignoring padding tokens.
  3. Collate individual examples into batches, handling the necessary formatting and ensuring compatibility with the model’s input requirements.

The DataCollatorForLanguageModeling is typically used in conjunction with a Trainer from the Hugging Face library. It simplifies the data preparation process, allowing you to focus on model training and evaluation without worrying about the intricacies of batch formatting.
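
To see what the collator actually produces, you can feed it a couple of tokenized examples by hand (a small sketch, assuming the dataset tokenized above; indices 0 and 1 are just the first two examples):

# Run the collator on two tokenized examples to inspect the resulting batch,
# keeping only the fields the collator needs
examples = [
    {"input_ids": dataset["train"][i]["input_ids"],
     "attention_mask": dataset["train"][i]["attention_mask"]}
    for i in range(2)
]
batch = data_collator(examples)
print(batch.keys())              # input_ids, attention_mask, labels
print(batch["input_ids"].shape)
# With mlm=False, labels are a copy of input_ids with padding positions set to -100,
# so padding is ignored by the loss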

Train the model

Define parameters for pretraining

We are now going to define the hyperparameters and configurations for training the language model using the Hugging Face transformers library.

First, we specify the batch size for training and evaluation. A batch size of 32 means that 32 samples will be processed before the model updates its weights. This size is chosen to balance computational efficiency and memory usage.

batchsize=32
training_args = TrainingArguments(
  output_dir="./results/models",
  evaluation_strategy="epoch",
  save_strategy="epoch",
  num_train_epochs=50,
  per_device_train_batch_size=batchsize,
  per_device_eval_batch_size=batchsize,
  learning_rate=5e-4,
  weight_decay=0.01,
  logging_dir="./logs",
  load_best_model_at_end=True,
  fp16=True,
  gradient_accumulation_steps=50,
  report_to="none",
)
  • output_dir="./results/models": directory where the training outputs, including model checkpoints and results, will be saved.
  • evaluation_strategy="epoch" indicates that the model’s performance will be evaluated at the end of each epoch, a complete pass through the entire training dataset. This allows for monitoring the model’s progress and adjusting the training process as needed.
  • save_strategy="epoch" specifies that the model will be saved at the end of each epoch. This ensures that checkpoints are available for each complete pass through the dataset.
  • num_train_epochs=50 sets the total number of training epochs to 50. This means the model will iterate over the entire dataset 50 times, allowing it to learn and optimize over multiple passes.
  • per_device_train_batch_size=batchsize and per_device_eval_batch_size=batchsize set the batch size for training and evaluation on each device (e.g., GPU) to 32. This ensures consistency in batch processing across different stages of training and evaluation.
  • learning_rate=5e-4 defines the learning rate for the optimizer, set to \(5 \times 10^{-4}\). This rate controls the step size during gradient descent and is a common choice for pre-training models.
  • weight_decay=0.01 applies L2 regularization to the model weights with a standard decay rate of 0.01. This helps prevent overfitting by penalizing large weights.
  • logging_dir="./logs" specifies the directory where training logs will be stored, allowing for monitoring and analysis of the training process.
  • load_best_model_at_end=True ensures that the best model, based on the lowest evaluation loss, is loaded at the end of training. This helps in selecting the model with the best performance across all epochs. During gradient descent, the model will be optimized, and at some point, the loss will start to increase again. We want to pick the model with the lowest loss, not when it starts increasing. So, “load best model at the end” means selecting the model with the best loss across all epochs.
  • fp16=True enables mixed-precision training using 16-bit floating-point numbers. This reduces memory usage and can speed up training on compatible hardware.
  • gradient_accumulation_steps=50 accumulates gradients over 50 steps before performing a backward pass. This effectively increases the batch size without requiring additional memory, helping to stabilize training.
  • report_to="none" disables Weights & Biases (WandB), a popular platform used for experiment tracking, dataset versioning, and model management in machine learning

    Comment: Why Disable WandB?

    Disabling WandB is often done in specific scenarios:

    • Avoiding Unwanted Logging: If we do not intend to use WandB for tracking our experiments or if we want to avoid potential conflicts with other logging mechanisms, we would disable it.
    • Reducing Overhead: WandB logging can introduce some overhead, particularly when dealing with large datasets or complex models. Disabling it can slightly improve performance if tracking is not essential.
    • Testing/Debugging: During testing or debugging, we might prefer to have more control over logging or we might want to avoid cluttering our WandB workspace with intermediate results.
Question

What is stored in training_args: the parameters of the model, the parameters of the LLM, or the parameters of the trainer function?

The parameters of the trainer function
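
One consequence of combining per_device_train_batch_size with gradient_accumulation_steps is the effective batch size. Here is a quick back-of-the-envelope check, using the variables defined above (the Trainer does this bookkeeping for you):

# Effective batch size and approximate optimizer steps per epoch
effective_batch = batchsize * training_args.gradient_accumulation_steps
steps_per_epoch = train_size // effective_batch
print(effective_batch)   # 32 * 50 = 1600 sequences per optimizer step
print(steps_per_epoch)   # about 50 steps per epoch with ~80,000 training sequences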

Pretrain the model

Here is the most important part: the pre-training process. For this, we will use a Trainer function. This function takes as input the model that we built previously, which has an architecture but only randomly initialized, untrained weights.

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_set,
    eval_dataset=val_set,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

The Trainer function also takes:

  • args: the training arguments we configured earlier
  • data_collator: the data collator function feeding the tokenized data sequences to the model.
  • train_dataset: the training set, i.e. the data used for computing the gradients
  • eval_dataset: the validation set, i.e. the data used to assess the prediction accuracy at each epoch. It’s important to use a validation set that is independent of the training set to ensure unbiased evaluation.
  • callbacks: EarlyStoppingCallback with a patience of three is used to monitor the training process.

    During training, we minimize the loss at each step. However, at some point, the loss may start to increase again. We want to capture the model parameters when the loss reaches its minimum. By using a patience of three, we aim to mitigate the effects of noise during training. Noise can cause fluctuations in the loss, making it seem like we’ve reached a local minimum when a better one might be found with further training.

    With a patience of three, even if we find a good minimum, we wait for three more epochs to ensure that the loss does not improve further. If the loss does not decrease for three consecutive epochs, we stop training. However, if a better model with a lower loss is found within those three epochs, training continues. This approach helps in finding a more robust local minimum by reducing the impact of noise in the training data.

Let’s launch the training with the trainer.train() method:

trainer.train()

Here, the trainer is set to run for 50 epochs. Once training starts, we get an estimate of the time per epoch, which gives an idea of the total training duration. Let’s run it for a bit to see how long it takes.

With this small model and dataset, the estimated time to run 50 epochs is about 20 hours (this value changes depending on the infrastructure).

Question

Will the model be trained to 50 epochs?

Setting the number of epochs to 50 doesn’t mean the model will train for all 50 epochs: it is likely to stop earlier.

The 50 epochs serve as a maximum limit. The model will stop training earlier if it reaches the minimum loss and then starts to increase again, thanks to the early stopping callback. This means the model might only require half the epochs, perhaps 25 epochs or 10 hours, to achieve optimal performance.

Comment: Don't train until the end

The idea here is not to train the model until completion, as it would take too much time.

Let’s stop the actual training and cheat a bit by loading a previously trained Mistral model:

model = AutoModelForCausalLM.from_pretrained("RaphaelMourad/Mistral-DNA-v1-17M-hg38")

This is a Mixtral-based model that was pre-trained on the entire human genome. It contains approximately 17 million parameters and was trained using the Human Genome assembly GRCh38. Unlike models pre-trained on sequences of 200 bases, this model was pre-trained on sequences of 10,000 bases (10 kb). The advantage of this model is its ability to process larger DNA contexts or sequences. This capability allows it to capture more extensive patterns and dependencies within the genomic data.
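
To verify the size of the model we just loaded, we can reuse the parameter-counting snippet from earlier; we expect roughly 17 million parameters:

pytorch_total_params = sum(p.numel() for p in model.parameters())
print(f"Model size: {pytorch_total_params/1000**2:.1f}M parameters")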

Question

By looking at the output of:

model
  1. How many transformer layers does this model have?
  2. Is it similar to the previous model?
  1. 8 transformer layers
  2. Yes

Compute the embedding of a DNA sequence

With this kind of model, we can convert a DNA sequence into a vector.

Let’s:

  1. Take a DNA sequence
  2. Tokenize the DNA sequence using the tokenizer created before
  3. Extract the tensor containing the token IDs from the tokenized output
  4. Pass the tokenized input through the model
  5. Extract the hidden states from the model’s output
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
tokenized_dna = tokenizer(dna, return_tensors = 'pt')
inputs = tokenized_dna["input_ids"]
model_outputs = model(inputs)
hidden_states = model_outputs[0]

The generated hidden states are the internal representations of the input sequence at different layers of the model. Here we look at the output of the last layer. It captures contextual information about the sequence and provides a richer representation than the raw nucleotide string, which can be used for tasks such as sequence similarity analysis, functional prediction, variant impact analysis, and more.

Question

What is the shape of hidden_states?

[1, 17, 4096]:

  • 1: number of sequences, here 1 DNA sequence
  • 17: number of tokens; here the DNA sequence has been split into 17 tokens, most of them longer than a single nucleotide
  • 4096: size of the vocabulary, the number of possible tokens

We now calculate the mean of the hidden states of the first (and only) sequence in the batch (hidden_states[0]), averaging across the sequence-length dimension:

embedding_mean = torch.mean(hidden_states[0], dim=0)

dim=0 indicates that the mean is calculated across the sequence length dimension. This effectively averages the hidden states for each token position in the sequence, resulting in a single vector that represents the entire sequence.

Question
  1. What is the shape of embedding_mean?
  2. Which type of data is in embedding_mean?
  1. 4096, the number of possible tokens.
  2. embedding_mean is a vector of numerical values.

embedding_mean is a numerical vector of size 4,096 that represents the average embedding of the DNA sequence. This fixed-size representation can be used for various downstream tasks, such as classification, clustering, or similarity comparisons.
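
As a hint of how such embeddings can be used downstream, here is a minimal sketch that compares two sequences by the cosine similarity of their mean embeddings. The embed helper and the second sequence are invented for illustration; they are not part of the tutorial code:

# Compare two DNA sequences via the cosine similarity of their mean embeddings
def embed(sequence):
    inputs = tokenizer(sequence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        outputs = model(inputs)
    # outputs[0][0] has shape (number of tokens, 4096); average over the tokens
    return torch.mean(outputs[0][0], dim=0)

emb1 = embed("ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC")
emb2 = embed("ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGT")
print(torch.nn.functional.cosine_similarity(emb1, emb2, dim=0).item())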

Hands On

Apply a max pooling instead of a mean pooling to summarize information along the DNA sequence.

embedding_max = torch.max(hidden_states[0], dim=0)[0]
Comment: Similar process to ChatGPT

When you use a system like ChatGPT, the process involves converting your textual input, or “prompt,” into a numerical vector. This conversion is similar to the process we just did. Here’s how it works:

  • Input Prompt: You write a prompt, which is a textual query or statement.
  • Tokenization: The prompt is tokenized, meaning it is broken down into smaller units, such as words or subwords, using a tokenizer.
  • Vector Representation: These tokens are then converted into numerical vectors, or embeddings. These vectors capture the semantic meaning and context of the words in the prompt.
  • Model Processing: The model processes these vectors to generate a response. The embeddings allow the model to understand the context and nuances of your input, enabling it to produce coherent and relevant responses.

This process of converting text into numerical vectors is fundamental to how language models like ChatGPT operate, enabling them to interpret and generate human-like text based on the input they receive.

Conclusion

This tutorial provides a comprehensive guide to preparing, training, and utilizing a pre-trained language model for DNA sequence analysis. It begins by setting up the necessary resources, including installing dependencies, importing Python libraries, and configuring computational resources. The tutorial then walks through loading and choosing an appropriate model architecture for DNA sequences, followed by setting up a tokenizer to convert DNA sequences into numerical tokens. Data preparation involves loading, tokenizing, splitting, and collating DNA sequences to ensure efficient model training. The training process is detailed with parameter definitions and pretraining steps, culminating in the calculation of DNA sequence embeddings.

We can now leverage the pre-trained model in various bioinformatics applications, such as sequence similarity analysis and functional prediction, offering a robust foundation for integrative biological research.