This tutorial is not in its final state. The content may change substantially in the coming months.
Because of this status, it is also not listed in the topic pages.
Open a Terminal in JupyterLab with File -> New -> Terminal
Run wget https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-finetuning/statistics-genomic-llm-finetuning.ipynb
Select the notebook that appears in the list of files on the left.
After preparing, training, and utilizing a language model for DNA sequences, we can now fine-tune a pre-trained Large Language Model (LLM) for specific DNA sequence classification tasks. Here, we will use a pre-trained model from Hugging Face, specifically the Mistral-DNA-v1-17M-hg38, and adapt it to classify DNA sequences based on their biological functions. Our objective is to classify sequences according to whether they bind to transcription factors.
Comment: Transcription factors
Transcription factors are proteins that play a crucial role in regulating gene expression by binding to specific DNA sequences, known as enhancers or promoters. These proteins act as molecular switches, turning genes on or off in response to various cellular signals and environmental cues. By binding to DNA, transcription factors either promote or inhibit the recruitment of RNA polymerase, the enzyme responsible for transcribing DNA into RNA, thereby influencing the rate of transcription.
Figure 1: Two types of DNA sequences. On the left, a DNA sequence that binds the transcription factor CTCF. On the right, a DNA sequence that does not bind CTCF.
Transcription factors are essential for numerous biological processes, including cell differentiation, development, and response to external stimuli. Their ability to recognize and bind specific DNA sequences allows them to orchestrate complex gene expression programs, ensuring that the right genes are expressed at the right time and in the right place within an organism. Understanding the function and regulation of transcription factors is vital for deciphering the molecular mechanisms underlying health and disease, and it opens avenues for developing targeted therapeutic interventions.
This classification task is crucial for understanding gene regulation, as transcription factors play a vital role in controlling which genes are expressed in a cell. By training a model to predict whether a DNA sequence binds to a transcription factor, we can gain insights into regulatory mechanisms and potentially identify novel binding sites or understand the impact of genetic variations on transcription factor binding.
By fine-tuning the model, we aim to leverage its pre-trained knowledge of DNA sequences to achieve high accuracy in this classification task. This tutorial will guide you through the necessary steps, from data preparation to model evaluation, ensuring you can apply these techniques to your own research or projects.
We will use Mistral-DNA-v1-17M-hg38, a mixed model that was pre-trained on the entire Human Genome. It contains approximately 17 million parameters and was trained using the Human Genome assembly GRCh38 on sequences of 10,000 bases (10K):
accelerate is a library by Hugging Face – a platform that provides tools and resources for building, training, and deploying machine learning models – designed to simplify training and deployment across different hardware environments. It provides tools to optimize performance on GPUs, TPUs, and other accelerators, making it easier to scale models efficiently.
The PEFT (Parameter-Efficient Fine-Tuning) Python library, developed by Hugging Face, is a tool designed to efficiently adapt large pretrained models to various downstream tasks without the need to fine-tune all of the model’s parameters. By focusing on a small subset of parameters, PEFT significantly reduces computational and storage costs, making it feasible to fine-tune large language models (LLMs) on consumer-grade hardware. The library integrates seamlessly with the Hugging Face ecosystem, including Transformers, Diffusers, and Accelerate, enabling streamlined model training and inference. PEFT supports techniques like LoRA (Low-Rank Adaptation) and prompt tuning, and it can be combined with quantization to further optimize resource usage. Its open-source nature fosters collaboration and accessibility, allowing developers to customize models for specific applications quickly and efficiently.
torch, also known as PyTorch, is an open-source machine learning library developed by Facebook’s AI Research lab. It provides a flexible platform for building and training neural networks, with a focus on tensor computations and automatic differentiation.
transformers is a library by Hugging Face that provides implementations of state-of-the-art transformer models for natural language processing (NLP). It includes pre-trained models and tools for fine-tuning, making it easier to apply transformers to various NLP tasks.
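If these libraries are not already available in your JupyterLab environment, they can be installed directly from the notebook. The exact package list below is an assumption based on the libraries described above, with bitsandbytes added because the 4-bit quantization configured later depends on it:

!pip install accelerate peft torch transformers bitsandbytes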
Quantization is a technique used in machine learning and signal processing to reduce the precision of numerical values, typically to decrease memory usage and computational requirements. This process is particularly useful when working with large models as it allows them to be deployed on hardware with limited resources without significantly sacrificing performance.
Here, we use BitsAndBytesConfig to configure 4-bit quantization:
load_in_4bit=True: Specifies that the model should be loaded with 4-bit quantization. Using 4-bit precision reduces the memory footprint of the model, which is particularly useful for very large models that might not fit into GPU memory otherwise.
bnb_4bit_use_double_quant=True: enables double quantization, which means that the quantization constants from the first quantization are quantized again. This further reduces the memory footprint, although it may introduce additional computational overhead.
bnb_4bit_compute_dtype=torch.bfloat16: sets the compute data type to bfloat16 (Brain Floating Point 16-bit format). Using bfloat16 can provide a good balance between computational efficiency and numerical stability, especially on hardware that supports this format, such as certain GPUs and TPUs.
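Putting these options together, a minimal sketch of the quantization configuration looks as follows (the variable name bnb_config is the one reused later when loading the model):

import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)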
Configure Accelerate
Now, we will configure the Hugging Face Accelerate library to optimize the training process for large models using Fully Sharded Data Parallel (FSDP). This setup is crucial for efficiently utilizing GPU resources and enabling distributed training across multiple devices.
First, we need to configure the FSDP plugin, which will manage how model parameters and optimizer states are sharded across GPUs. This configuration helps in reducing memory usage and allows for the training of larger models.
FullStateDictConfig: Configures how the model’s state dictionary (parameters) is managed.
offload_to_cpu=True: Specifies that the model’s parameters should be offloaded to CPU memory when not in use. This helps free up GPU memory, especially useful when working with large models.
rank0_only=False: Indicates that the state dictionary operations (like saving and loading) are not restricted to the rank 0 process. This allows all processes to participate in these operations, which can be beneficial for distributed training setups.
FullOptimStateDictConfig: Configures how the optimizer’s state dictionary is managed.
offload_to_cpu=True: Similar to the model’s state dictionary, this setting offloads the optimizer states to CPU memory when not in use, further reducing GPU memory usage.
rank0_only=False: Allows all processes to handle the optimizer state dictionary operations, ensuring that the optimizer states are managed efficiently across the distributed setup.
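Assembled into code, a sketch of this plugin configuration might look like the following (assuming the configuration classes come from torch.distributed.fsdp and the plugin from accelerate):

from accelerate import FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullOptimStateDictConfig,
    FullStateDictConfig,
)

# Shard model and optimizer states across GPUs, offloading them to CPU when idle
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)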
Next, we initialize the Accelerator from the Hugging Face Accelerate library, integrating the FSDP plugin for seamless distributed training:
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
By passing the FSDP plugin to the Accelerator, we enable sharded data parallelism, which efficiently manages model and optimizer states across multiple GPUs.
With this configuration, the Accelerator will handle the complexities of distributed training, allowing us to focus on developing and experimenting with our models. This setup is particularly beneficial when working with large-scale models and limited GPU resources, as it optimizes memory usage and enables faster training times.
Configure LoRA for Parameter-Efficient Fine-Tuning
We will configure the LoRA (Low-Rank Adaptation) settings for parameter-efficient fine-tuning of a large language model. LoRA is a technique that allows us to fine-tune only a small number of additional parameters while keeping the original model weights frozen, making it highly efficient for adapting large models to specific tasks.
We use the LoraConfig class to define the settings for LoRA. This configuration specifies how the low-rank adaptations are applied to the model.
r=16: This parameter specifies the rank of the low-rank matrices used in the adaptation. A higher rank allows the model to capture more complex patterns but also increases the number of trainable parameters.
lora_alpha=16: This scaling factor controls the magnitude of the updates applied by the low-rank matrices. It helps balance the influence of the adaptations relative to the original model weights.
lora_dropout=0.05: Dropout is applied to the low-rank matrices during training to prevent overfitting. A dropout rate of 0.05 means that 5% of the elements are randomly set to zero during each training step.
bias="none": This setting specifies that no bias parameters are added to the low-rank adaptations. Other options include “all” to add biases to all layers or “lora_only” to add biases only to the LoRA layers.
task_type="SEQ_CLS": This indicates that the model is being fine-tuned for a sequence classification task. Other task types might include “CAUSAL_LM” for causal language modeling or “SEQ_2_SEQ_LM” for sequence-to-sequence tasks.
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"]: This list specifies the modules within the model architecture to which the LoRA adaptations will be applied. These modules are typically the attention layers in transformer models:
"q_proj": query projections
"k_proj": key projections
"v_proj": value projections
"o_proj": output projections
"gate_proj": gating projections in some architectures.
By configuring LoRA in this way, we can efficiently adapt a large pretrained model to a specific task with minimal computational overhead, making it feasible to fine-tune on consumer-grade hardware. This approach is particularly useful for tasks like text classification, sentiment analysis, or any other application where we need to specialize a general-purpose language model.
Configure Training Arguments
Let’s now set up the training arguments using the TrainingArguments class from the Hugging Face Transformers library. These arguments define the training configuration, including hyperparameters and settings for saving and evaluating the model.
output_dir="./results": Specifies the directory where the model predictions and checkpoints will be saved.
evaluation_strategy="epoch": The model will be evaluated at the end of each epoch. This allows for monitoring the model’s progress and adjusting the training process as needed.
save_strategy="epoch": The model checkpoints will be saved at the end of each epoch. This ensures that checkpoints are available for each complete pass through the dataset.
learning_rate=1e-5: Sets the initial learning rate for the optimizer. This rate determines how much the model’s weights are updated during training.
per_device_train_batch_size=16: The number of samples per device (e.g., GPU) to load for training.
per_device_eval_batch_size=16: The number of samples per device to load for evaluation.
num_train_epochs=5: The total number of training epochs. An epoch is one complete pass through the training dataset.
weight_decay=0.01: Applies L2 regularization to the model weights to prevent overfitting.
bf16=True: Enables mixed precision training using bfloat16, which can speed up training and reduce memory usage on compatible hardware.
report_to="none": Disables reporting to external services like WandB or TensorBoard. If you want to track metrics, you can set this to “wandb”, “tensorboard”, etc.
load_best_model_at_end=True: Ensures that the best model based on evaluation metrics is loaded at the end of training.
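As a sketch, these settings translate into the following TrainingArguments (the variable name training_args is reused later when initializing the Trainer):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    bf16=True,
    report_to="none",
    load_best_model_at_end=True,
)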
These settings provide a balanced configuration for training a model efficiently while ensuring that the best version of the model is saved and can be used for further evaluation or deployment. Adjust these parameters based on your specific use case and available computational resources.
Prepare the tokenizer
We will now set up the tokenizer to convert DNA sequences into numerical tokens that the model can process. The tokenizer is a crucial component in preparing the data for model training and inference: it transforms raw text into a format that can be processed by machine learning models.
We use the AutoTokenizer class from the Hugging Face Transformers library to load a pre-trained tokenizer. We specify the pre-trained model from which to load the tokenizer. This should match the model you plan to use for training or inference. This tokenizer will be configured to handle DNA sequences efficiently.
model_max_length=200: Sets the maximum length of the tokenized sequences. Sequences longer than this will be truncated, and shorter ones will be padded.
padding_side="right": Specifies that padding should be added to the right side of the sequences. This ensures that all sequences in a batch have the same length.
use_fast=True: Enables the use of the fast tokenizer implementation, which is optimized for speed and is suitable for most use cases.
trust_remote_code=True: Allows the tokenizer to execute custom code from the model repository, which may be necessary for some models that require specific preprocessing steps.
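As a sketch, and assuming the pre-trained model is available on the Hugging Face Hub under the identifier RaphaelMourad/Mistral-DNA-v1-17M-hg38, the tokenizer can be loaded as follows:

from transformers import AutoTokenizer

# Assumed Hub identifier of the pre-trained model used in this tutorial
model_name = "RaphaelMourad/Mistral-DNA-v1-17M-hg38"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    model_max_length=200,
    padding_side="right",
    use_fast=True,
    trust_remote_code=True,
)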
By configuring the tokenizer in this way, we ensure that our DNA sequences are properly tokenized and formatted for input into the model. This step is essential for preparing our data for efficient and effective model training and evaluation.
Let’s now tailor the tokenizer to better suit our specific use case, ensuring that the model processes sequences accurately and efficiently. Special tokens play a crucial role in defining how sequences are processed and interpreted by the model. Here, we set:
the end-of-sequence (EOS) token, which indicates the end of a sequence. It is essential for tasks where the model needs to generate sequences or understand where a sequence ends.
the padding (PAD) token, which is used to pad sequences to a uniform length within a batch. Padding ensures that all sequences in a batch have the same length, which is necessary for efficient processing during training and inference
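A minimal sketch, assuming the token strings [EOS] and [PAD] (hypothetical; adapt them to the tokens actually defined by the model’s tokenizer):

# Define the end-of-sequence and padding tokens used by the tokenizer
tokenizer.eos_token = "[EOS]"
tokenizer.pad_token = "[PAD]"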
To fine-tune the model, we must provide a dataset to train it on. We will use the data for the first transcription factor (tf0) in mouse from Zhou et al. 2024. The data is stored on GitHub.
We change the current working directory to the Mistral-DNA folder.
os.chdir("Mistral-DNA/")print(os.getcwd())
Let’s define the experiment and data path variables:
expe="tf/0"data_path=f"data/GUE/{expe}"
Prepare Datasets for Training and Validation
We now need to set up the datasets required for training and validation. Properly preparing these datasets is crucial for ensuring that the model fine-tunes effectively and generalizes well to new data.
We will use the files in the data_path folder we just defined:
train.csv for training
dev.csv for validation
Question
What does the content of each file look like?
train.csv
dev.csv
The 2 files are CSV files with 2 columns (sequence and label) and a different number of rows:
train.csv: 32,379 rows.
dev.csv: 1,000 rows
Values in label are:
0: The DNA sequence in sequence column does not bind to the 1st transcription factor.
1: The DNA sequence in sequence column binds to the transcription factor.
We use the SupervisedDataset class to load and prepare the datasets. This class handles the tokenization and formatting of the data, making it ready for model training and evaluation.
The k-mer length parameter of this class specifies the length of the k-mers (substrings of length k) to be considered in the dataset. A value of -1 means that no k-mer splitting is applied and the sequences are processed as they are.
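A sketch of how the training and validation datasets could be built with this class, assuming SupervisedDataset accepts tokenizer, data_path, and kmer arguments:

# Tokenize and format the training and validation data
train_dataset = SupervisedDataset(
    tokenizer=tokenizer,
    data_path=f"{data_path}/train.csv",
    kmer=-1,  # no k-mer splitting; sequences are used as-is
)
val_dataset = SupervisedDataset(
    tokenizer=tokenizer,
    data_path=f"{data_path}/dev.csv",
    kmer=-1,
)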
Configure Data Collation
A data collator ensures that sequences are properly padded and formatted, which is crucial for optimizing the training process.
We’ll use the DataCollatorForSupervisedDataset class to handle the collation of tokenized data. This collator will manage padding and ensure that all sequences in a batch are of uniform length.
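A one-line sketch, assuming the collator takes the tokenizer as its only argument:

# Pads tokenized sequences so every batch has a uniform length
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)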
Load and Configure the Model for Sequence Classification
Let’s now load the pre-trained model, a model originally trained for large language modeling tasks, not specifically for classification. To adapt it for our binary classification task, we will add a new classification head on top of the existing architecture. This head will consist of a single neuron that connects to the output of the language model, enabling it to classify whether a DNA sequence binds to a transcription factor (label 1) or not (label 0).
This additional layer, or classification head, is a simple neural network layer that takes the high-level features extracted by the language model and maps them to our binary classification output. It learns to weigh these features appropriately to make accurate predictions for our specific task.
We use the AutoModelForSequenceClassification class from the Hugging Face Transformers library to load the pre-trained model and set it up for our specific classification task:
num_labels=2: Sets the number of output labels to 2, corresponding to the binary classification task (binding or not binding to transcription factors).
output_hidden_states=False: Indicates that the model should not output hidden states. This is typically set to False unless you need access to the intermediate representations for further analysis.
quantization_config=bnb_config: Applies the quantization configuration defined earlier, which helps reduce memory usage and enables efficient training on consumer-grade hardware.
device_map="auto": Automatically determines the best device placement for the model’s layers based on the available hardware: if a GPU is found, it is used; otherwise the model runs on the CPU.
trust_remote_code=True: Allows the model to execute custom code from the model repository, which may be necessary for certain architectures or preprocessing steps.
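Put together, a sketch of loading the model, reusing the model identifier and the quantization configuration defined earlier:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    output_hidden_states=False,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)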
To ensure that the model correctly handles padding tokens, we need to align the padding token configuration between the model and the tokenizer. This step is crucial for maintaining consistency during training and inference, especially when dealing with sequences of varying lengths:
model.config.pad_token_id = tokenizer.pad_token_id
Initialize the Trainer
We can now set up the Trainer to manage the training and evaluation process of our model. The Trainer class simplifies the training loop, handling many of the complexities involved in training deep learning models.
Before setting up the Trainer, we load a custom function (compute_metrics), stored in scriptPython/functions.py, that computes the evaluation metrics for the Trainer.
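A sketch of the Trainer setup, reusing the objects created in the previous steps (compute_metrics is assumed to have been loaded from the helper script):

from transformers import EarlyStoppingCallback, Trainer

# compute_metrics comes from scriptPython/functions.py, loaded in the previous step
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)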
What does the callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] parameter do?
It adds an early stopping mechanism to the training process. This mechanism is designed to halt training when the model’s performance on the validation set stops improving, helping to prevent overfitting and conserve computational resources.
How does early stopping work?
Purpose: The primary goal of early stopping is to capture the model parameters when the loss reaches its minimum value during training. This is crucial because, after a certain point, continued training may lead to overfitting, where the model starts to perform worse on unseen data.
Patience Parameter: The early_stopping_patience=3 setting specifies that training should continue for three additional epochs after the model’s performance on the validation set stops improving. This “patience” period helps mitigate the effects of noise in the training process. Noise can cause temporary fluctuations in the loss, making it seem like the model has reached a local minimum when further training might yield better results.
Process: During training, the loss is monitored at each epoch. If the loss does not decrease for three consecutive epochs, training is stopped. However, if a better model with a lower loss is found within those three epochs, training continues. This approach ensures that the model has truly reached a robust local minimum, rather than being prematurely halted due to noise.
By incorporating early stopping with a patience of three epochs, you balance the need to find an optimal model with the risk of overfitting, ultimately leading to more efficient and effective training outcomes.
For distributed training, where multiple GPUs or nodes are used to accelerate the training process, it is essential to set:
trainer.local_rank = training_args.local_rank
The local_rank parameter identifies the rank of the current process within its local node, enabling coordinated communication and synchronization between processes. This setup is crucial for managing tasks such as gradient synchronization and data partitioning, ensuring that each process operates on the correct portion of the model or dataset. By assigning the local rank from training_args to the Trainer, we facilitate efficient and scalable training, leveraging the full computational power of multi-GPU environments.
Start the training
Let’s start the training process for our model using the trainer.train() method:
trainer.train()
After launching trainer.train(), we can notice that the training process is significantly faster compared to training a model from scratch, as in the previous tutorial. This efficiency is due to the use of a pre-trained model, which has already undergone extensive training on large datasets using powerful computational resources. For example, pre-training a model on even a small portion of the human genome can take dozens of hours, but fine-tuning this model on a specific task, such as classifying DNA sequences, is much quicker. Fine-tuning leverages the pre-trained model’s foundational knowledge, allowing you to adapt it to new tasks with a smaller, labeled dataset. This approach not only saves time but also reduces the need for extensive computational power. By downloading a pre-trained model from platforms like Hugging Face and fine-tuning it on a local machine with a modest GPU, we can achieve high performance with minimal overhead, making advanced modeling techniques accessible for a wide range of applications.
Evaluate Model Performance
After successfully training the model, the next essential step is to evaluate its performance on a test dataset. This evaluation process is crucial for understanding how well the model generalizes to new, unseen data and for assessing its readiness for real-world applications.
Comment
If fine-tuning takes too long, you can stop the training.
The test data is stored in data_path/test.csv; we prepare it in the same way as the training and validation data.
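A sketch, reusing the same SupervisedDataset class and settings as for the training and validation data:

# Tokenize and format the test data exactly like the training and validation data
test_dataset = SupervisedDataset(
    tokenizer=tokenizer,
    data_path=f"{data_path}/test.csv",
    kmer=-1,
)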
We then use the trainer.evaluate() method. This method is designed to assess the model’s performance on a specified dataset, typically the test dataset, which contains data that the model has not encountered during training.
The method computes various evaluation metrics, such as accuracy, precision, recall, and F1 score, depending on the task and the configuration specified in compute_metrics. These metrics provide a comprehensive view of the model’s performance, highlighting its strengths and weaknesses.
The Trainer uses the data_collator to ensure that the test data is properly formatted and padded, maintaining consistency with the training process. This consistency is crucial for accurate evaluation.
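A sketch of the evaluation call, assuming the test dataset prepared above:

# Run the evaluation loop on the held-out test set and print the metrics
results = trainer.evaluate(eval_dataset=test_dataset)
print(results)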
The evaluation results are stored in the results variable, which contains the computed metrics. We can analyze these results to gain insights into the model’s performance and make informed decisions about further improvements or deployment.
In this tutorial, we explored the process of fine-tuning a large language model (LLM) for DNA sequence classification. By following the steps outlined, you have learned how to leverage pre-trained models to achieve efficient and effective classification of DNA sequences, specifically focusing on their binding affinity to transcription factors.
We began by configuring the fine-tuning process, ensuring that available computational resources were optimally utilized. This included specifying settings for quantization, configuring Accelerate for distributed training, and implementing LoRA for parameter-efficient fine-tuning. These steps were crucial for maximizing performance and minimizing computational overhead.
Next, we prepared the tokenizer and data, ensuring that DNA sequences were properly tokenized and formatted for model input. We created datasets for training and validation, and configured data collation to handle batch processing efficiently.
We then loaded and configured the model for sequence classification, adding a classification head to adapt the pre-trained model to our specific task. With the model and data prepared, we initialized the Trainer, which streamlined the training process by managing the training loop, evaluation, and checkpointing.
You've Finished the Tutorial
Please also consider filling out the Feedback Form!
Key points
Fine-tuning pre-trained LLMs reduces training time and computational needs, making advanced research accessible.
Techniques like LoRA enable fine-tuning on modest hardware, broadening access to powerful models.
Rigorous testing on unseen data confirms a model’s practical applicability and reliability.
Frequently Asked Questions
Have questions about this tutorial? Have a look at the available FAQ pages and support channels
Zhou, Z., Y. Ji, W. Li, P. Dutta, R. Davuluri et al., 2024 DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. https://arxiv.org/abs/2306.15006