Mapping

Overview

question Questions
  • What is mapping?
  • What two things are crucial for a correct mapping?
  • What is BAM?
objectives Objectives
  • You will learn what mapping is
  • A genome browser is shown that helps you to understand your data
requirements Requirements

time Time estimation: 1 hour

Introduction

Sequencing produces a collection of sequences without genomic context. We do not know to which part of the genome the sequences correspond to. Mapping the reads of an experiment to a reference genome is a key step in modern genomic data analysis. With the mapping the reads are assigned to a specific location in the genome and insights like the expression level of genes can be gained.

The short reads do not come with position information, so we do not know what part of the genome they came from. We need to use the sequence of the read itself to find the corresponding region in the reference sequence. But the reference sequence can be quite long (~3 billion bases for human), making it a daunting task to find a matching region. Since our reads are short, there may be several, equally likely places in the reference sequence from which they could have been read. This is especially true for repetitive regions.

In principle, we could do a BLAST analysis to figure out where the sequenced pieces fit best in the known genome. We would need to do that for each of the millions of reads in our sequencing data. Aligning millions of short sequences this way may, however, take a couple of weeks. And we do not really care about the exact base to base correspondence (alignment). What we are really interested in is “where these reads came from”. This approach is called mapping.

In the following we will process a dataset with the mapper Bowtie2 and we will visualize the data with the program IGV.

Agenda

In this tutorial, we will deal with:

  1. Prepare the data
  2. Map reads on a reference genome
  3. Inspection of a BAM file
  4. Visualization using a Genome Browser

Prepare the data

hands_on Hands-on: Data upload

  1. Create a new history for this tutorial and give it a proper name

    tip Tip: Create a new history

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Click on Create New

    tip Tip: Rename an history

    1. Click on Unnamed history (or the name of the history) (Click to rename history) on the top of your history
    2. Write the new name
    3. Type on Enter
  2. Import wt_H3K4me3_read1.fastq.gz and wt_H3K4me3_read2.fastq.gz from Zenodo or from the data library (ask your instructor)

    https://zenodo.org/record/1324070/files/wt_H3K4me3_read1.fastq.gz
    https://zenodo.org/record/1324070/files/wt_H3K4me3_read2.fastq.gz
    
    • Copy the link location
    • Open the Galaxy Upload Manager
    • Select Paste/Fetch Data
    • Paste the link into the text field
    • Press Start and Close the window

    tip Tip: Importing data from a data library

    • Go into Shared data (top panel) then Data libraries
    • Find the correct folder (ask your instructor)
    • Select interesting files
    • Click on Import selected datasets into history
    • Import in the history

    As default, Galaxy takes the link as name, so rename them.

  3. Rename the files to reads_1 and reads_2

    tip Tip: Rename a dataset

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

We just imported in Galaxy FASTQ files corresponding to paired-end data as we could get directly from a sequencing facility.

During sequencing, errors are introduced, such as incorrect nucleotides being called. Sequencing errors might bias the analysis and can lead to a misinterpretation of the data. The first step for any type of sequencing data is to check their quality.

comment Check the Quality control tutorial

The quality control tutorial is explaining this step. We will not going into the details here, in particular for the parameters.

hands_on Hands-on: Quality control

  1. FastQC tool on both datasets
  2. MultiQC tool on the outputs of FastQC tool
  3. Trim Galore! tool on the paired-end datasets

Map reads on a reference genome

Read mapping is the process to align the reads on a reference genomes. A mapper takes as input a reference genome and a set of reads. Its aim is to align each read in the set of reads on the reference genome, allowing mismatches, indels and clipping of some short fragments on the two ends of the reads:

Explanation of mapping
Figure 1: Illustration of the mapping process. The input consists of a set of reads and a reference genome. It the middle, it gives the results of mapping: the locations of the reads on the reference genome. The first read is aligned at position 100 and the alignment has two mismatches. The second read is aligned at position 114. It is a local alignment with clippings on the left and on the right. The third read is aligned at position 123. It consists of a 2-base insertion and a 1-base deletion.

We need a reference genome to map the reads on.

question Questions

  1. What is a reference genome?
  2. For each model organism, several possible reference genomes may be available (e.g. hg19 and hg38 for human). What do they correspond to?
  3. Which reference genome should we use?

solution Solution

  1. A reference genome (or reference assembly) is a set of nucleic acid sequences assembled as a representative example of a species’ genetic material. As they are often assembled from the sequencing of different individuals, they do not accurately represent the set of genes of any single organism, but a mosaic of different nucleic acid sequences from each individual.
  2. As the cost of DNA sequencing falls, and new full genome sequencing technologies emerge, more genome sequences continue to be generated. Using these new sequences, new alignments are builts and the reference genomes improved (fewer gaps, fixed misrepresentations in the sequence, etc). The different reference genomes correspond to the different released versions (called “builds”).
  3. This data comes from ChIP-seq of mices, so we will use mm10 (Mus musculus).

Currently, there are over 60 different mappers, and their number is growing. In this tutorial, we will use Bowtie2, a fast and memory-efficient open-source tool particularly good at aligning sequencing reads of about 50 up to 1,000s of bases to relatively long genomes.

hands_on Hands-on: Mapping with Bowtie2

  1. Bowtie2 tool with the following parameters
    • “Is this single or paired library”: Paired-end
      • param-file “FASTA/Q file #1”: trimmed reads pair 1 (output of Trim Galore! tool)
      • param-file “FASTA/Q file #2”: trimmed reads pair 2 (output of Trim Galore! tool)
      • “Do you want to set paired-end options?”: No

        You should have a look at the parameters there, specially the mate orientation if you know it. They can improve the quality of the paired-end mapping.

    • “Will you select a reference genome from your history or use a built-in index?”: Use a built-in genome index
      • “Select reference genome”: Mouse (Mus musculus): mm10
    • “Select analysis mode”: Default setting only

      You should have a look at the non default parameters and try to understand them. They can have an impact on the mapping and improving it.

    • “Save the bowtie2 mapping statistics to the history”: Yes
  2. Inspect the mapping stats file by clicking on the galaxy-eye (eye) icon

    question Questions

    1. What information is provided here?
    2. How many reads have been mapped exactly 1 time?
    3. How many reads have been mapped more than 1 time? How is it possible? What should we do with them?
    4. How many pair of reads have not been mapped? What are the causes?

    solution Solution

    1. The information given here is a quantity one. We can see how many sequences are aligned. It does not tell us something about the quality.
    2. ~90% reads have been aligned exactly 1 time
    3. ~7% reads have been aligned concordantly >1 times. These are called multi-mapped reads. It can happen because of repetitions in the reference genome (multiple copies of a gene for example), specially when the reads are small. It is difficult to decide where these sequences come from so most pipelines ignores them. Always check the statistics there to be sure of not removing to much information by discarding them in any downstream analyses.
    4. ~3% pair of reads have not been mapped because
      • both reads in the pair aligned but their positions do not concord with pair of reads (aligned discordantly 1 time)
      • reads of these pairs are multi-mapped (aligned >1 times in pairs aligned 0 times concordantly or discordantly)
      • one read of these paires are mapped but not the paired read (aligned exactly 1 time in pairs aligned 0 times concordantly or discordantly)
      • the rest are not mapped at all

Checking the mapping statistics is an important step to do before continuing any analyses. There are several potential sources for errors in mapping, including (but not limited to):

So if the mapping statistics are not good, you should investigate the cause of these errors before going further in your analyses.

After that, you should have a look at the reads and inspect the BAM file where the read mappings are stored.

Inspection of a BAM file

A BAM (Binary Alignment Map) file is a compressed, binary file storing the sequences mapped to a reference sequence.

hands_on Hands-on: Inspect a BAM/SAM file

  1. Inspect the param-file output of Bowtie2 tool

A BAM file (or a SAM file, the non compressed version) consists of:

question Questions

  1. Which information do you find in a SAM/BAM file?
  2. What is the additional information compared to a FASTQ file?

solution Solution

  1. Sequences and quality information, like a FASTQ
  2. Mapping information, Location of the read on the chromosome, Mapping quality, etc

So the BAM file integrates many information for each read, in particular the quality of mapping.

hands_on Hands-on: Summary of mapping quality

  1. Stats generate statistics for BAM dataset tool with the following parameters
    • param-file “BAM file”: aligned reads (output of Bowtie2 tool)
    • “Use reference sequence”: Use reference
      • “Choose a reference sequence for GC depth”: Locally cached
        • “Using genome”: Mouse (Mus musculus): mm10 Full
  2. Inspect the param-file Stats file

    question Questions

    1. Which proportion of mismatches are in the mapped reads when aligned to the reference genome?
    2. What does the error rate represent?
    3. What is the average quality? How is it represented?
    4. What is the insert size average?
    5. How many reads have a mapping quality score below 20?

    solution Solution

    1. There are ~21,900 mismatches for ~4,753,900 bases mapped so ~ mismatches per mapped bases
    2. The error rate is the proportion of mismatches per mapped bases, so the ratio computed right before
    3. The average quality is the mean quality score of the mapping. It is a Phred score, as the one used in the FASTQ file for each nucleotide. But here the score is not per nucleotide, but per read. And it represents the probability of mapping quality
    4. The insert size is the distance between the two reads in the pairs.
    5. To get the info:
      1. Filter BAM datasets on a variety of attributes tool with a filter to keep only the reads with a mapping quality >= 20
      2. Stats generate statistics for BAM dataset tool on the output of Filter

      Before filtering: 95,412 reads - After filtering: 89,664 reads

Visualization using a Genome Browser

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. In the following we will use it to visualize the mapped reads.

hands_on Hands-on: Visualization of the reads in IGV

  1. Install IGV (if not already installed)
  2. Launch IGV on your computer
  3. Expand the param-file output of Bowtie2 tool
  4. Click on the local in display with IGV to load the reads into the IGV browser
  5. Zoom on the chr2:91,053,413-91,055,345

The reads have a direction: they are mapped to the forward or reverse strand, respectively. When hovering over a read, extra information is displayed

question Questions

  1. What could it mean if a bar in the coverage view is colored?
  2. What could be the reason why a read is white instead of grey?

solution Solution

  1. If a nucleotide differs from the reference sequence in more than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base.
  2. They have a mapping quality equal to zero. Interpretation of this mapping quality depends on the mapping aligner as some commonly used aligners use this convention to mark a read with multiple alignments. In such a case, the read also maps to another location with equally good placement. It is also possible the read could not be uniquely placed but the other placements do not necessarily give equally good quality hits.

tip Tips for IGV

  1. Because the number of reads over a region can be quite large, the IGV browser by default only allows to see the reads that fall into a small window. This behaviour can be changed in the IGV from view > Preferences > Alignments.
  2. If the genome of your interest is not there check if it is available via More…. If this is not the case, you can add it manually via the menu Genomes -> Load Genome from…

    Select genome in IGV

A general description of the user interface of the IGV browser is available here: IGV Browser description

Conclusion

After quality control, mapping is an important step of most analyses of sequencing data (RNA-Seq, ChIP-Seq, etc) to determine where in the genome our reads originated from and use this information for downstream analyses.

keypoints Key points

  • Know your data!
  • Mapping is not trivial
  • There are many mapping algorithms, it depends on your data which one to choose

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

congratulations Congratulations on successfully completing this tutorial!