Formation of the Super-Structures on the Inactive X

Overview

question Questions
  • Histone modification?
  • How is a raw set of ChIP-seq data processed and analyzed?
  • Where are the enriched regions for H3K27me3, H3K4me3 and CTCF on the chrX?
objectives Objectives
  • Inspect the read quality
  • Trim low quality bases
  • Map reads on a reference genome
  • Assess the quality of a ChIP-seq experiment
  • Extract coverage files
  • Call enriched regions or peaks
requirements Requirements

time Time estimation: 3 hours

Introduction

Within a cell nucleus, the DNA is tightly-packed and the chromatin is spatially distributed with different levels and scales of organizations.

At the smallest scale, DNA is packaged into units called nucleosomes, made of eight histone proteins.

On a larger scale than nucleosomes, DNA is forming loops. DNA elements that would be otherwise separated by large distances can interact. The corresponding self-interacting (or self-associating) domains are found in many organisms: they are called Topologically Associating Domains (TADs) in mammalian cells. Mammalian chromosomes are also partitioned into two spatial compartments, labeled “A” and “B”, where regions in compartment A tend to interact preferentially with A compartment-associated regions than B compartment-associated ones. Similarly, regions in compartment B tend to associate with other B compartment-associated regions.

comment Tip: Learn more about chromosome conformation

To learn more about chromosome conformation and TADs, you can follow our HiC tutorial

In mammals, the X chromosome inactivation (XCI) balances the dosage of X-linked genes between females and males. The genes on the inactive X (Xi) chromosome are not expressed.

Binding certain proteins to each of the eight histone proteins may modify the chromatin structure and may result in changes in transcription level. For example, the H3K4me3 is adding 3 methyl-group of the 4th Lysine in the histone 3 amino-acid. This modification is known to activate the transcription on nearby genes by opening the chromatin. The H3K27me3 on the other hand is inactivating the transcription of the nearby genes:

Fadloun et al, 2013
Figure 1: Source: Fadloun et al, 2013

More H3K27me3 and less H3K4me3 on the Xi could explain a lower expression of the genes there.

It has been also observed that the Xi reconfigures uniquely into a specific chromosomal conformation. cohesins, condensins and CCCTC-binding factor (CTCF) play key roles in chromosomal architectures and TAD formation which are other potential cause of the repression of the expression of the genes on Xi.

The structural-maintenance-of-chromosomes hinge domain containing 1 (SMCHD1) has been found enriched on the Xi. It may be the potential actor in the shape of Xi and the change in gene expression there.

Wang et al. (2018) investigates the mechanism by which the SMCHD1 gene shapes the Xi and represses the expression of the genes on Xi in mouse.

Their idea was to identify the differences which could be observed between the Xi and activated X chromosome, on both wild-type and SMCHD1 gene knockdown samples to study the SMCHD1 effect. In different experiments, they targeted histones with H3K27me3 or H3K4me3 and CTCF using ChIP-seq experiments.

ChIP-seq workflow
Figure 2: Source: http://e.biohackers.net/ChIP-seq

They obtained sequences corresponding to a portion of DNA linked to histones with H3K27me3, H3K4me3 or CTCF. Using this information, they could identify if there are differences in the H3K27me3, H3K4me3 and CTCF between the X (active or inactive) chromosomes and the potentially influenced genes.

In the upcoming tutorial, we will only use the wild type data from Wang et al. (2018) and analyze the ChIP-seq data step by step:

  • CTCF with 2 replicates: wt_CTCF_rep1 and wt_CTCF_rep2
  • H3K4me3 with 2 replicates: wt_H3K4me3_rep1 and wt_H3K4me3_rep2
  • H3K27me3 with 2 replicates: wt_H3K27me3_rep1 and wt_H3K27me3_rep2
  • ‘input’ with 2 replicates: wt_input_rep1 and wt_input_rep2

    In ‘input’ samples, the same treatment as the ChIP-seq samples was done except for the immunoprecipitation step. They are also used along with the ‘ChIP-seq’ samples to identify the potential sequencing bias and help for differential analysis.

Agenda

  1. Step 1: Quality control and treatment of the sequences
  2. Step 2: Mapping of the reads
    1. Running Bowtie2
    2. Inspection of a BAM file
    3. Visualization using a Genome Browser
  3. Step 3: ChIP-seq Quality Control
    1. Correlation between samples
    2. IP strength estimation
  4. Step 4: Normalization
    1. Generation of coverage files normalized by sequencing depth
    2. Generation of input-normalized coverage files
  5. Step 5: Detecting enriched regions (peak calling)
  6. Step 6: Plot the signal between samples

Step 1: Quality control and treatment of the sequences

The first step of any ChIP-Seq data analysis is quality control of the raw sequencing data.

To save time, we will do it only on the data of one sample wt_H3K4me3_rep1 which has been already down-sampled. keep in mind that with real data this should be done on each and every sample.

hands_on Hands-on: Import the data

  1. Create a new history for this tutorial and give it a proper name

    tip Tip: Creating a new history

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Click on Create New

    tip Tip: Renaming a history

    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter
  2. Import wt_H3K4me3_read1.fastq.gz and wt_H3K4me3_read2.fastq.gz from Zenodo or from the data library (ask your instructor)

    https://zenodo.org/record/1324070/files/wt_H3K4me3_read1.fastq.gz
    https://zenodo.org/record/1324070/files/wt_H3K4me3_read2.fastq.gz
    
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

    tip Tip: Importing data from a data library

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    • Go into Shared data (top panel) then Data libraries

    • Find the correct folder (ask your instructor)

    • Select the desired files
    • Click on the To History button near the top and select as Datasets from the dropdown menu
    • In the pop-up window, select the history you want to import the files to (or create a new one)
    • Click on Import

    As default, Galaxy takes the link as name, so rename them.

  3. Rename the files wt_H3K4me3_read1 and wt_H3K4me3_read2

    tip Tip: Renaming a dataset

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button
  4. Inspect the first file by clicking on the galaxy-eye (eye) icon (View data)

question Questions

  1. How are the DNA sequences stored?
  2. What are the other entries of the file?

solution Solution

  1. The DNA sequences are stored in a FASTQ file, in the second line of every 4-line group.
  2. This file format is called FASTQ format. It stores sequence information and quality information. Each sequence is represented by a group of 4 lines with the 1st line being the sequence id, the second the sequence of nucleotides, the third a transition line and the last one a sequence of quality score for each nucleotide.

During sequencing, errors are introduced, such as incorrect nucleotides being called. These are due to the technical limitations of each sequencing platform. Sequencing errors might bias the analysis and can lead to a misinterpretation of the data.

Sequence quality control is therefore an essential first step in your analysis. We use here similar tools as described in “Quality control” tutorial: FastQC and Trim Galore.

hands_on Hands-on: Quality control

  1. Run FastQC tool with the following parameters
    • param-files “Short read data from your current history”: wt_H3K4me3_read1 and wt_H3K4me3_read2 (Input datasets selected with Multiple datasets)

    tip Tip: Select multiple datasets

    1. Click on param-files Multiple datasets
    2. Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest
  2. Inspect the generated HTML files

    question Questions

    1. How is the quality of the reads in wt_H3K4me3_read1?
    2. And in wt_H3K4me3_read2?
    3. What should we do if the quality of the reads is not good?

    solution Solution

    1. The reads in wt_H3K4me3_read1 are of good quality:
      • There is 50,000 sequences, all of 51 bp
      • The “Per base sequence quality” is not decreasing too much at the end of the sequences

        Per base sequence quality for read1
        Figure 3: Per base sequence quality
      • The mean quality score over the reads is quite high

        Per sequence quality scores for read1
        Figure 4: Per sequence quality scores
      • Homogeneous percentage of the bases

        Per base sequence content for read1
        Figure 5: Per base sequence content
      • No N in the reads

        Per base N content for read1
        Figure 6: Per base N content
      • No duplicated sequences

        Sequence Duplication Levels for read1
        Figure 7: Sequence Duplication Levels
      • No more known adapters

        Adapter Content for read1
        Figure 8: Adapter Content
    2. The reads in wt_H3K4me3_read2 are a bit worse:

      • The “Per base sequence quality” is decreasing more at the end of the sequences, but it stays correct

        Per base sequence quality for read2
        Figure 9: Per base sequence quality
      • The sequence NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN is over represented

      Despite these small things, the overall quality of the reads is really good.

    3. If the quality of the reads is not good, we should:
      1. Check what is wrong and think about it: it may come from the type of sequencing or what we sequenced (high quantity of overrepresented sequences in transcriptomics data, biaised percentage of bases in HiC data)
      2. Ask the sequencing facility about it
      3. Perform some quality treatment (in a reasonable way to not loose too much information) with some trimming or removal of bad reads

It is often necessary to trim sequenced read, for example, to get rid of bases that were sequenced with high uncertainty (= low quality bases) at the read ends.

question Questions

What is the relation between wt_H3K4me3_read1 and wt_H3K4me3_read2 ?

solution Solution

The data has been sequenced using paired-end sequencing.

The paired-end sequencing is based on the idea that the initial DNA fragments (longer than the actual read length) is sequenced from both sides. This approach results in two reads per fragment, with the first read in forward orientation and the second read in reverse-complement orientation. The distance between both reads is known. Thus, it can be used as an additional piece of information to improve the read mapping.

With paired-end sequencing, each fragment is more covered than with single-end sequencing (only forward orientation sequenced):

    ------>
      ------>
        ------>
          ------>

    ----------------------------- [fragment]

    ------>         <------
      ------>         <------
        ------>         <------
          ------>         <------

The paired-end sequencing generates then 2 files:

  • One file with the sequences corresponding to forward orientation of all the fragments
  • One file with the sequences corresponding to reverse orientation of all the fragments

Here wt_H3K4me3_read1 corresponds to the forward reads and wt_H3K4me3_read2 to the reverse reads.

hands_on Hands-on: Trimming low quality bases

  1. Run Trim Galore! tool with the following parameters
    • “Is this library paired- or single-end?”: Paired-end
      • param-file “Reads in FASTQ format”: wt_H3K4me3_read1 (Input dataset)
      • param-file “Reads in FASTQ format”: wt_H3K4me3_read2 (Input dataset)

      The order is important here!

      tip Tip: Not selectable files?

      If your FASTQ files cannot be selected, you might check whether their format is FASTQ with Sanger-scaled quality values (fastqsanger). You can edit the data type by clicking on the pencil symbol.

    • “Trim Galore! advanced settings”: Full parameter list
      • “Trim low-quality ends from reads in addition to adapter removal”: 15
      • “Overlap with adapter sequence required to trim a sequence”: 3
      • “Generate a report file”: Yes
  2. Inspect the generated txt file (report file)

    question Questions

    1. How many basepairs has been removed from the forwards reads because of bad quality?
    2. And from the reverse reads?
    3. How many sequence pairs have been removed because at least one read was shorter than the length cutoff?

    solution Solution

    1. 32,198 bp (1.3%) (first Quality-trimmed:)
    2. 116,414 bp (4.6%) (second Quality-trimmed:). It is not a surprise: we saw that the quality was dropping more at the end of the sequences for the reverse reads thant for the forward reads.
    3. 1569 (3.14%) sequences (last line of the file)

Step 2: Mapping of the reads

With ChiP sequencing, we obtain sequences corresponding to a portion of DNA linked to the histone mark of interest, H3K4me3 in this case. As H3K4me3 opens the chromatime, nearby genes are gioing to be more transcribed. It would be interesting to know if there is a difference in the quantity of DNA impacted by H3K4me3 and the impacted genes between active and inactive X chromosome.

We first need to figure out where the sequenced DNA fragments originated from in the genome. The reads must be aligned to a reference genome to identify the binding sites.

This is equivalent to solving a jigsaw puzzles, but unfortunately, not all pieces are unique.

In principle, we could do a BLAST analysis to figure out where the sequenced pieces fit best in the known genome. Aligning millions of short sequences this way may, however, take a couple of weeks. And we do not really care about the exact base to base correspondence (alignment). What we are really interested in is “where these reads came from”. This approach is called mapping.

Nowadays, there are many mapping programs for shotgun sequenced DNA. We will use Bowtie2.

question Questions

  1. What is a reference genome?
  2. For each model organism, several possible reference genomes may be available (e.g. hg19 and hg38 for human). What do they correspond to?
  3. Which reference genome should we use?

solution Solution

  1. A reference genome (or reference assembly) is a set of nucleic acid sequences assembled as a representative example of a species’ genetic material. As they are often assembled from the sequencing of different individuals, they do not accurately represent the set of genes of any single organism, but a mosaic of different nucleic acid sequences from each individual.
  2. As the cost of DNA sequencing falls, and new full genome sequencing technologies emerge, more genome sequences continue to be generated. Using these new sequences, new alignments are builts and the reference genomes improved (fewer gaps, fixed misrepresentations in the sequence, etc). The different reference genomes correspond to the different released versions (called “builds”).
  3. Wang et al. (2018) did ChIP-seq on mouse. So we should use the mouse reference genome. We will use mm10 (the latest build)

Running Bowtie2

hands_on Hands-on: Mapping

  1. Bowtie2 tool with
    • “Is this single or paired library”: Paired-end
      • param-file “FASTA/Q file #1”: trimmed reads pair 1 (output of Trim Galore! tool)
      • param-file “FASTA/Q file #2”: trimmed reads pair 2 (output of Trim Galore! tool)
    • “Will you select a reference genome from your history or use a built-in index?”: Use a built-in genome index
      • “Select reference genome”: Mouse (Mus musculus): mm10
    • “Save the bowtie2 mapping statistics to the history”: Yes
  2. Inspect the mapping stats

    question Questions

    How many reads where mapped? Uniquely or several times?

    solution Solution

    The overall alignment rate is 98.64%. This score is quite high. If you have less than 70-80%, you should investigate the cause: contamination, etc.

    43719 (90.27%) reads have been aligned concordantly exactly 1 time and 3340 (6.90%) aligned concordantly >1 times. The latter ones correspond to multiple mapped reads. Allowing for multiple mapped reads increases the number of usable reads and the sensitivity of peak detection; however, the number of false positives may also increase.

The output of Bowtie2 is a BAM file.

Inspection of a BAM file

A BAM (Binary Alignment Map) file is a compressed, binary file storing the sequences mapped to a reference sequence.

hands_on Hands-on: Inspect a BAM/SAM file

  1. Inspect the param-file output of Bowtie2 tool

A BAM file (or a SAM file, the non compressed version) consists of:

  • A header section (the lines starting with @) containing metadata, in particular the chromosome names and lengths (lines starting with the @SQ symbol)
  • An alignment section consisting of a table with 11 mandatory fields, as well as a variable number of optional fields:

    Col Field Type Brief Description
    1 QNAME String Query template NAME
    2 FLAG Integer bitwise FLAG
    3 RNAME String References sequence NAME
    4 POS Integer 1- based leftmost mapping POSition
    5 MAPQ Integer MAPping Quality
    6 CIGAR String CIGAR String
    7 RNEXT String Ref. name of the mate/next read
    8 PNEXT Integer Position of the mate/next read
    9 TLEN Integer observed Template LENgth
    10 SEQ String segment SEQuence
    11 QUAL String ASCII of Phred-scaled base QUALity+33

question Questions

  1. Which information do you find in a SAM/BAM file?
  2. What is the additional information compared to a FASTQ file?

solution Solution

  1. Sequences and quality information, like a FASTQ
  2. Mapping information, Location of the read on the chromosome, Mapping quality, etc

Visualization using a Genome Browser

hands_on Hands-on: Visualization of the reads in IGV

  1. Install IGV (if not already installed)
  2. Launch IGV on your computer
  3. Expand the param-file output of Bowtie2 tool
  4. Click on the local in display with IGV to load the reads into the IGV browser
  5. Zoom on the chr2:91,053,413-91,055,345

The reads have a direction: they are mapped to the forward or reverse strand, respectively. When hovering over a read, extra information is displayed

question Questions

  1. What could it mean if a bar in the coverage view is colored?
  2. What could be the reason why a read is white instead of grey?

solution Solution

  1. If a nucleotide differs from the reference sequence in more than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base.
  2. They have a mapping quality equal to zero. Interpretation of this mapping quality depends on the mapping aligner as some commonly used aligners use this convention to mark a read with multiple alignments. In such a case, the read also maps to another location with equally good placement. It is also possible the read could not be uniquely placed but the other placements do not necessarily give equally good quality hits.

comment Tips for IGV

  1. Because the number of reads over a region can be quite large, the IGV browser by default only allows to see the reads that fall into a small window. This behaviour can be changed in the IGV from view > Preferences > Alignments.
  2. If the genome of your interest is not there check if it is available via More…. If this is not the case, you can add it manually via the menu Genomes -> Load Genome from…

    Select genome in IGV

A general description of the user interface of the IGV browser is available here: IGV Browser description

Step 3: ChIP-seq Quality Control

We already checked the quality of the raw sequencing reads in the first step. Now we would like to test the quality of the ChIP-seq preparation, to know if our ChIP-seq samples are more enriched than the control (input) samples.

Correlation between samples

To assess the similarity between the replicates of the ChIP and the input, respectively, it is a common technique to calculate the correlation of read counts on different regions for all different samples. We expect that the replicates of the ChIP-seq experiments should be clustered more closely to each other than the replicates of the input sample. That is, because the input samples should not have enriched regions included - remember the immuno-precipitation step was skiped during the sample preparation.

To compute the correlation between the samples we are going to to use the QC modules of deepTools, a software package for the QC, processing and analysis of NGS data. Before computing the correlation a time consuming step is required, which is to compute the read coverage (number of unique reads mapped at a given nucleotide) over a large number of regions from each of the inputed BAM files. For this we will use the tool multiBamSummary tool. Then, we use plotCorrelation tool from deepTools to compute and visualize the sample correlation. This is a fast process that allows to quickly try different color combinations and outputs.

Since in this tutorial we are interested in assessing H3K4me3, H3K27me3 and CTCF ChIP samples, the previous steps (quality control and mapping) needs to be run on all the replicates of ChIP samples as well as the input samples. To save time, we have already done that and you can now work directly on the BAM files of the provided 8 samples. For simplicity, the files include only the ChrX.

hands_on Hands-on: Correlation between samples

  1. Create a new history
  2. Import the 8 BAM files from Zenodo or from the data library into the history

    https://zenodo.org/record/1324070/files/wt_CTCF_rep1.bam
    https://zenodo.org/record/1324070/files/wt_CTCF_rep2.bam
    https://zenodo.org/record/1324070/files/wt_H3K4me3_rep1.bam
    https://zenodo.org/record/1324070/files/wt_H3K4me3_rep2.bam
    https://zenodo.org/record/1324070/files/wt_H3K27me3_rep1.bam
    https://zenodo.org/record/1324070/files/wt_H3K27me3_rep2.bam
    https://zenodo.org/record/1324070/files/wt_input_rep1.bam
    https://zenodo.org/record/1324070/files/wt_input_rep2.bam
    
  3. Rename the files
  4. multiBamSummary tool with the following parameters
    • “Sample order matters”: No
      • param-files “BAM/CRAM file”: the 8 imported BAM files
    • “Choose computation mode”: Bins
      • “Bin size in bp”: 1000

        This corresponds to the length of the fragments that were sequenced; it is not the read length!

      • “Distance between bins”: 500

        It reduces the computation time for the tutorial

    • “Region of the genome to limit the operation to”: chrX

    Using these parameters, the tool will take bins of 1000 bp separated by 500 bp on the chromosome X. For each bin the overlapping reads in each sample will be computed and stored into a matrix.

  5. plotCorrelation tool with the following parameters
    • param-files “Matrix file from the multiBamSummary tool”: correlation matrix(output of multiBamSummary tool)
    • “Correlation method”: Pearson

    Feel free to try different parameters for the configuration of the plot (colors, title, etc)

question Questions

Output for plotCorrelation with the correlation scores between the 8 samples
Figure 10: Correlation scores between the 8 samples

How are your samples clustered? Does that correspond to your expectations?

solution Solution

As one could expect, the input replicates cluster together and the ChIP replicates cluster together. It confirms that the immuno-precipitation step worked on our ChIP replicates.

Moreover, for each sample, there is a high correlation between the two replicates which confirms the validity of the experiments.

IP strength estimation

To evaluate the quality of the immuno-precipitation(IP) step, we can compute the IP strength. It determines how well the signal in the ChIP-seq sample can be differentiated from the background distribution of reads in the control sample (‘input’). After all, around 90% of all DNA fragments in a ChIP experiment will represent the genomic background.

To do that we take the data from the rep1 of the wt_H3K4me3 ChIP-seq sample and compare it with its corresponding input sample, using plotFingerprint tool of deepTools.

Similar to multiBamSummary tool, plotFingerprint tool randomly samples genome regions of a specified length (bins) and sums the per-base coverage in the indexed BAM files that overlap with those regions. These coverage values are then sorted according to their rank and the cumulative sum of read counts is plotted.

hands_on Hands-on: IP strength estimation

  1. plotFingerprint tool with the following parameters
    • “Sample order matters”: No
      • param-files “BAM/CRAM file”: wt_input_rep1 and wt_H3K4me3_rep1
    • “Region of the genome to limit the operation to”: chrX
    • “Show advanced options”: Yes
      • “Number of samples”: 10000

The plotFingerprint tool generates a fingerprint plot. You need to intepret it to know the IP strength.

An ideal ‘input’ with perfect uniform distribution of reads along the genome (i.e. without enrichments in open chromatin) and infinite sequencing coverage should generate a straight diagonal line. A very specific and strong ChIP enrichment will be indicated by a prominent and steep rise of the cumulative sum towards the highest rank. This means that a big chunk of reads from the ChIP sample is located in few bins which corresponds to high, narrow enrichments typically seen for transcription factors.

A guide to interpret a fingerprint plot
Figure 11: How to interpret a fingerprint plot? Image extracted from the deepTools documentation

question Questions

Output for plotFingerprint with the fingerprint plot to estimate the IP strength
Figure 12: Fingerprint plot for the first replicates to estimate the IP strength

What do you think about the quality of the IP for this experiment?

solution Solution

There is clear distinction between H3K4me3 and the input.

A small percentage of the genome contain a very large fraction of the reads (>70%, point of change in the blue curve)

The curves start to rise around 0.25. It means that almost 25% of the chromosome X is not sequenced at all.

hands_on (Optional) Hands-on: IP strength estimation (other samples)

Run the same analysis on the other ChIP-seq data along with their corresponding input and compare the output

Step 4: Normalization

One of the goals in ChIP-seq data analysis is finding regions on the genome which are enriched for the ChIP data of interest (regions with significantly higher read coverage for the ChIP data comparing to its corresponding input). In the following exercise we would like to know where the H3K4me3 sites are. For this we need to extract which parts of the genome have been enriched (i.e. more reads mapped) within the samples that underwent immunoprecipitation. However, to reach a reliable comparison the data needs to be normalized to remove any technical bias. For the normalization we have two steps:

  1. Normalization by sequencing depth
  2. Normalization by input file

To learn how to do the normalization, we will take the wt_H3K4me3_rep1 sample as ChIP data and wt_input_rep1 as input.

Generation of coverage files normalized by sequencing depth

hands_on Hands-on: Estimation of the sequencing depth

  1. IdxStats tool with the following parameters
    • param-files “BAM file”: wt_H3K4me3_rep1.bam and wt_input_rep1.bam

question Questions

  1. What is the output of IdxStats tool?
  2. How many reads has been mapped on chrX for the input and for the ChIP-seq samples?
  3. Why are the number of reads different? And what could be the impact of this difference?

solution Solution

  1. This tool generates a table with 4 columns: reference sequence identifier, reference sequence length, number of mapped reads and number of placed but unmapped reads. Here it estimates how many reads mapped to which chromosome. Furthermore, it tells the chromosome lengths and naming convention (with or without ‘chr’ in the beginning)
  2. 1,204,821 for ChIP-seq samples and 1,893,595 for the input
  3. The number of reads can be different because of different sequencing depth. It can bias the interpretation of the number of reads mapped to a specific genome region and the identification of the H3K4me3 sites. Specially here, as the number of reads for the input is higher than the ChIP data less regions could be identified having a significantly higher read coverage for the ChIP data comparing to the corresponding input.

The different samples have usually a different sequencing depth, i.e. a different number of reads. These differences can bias the interpretation of the number of reads mapped to a specific genome region. We first need to make the samples comparable by normalizing the coverage by the sequencing depth.

We are using bamCoverage tool. Given a BAM file, this tool generates coverages by first calculating all the number of reads (either extended to match the fragment length or not) that overlap each bin in the genome and then normalizing with various options. It produces a coverage file where for each bin the number of overlapping reads (possibly normalized) is noted.

hands_on Hands-on: Coverage file normalization

  1. bamCoverage tool with the following parameters
    • param-files “BAM file”: wt_H3K4me3_rep1.bam and wt_input_rep1.bam
    • “Bin size in bases”: 25
    • “Scaling/Normalization method”: Normalize coverage to 1x
      • “Effective genome size”: GRCm38/mm10 (2308125349)
    • “Coverage file format”: bedgraph
    • “Region of the genome to limit the operation to”: chrX

    question Questions

    1. What is a bedgraph file?
    2. Which regions have the highest coverage in ChIP data and in the input?

    solution Solution

    1. It is a tabular file with 4 columns: chrom, chromStart, chromEnd and a data value (coverage)
    2. We can run Sort tool on the 4th column in descending order to get the regions with the highest (normalized) coverage. For wt_H3K4me3_rep1, the regions between 152,233,600 and 152,233,625 are the most covered. For wt_input_rep1, between 143,483,000 and 143,483,100.
  2. bamCoverage tool with the same parameters but
    • “Coverage file format”: bigWig

    question Questions

    What is a bigWig file?

    solution Solution

    A bigWig file is a compressed bedgraph file. Similar in relation as BAM to SAM, but this time just for coverage data. This means bigWig and bedgraph files are much smaller than BAM or SAM files (no sequence or quality information).

  3. IGV tool to inspect both signal coverages (input and ChIP samples) in IGV

question Questions

If you zoom to chrX:151,385,260-152,426,526, what do you observe?

Output of bamCoverage
Figure 13: bamCoverage for wt_H3K4me3_rep1 and wt_input_rep1 on chrX:151,385,260-152,426,526

solution Solution

The track with the coverage for the input (wt_input_rep1) seems quite homogeneous. On the other hand, for wt_H3K4me3_rep1, we can observe some clear peaks at the beginning of the genes.

Generation of input-normalized coverage files

To extract only the information induced by the immunoprecipitation, we normalize the coverage file for the sample that underwent immunoprecipitation by the coverage file for the input sample. Here we use the tool bamCompare tool which compare 2 BAM files while caring for sequencing depth normalization.

hands_on Hands-on: Generation of input-normalized coverage files

  1. bamCompare tool with the following parameters
    • param-file “First BAM file (e.g. treated sample)”: wt_H3K4me3_rep1.bam
    • param-file “Second BAM file (e.g. control sample)”: wt_input_rep1.bam
    • “Bin size in bases”: 50
    • “How to compare the two files”: Compute log2 of the number of reads ratio
    • “Coverage file format”: bedgraph
    • “Region of the genome to limit the operation to”: chrX

    question Questions

    1. What does a positive value or a negative value mean in the 4th column?
    2. Which regions have the highest coverage in the ChIP data? and the lowest?

    solution Solution

    1. The 4th column contains the log2 of the number of reads ratio between the ChIP-seq sample and the input sample. A positive value means that the coverage on this portion of genome is higher in the ChIP-seq sample than in the input sample
    2. The highest: 152,233,800-152,233,850 (consistent with the most covered regions in wt_H3K4me3_rep1 given by bamCoverage tool). The lowest: 169,916,600-169,916,650
  2. bamCompare tool with the same parameters but:
    • “Coverage file format”: bigWig
  3. IGV tool to inspect the log2 ratio

question Questions

How could you interpret the new track if you zoom to chrX:151,385,260-152,426,526?

Output of bamCoverage
Figure 14: bamCoverage for wt_H3K4me3_rep1 and wt_input_rep1 on chrX:151,385,260-152,426,526

solution Solution

The new track is the difference between the first track (wt_H3K4me3_rep1) and the second track (wt_input_rep1)

Step 5: Detecting enriched regions (peak calling)

We could see in the ChIP data some enriched regions (peaks). We now would like to call these regions to obtain their coordinates, using MACS2 callpeak tool

hands_on Hands-on: Peak calling

  1. MACS2 callpeak tool with the following parameters
    • “Are you pooling Treatment Files?”: No
      • param-file “ChIP-Seq Treatment File”: wt_H3K4me3_rep1.bam
    • “Do you have a Control File?”: Yes
      • “Are you pooling Treatment Files?”: No
        • param-file “ChIP-Seq Treatment File”: wt_input_rep1.bam
    • “Format of Input Files”: Paired-end BAM
    • “Effective genome size”: M.musculus (1.87e9)
    • “Outputs”: Summary page (html)

    comment Comments

    The advanced options may be adjusted, depending of the samples. If your ChIP-seq experiment targets regions of broad enrichment, e.g. non-punctuate histone modifications, select calling of broad regions. If your sample has a low duplication rate (e.g. below 10%), you might keep all duplicate reads (tags). Otherwise, you might use the ‘auto’ option to estimate the maximal allowed number of duplicated reads per genomic location.

  2. Inspect the param-file (narrow Peaks) file (output of MACS2 callpeak tool)

    question Questions

    Which type of files were generated? What do they include?

    solution Solution

    MACS2 callpeak tool has generated a bed file with the coordinates of the identified peaks: chromosome, start, stop, name, integer score, strand, fold-change, -log10pvalue, -log10qvalue and relative summit position to peak start, as well as a html report which contains links to additional bed and xls files.

  3. IGV tool to inspect with the signal coverage and log2 ratio tracks

question Questions

  1. How many peaks have been identified in chrX:151,385,260-152,426,526 based on IGV?

    Output of MACS2
    Figure 15: Peaks for wt_H3K4me3_rep1 and wt_input_rep1 on chrX:151,385,260-152,426,526
  2. What are the fold change of the peaks identified in chrX:151,385,260-152,426,526? Hint: using the BED file
  3. How many peaks have been identified on the full chromosome X? How many peaks have a fold change > 50?

solution Solution

  1. We can see 11 peaks (track below the genes).
  2. Using Filter tool with c2>151385260 and c3<152426526, we found that the 11 peaks with fold changes between 3.81927 and 162.06572
  3. On the 656 peaks on the full chromosome (number of lines of the original BED file) there are 252 peaks with FC>50 (using Filter tool with c7>50)

The called peak regions can be filtered by, e.g. fold change, FDR and region length for further downstream analysis.

Step 6: Plot the signal between samples

So far, we have normalized the data and identified peaks. Now, we would like to visualize scores associated with certain genomic regions, for example ChIP enrichment values around the TSS of genes. Moreover, we would like to compare the enrichment of several ChIP samples (e.g. CTCF and H3K4me3 )on the regions of interest.

Since we already generated the required files for the H3K4me3 sample, let’s make them only for the CTCF sample:

hands_on Hands-on: Prepare the peaks and data for CTCF

  1. bamCompare tool with the following parameters
    • param-file “First BAM file (e.g. treated sample)”: wt_CTCF_rep1.bam
    • param-file “Second BAM file (e.g. control sample)”: wt_input_rep1.bam
    • “Bin size in bases”: 50
    • “How to compare the two files”: Compute log2 of the number of reads ratio
    • “Coverage file format”: bigwig
    • “Region of the genome to limit the operation to”: chrX
  2. Rename the output of bamCompare tool with the name of the sample
  3. MACS2 callpeak tool with the following parameters
    • “Are you pooling Treatment Files?”: No
      • param-file “ChIP-Seq Treatment File”: wt_CTCF_rep1.bam
    • “Do you have a Control File?”: Yes
      • “Are you pooling Treatment Files?”: No
        • param-file “ChIP-Seq Treatment File”: wt_input_rep1.bam
    • “Format of Input Files”: Paired-end BAM
    • “Effective genome size”: M.musculus (1.87e9)

We can now concatenate the MACS2 outputs with the location of the peaks (concatenate the files and merge the overlapping regions) to obtain one BED file corresponding to the coordinates of the interesting regions to plot.

hands_on Hands-on: Prepare the peak coordinates

  1. Concatenate two datasets into one dataset tool with the following parameters
    • param-file “Concatenate”: output of MACS2 callpeak tool for wt_CTCF_rep1
    • param-file “with”: output of MACS2 callpeak tool for wt_H3K4me3_rep1
  2. SortBED tool with the following parameters
    • param-file “Sort the following bed,bedgraph,gff,vcf file”: output of Concatenate tool
  3. MergeBED tool with the following parameters
    • param-file “Sort the following bed,bedgraph,gff,vcf file”: output of SortBED tool

To plot the the peaks score on the region generated above (MergeBED output) two tools from the deepTools package are used:

  • computeMatrix tool: it computes the signal on given regions, using the bigwig coverage files from different samples.
  • plotHeatmap tool: it plots heatMap of the signals using the computeMatrix tool output.

Optionally, we can also use plotProfile tool to create a profile plot from computeMatrix tool output.

hands_on Hands-on: Plot the heatmap

  1. computeMatrix tool with the following parameters:
    • “Select regions”:
      • param-file “Regions to plot”: output of MergeBED tool
    • “Sample order matters”: No
      • param-files “Score file”: the 2 bigwig files generated by bamCompare tool and renamed
    • “computeMatrix has two main output options”: reference-point
      • “The reference point for the plotting”: center of region
      • “Distance upstream of the start site of the regions defined in the region file”: 3000
      • “Distance downstream of the end site of the given regions”: 3000
  2. plotHeatmap tool with the following parameters
    • param-file “Matrix file from the computeMatrix tool”: Matrix (output of computeMatrix tool)
    • “Show advanced options”: yes
      • “Reference point label”: select the right label
      • “Did you compute the matrix with more than one groups of regions?”: No, I used only one group
        • “Clustering algorithm”: Kmeans clustering
        • “Number of clusters to compute”: 2

It should generate an heatmap similar to:

Output of plotHeatmap for 2 samples
Figure 16: Scores around the peaks for wt_H3K4me3_rep1 and wt_input_rep1

When we look at this graph, it seems that less but larger peaks are found for H3K4me3_rep1 and that only few peaks are shared.

question Questions

  1. How many peaks have been found for CTCF_rep1 and for H3K4me3_rep1?
  2. What are the mean width of the peaks for CTCF_rep1 and for H3K4me3_rep1?
  3. How many peaks are specific to CTCF_rep1 or H3K4me3_rep1?

solution Solution

  1. 656 peaks for H3K4me3_rep1 and 2,688 for CTCF_rep1 (number of lines in the MACS2 callpeak tool BED file)
  2. 1630.77 bp for H3K4me3_rep1 and 404.55 for CTCF_rep1 (Compute tool with c3-c2 and then Datamash tool with Mean on Column:11)
  3. 443 peaks (over 656) are specific to H3K4me3_rep1 and 2,464 (over 2,688) to CTCF_rep1 (Intersect intervals tool). Around 220 peaks are then overlapping.

So far, we have only analyzed 2 samples, but we can do the same for all the 6 samples:

hands_on (Optional) Hands-on: Plot the heatmap for all the samples

  1. bamCompare tool for each combination input - ChIP data:
    1. wt_CTCF_rep1 - wt_input_rep1 (already done)
    2. wt_H3K4me3_rep1 - wt_input_rep1 (already done)
    3. wt_H3K27me3_rep1 - wt_input_rep1
    4. wt_CTCF_rep2 - wt_input_rep2
    5. wt_H3K4me3_rep2 - wt_input_rep2
    6. wt_H3K27me3_rep2 - wt_input_rep2
  2. Rename the outputs of bamCompare tool with the name of the ChIP data
  3. MACS2 callpeak tool for each combination input - ChIP data
  4. Concatenate datasets tail-to-head tool with the following parameters
    • param-file “Concatenate Dataset”: one output of MACS2 callpeak tool
    • Click “Insert Dataset” and param-file “Select” one other output of MACS2 callpeak tool
    • Redo for the 6 outputs of MACS2 callpeak tool
  5. SortBED tool with the following parameters
    • param-file “Sort the following bed,bedgraph,gff,vcf file”: output of Concatenate tool
  6. MergeBED tool with the following parameters
    • param-file “Sort the following bed,bedgraph,gff,vcf file”: output of SortBED tool
  7. computeMatrix tool with the same parameters but:
    • “Select regions”:
      • param-file “Regions to plot”: output of MergeBED tool
    • “Sample order matters”: No
      • param-files “Score file”: the 6 bigwig files generated by bamCompare tool and renamed
    • “computeMatrix has two main output options”: reference-point
      • “The reference point for the plotting”: center of region
      • “Distance upstream of the start site of the regions defined in the region file”: 3000
      • “Distance downstream of the end site of the given regions”: 3000
  8. plotHeatmap tool with the following parameters
    • param-file “Matrix file from the computeMatrix tool”: Matrix (output of computeMatrix tool)
    • “Show advanced options”: yes
      • “Reference point label”: select the right label
      • “Did you compute the matrix with more than one groups of regions?”: No, I used only one group
        • “Clustering algorithm”: Kmeans clustering
        • “Number of clusters to compute”: 2

question Questions

  1. How many peaks are found for the different samples?
  2. How are the peaks?

    Output of plotHeatmap for all samples
    Figure 17: Scores around the peaks for all samples
  3. How could be interpreted the peaks and read coverage in the chrX:151,385,260-152,426,526 region?

    Coverage and peaks for the replicate 1
    Figure 18: Coverage and peaks for the replicate 1

solution Solution

  1. Found peaks (number of lines in MACS2 callpeak tool outputs):

    Target Rep 1 Rep 2
    CTCF 2,688 2,062
    H3K4me3 656 717
    H3K27me3 221 76

    The tendencies are similar for both replicates: more peaks for CTCF, less for H3K4me3 and only few for H3K27me3.

  2. As observed with the 2 samples, the peaks for H3K4me3 are wider than for CTCF. We also observe that the peaks found with one replicate are found with the other replicate.
  3. The H3K4me3 sample has clear and large regions in which the read coverage are enriched. H3K4me3 is one of the least abundant histone modifications. It is highly enriched at active promoters near transcription start sites (TSS) and positively correlated with transcription.

    For H3K27me3, the coverage is more homogeneous. A gene is a broad domain of H3K27me3 enrichment across its body of genes corresponds to a gene with a transcription inhibited by H3K27me3. We can also identified some “bivalent” genes: gene with a peak around the TSS for H3K27me3(e.g. region_208 for gene Gpr173) but also H3K4me3. We also observe some H3K27me3-depleted regions sharply demarcated, with boundaries coinciding with gene borders (e.g. Kdm5c). This is a chromatin signature reminiscent of genes that escape XCI.

    To reproduce, run bamCoverage tool, IGV tool and MACS2 callpeak tool outputs.

Conclusion

Along this tutorial, we learn how to extract peaks and coverage information from raw data of ChIP experiments:

Workflow to extract peaks and coverage information from raw data of ChIP experiments
Figure 19: Workflow to extract peaks and coverage information from raw data of ChIP experiments

This information can be then related to the biology to answer the original question. We tried then to relate the observed differences of peak and read coverage between H3K27me3 and H3K4me3 to the known biology. We could go even further in the analysis to reproduce the results of the original paper (e.g. by looking at the bivalent genes, identifying the differences between Xa and Xi).

keypoints Key points

  • ChIP-seq data requires multiple methods of quality assessment to ensure that the data is of high quality.
  • Multiple normalization methods exists depending on the availability of input data.
  • Heatmaps containing all genes of an organism can be easily plotted given a BED file and a coverage file.

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

congratulations Congratulations on successfully completing this tutorial!