scRNA-seq_preprocessing_10X_cellPlex

This workflow processes the CMO fastqs with CITE-seq-Count and includes the translation step required for CellPlex processing. In parallel, it processes the Gene Expression fastqs with STARsolo, filters cells with DropletUtils, and reformats all outputs so that they can be read directly by the Seurat function 'Read10X'.

Single-cell RNA-seq fastq to matrix for 10X data

These workflows are inspired by the training material, except that the output is in a 'bundle' format: three files (one matrix, one with genes, one with barcodes), similar to the cellranger output format.

Both are designed for fastqs from 10X v3 libraries. One is for regular 10X libraries (one library per sample), while the other is for CellPlex 10X libraries, which allow samples to be multiplexed using CMOs (see this blog article).

Input datasets

  • Specific for each experiment:

    • For both workflows: you need a list of pairs of fastqs with gene expression.
    • For CellPlex: you also need a list of pairs of fastqs with CMO.
    • For CellPlex: you need a list of CSV files which describe the samples and the CMOs used:
      • The first column is the sequence and the second column is the name.
      • /!\ The order of samples needs to be exactly the same between the collection of CMO fastqs and the collection of CSV files.
  • Common for all experiments:

Input values

  • reference genome: this genome needs to be available for STAR
  • Barcode Size is same size of the Read: set to true if the length of your GEX R1 exactly matches the size of cell barcode + UMI. If your R1 contains trailing A's, set to false.
  • number of cells: if this value is too large, no cell barcode correction will be performed when demultiplexing CMOs.
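
Two of the requirements above are easy to get wrong: the value of the Barcode Size flag and the ordering of the CMO fastq and CSV collections. The following is a minimal pre-flight sketch, not part of the workflow itself. It assumes 10X v3 chemistry (16 nt cell barcode + 12 nt UMI), CSV files without a header row, and that both collections use the same sample identifiers; all function names are hypothetical.

```python
import csv
import gzip
from itertools import islice

# Assumption: 10X v3 chemistry, i.e. 16 nt cell barcode + 12 nt UMI.
CELL_BARCODE_LEN = 16
UMI_LEN = 12


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def barcode_size_is_read_size(r1_fastq, n_reads=1000):
    """Return True if the first n_reads of R1 are all exactly barcode + UMI long."""
    expected = CELL_BARCODE_LEN + UMI_LEN
    with open_text(r1_fastq) as handle:
        # In a 4-line FASTQ record the sequence is the second line (index 1).
        sequences = islice(handle, 1, 4 * n_reads, 4)
        lengths = {len(seq.rstrip("\n")) for seq in sequences}
    return lengths == {expected}


def read_cmo_sheet(csv_path):
    """Read a CMO sheet: first column is the sequence, second is the name."""
    with open(csv_path, newline="") as handle:
        return [(row[0].strip(), row[1].strip()) for row in csv.reader(handle) if row]


def check_same_order(cmo_fastq_ids, csv_ids):
    """Raise ValueError unless both collections list the same samples in the same order."""
    if list(cmo_fastq_ids) != list(csv_ids):
        raise ValueError(
            f"Order mismatch: {list(cmo_fastq_ids)} vs {list(csv_ids)}"
        )
```

If `barcode_size_is_read_size` returns False for your GEX R1 (for example because of trailing A's), the flag would be set to false.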

Processing

  • Gene expression processing:
    • Reads are aligned to the genome and assigned to genes, cell barcodes and UMIs with STARsolo.
    • MultiQC reports the mapping rate and the number of reads attributed to genes.
    • The output of STARsolo is filtered with DropletUtils to remove cellular barcodes which are probably empty.
    • The output of DropletUtils is reorganized as follows:
Main Collection:
    - Sample 1:
        - matrix.mtx
        - barcodes.tsv
        - genes.tsv
    - Sample 2:
        - matrix.mtx
        - barcodes.tsv
        - genes.tsv
...
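
The layout above can be checked after download with a short script. This is only a sketch: it assumes the collection has been exported locally as one directory per sample, and the function name is hypothetical.

```python
from pathlib import Path

# The three files that make up one 'bundle', as listed in the layout above.
BUNDLE_FILES = ("matrix.mtx", "barcodes.tsv", "genes.tsv")


def missing_bundle_files(collection_dir):
    """Map each sample directory to the bundle files it lacks.

    Returns an empty dict when every sample directory is complete.
    """
    missing = {}
    for sample_dir in sorted(Path(collection_dir).iterdir()):
        if not sample_dir.is_dir():
            continue  # ignore stray files next to the sample directories
        absent = [name for name in BUNDLE_FILES if not (sample_dir / name).is_file()]
        if absent:
            missing[sample_dir.name] = absent
    return missing
```

A sample directory that passes this check contains the three files that 'Read10X' from Seurat looks for when given that directory.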

For the CellPlex workflow:

  • CMO processing:
    • CITE-seq-Count is used to assign reads and generate a matrix where the 'genes' are the CMOs and 'unmapped'.
    • Cellular barcodes are translated to match the cellular barcodes of Gene expression (see this article).
    • The output is reorganized so that the UMI matrices have the same structure as the gene expression matrices.
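
The translation step can be pictured as a simple table lookup. The sketch below is illustrative only and is not the workflow's implementation: it assumes a two-column table pairing each gene expression barcode with its CMO-library counterpart, and both the function names and the toy sequences are made up.

```python
def build_cmo_to_gex(pairs):
    """Build a lookup from CMO-library barcode to gene expression barcode.

    `pairs` is an iterable of (gex_barcode, cmo_barcode) tuples.
    """
    return {cmo: gex for gex, cmo in pairs}


def translate_barcodes(cmo_barcodes, cmo_to_gex):
    """Translate CMO barcodes; barcodes absent from the table are returned separately."""
    translated, unknown = [], []
    for barcode in cmo_barcodes:
        if barcode in cmo_to_gex:
            translated.append(cmo_to_gex[barcode])
        else:
            unknown.append(barcode)
    return translated, unknown


# Made-up 16 nt sequences for illustration only, not real 10X barcodes.
toy_pairs = [
    ("AAAAAAAAAAAAAAAA", "CCCCCCCCCCCCCCCC"),
    ("GGGGGGGGGGGGGGGG", "TTTTTTTTTTTTTTTT"),
]
cmo_to_gex = build_cmo_to_gex(toy_pairs)
translated, unknown = translate_barcodes(
    ["TTTTTTTTTTTTTTTT", "ACACACACACACACAC"], cmo_to_gex
)
```

After translation, the CMO matrix and the gene expression matrix can be joined on the same cell barcodes.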

Test data

The test dataset has been made as small as possible so that the workflow passes on CI.

  • The CMO reads come from Zenodo and have been subsampled to a fraction of 0.1 with seqtk.
  • The GEX reads come from SRR13948489 but have been subset to the cells selected in the Zenodo dataset above.