Gallantries Grant - Intellectual Output 1 - Introduction to data analysis and -management, statistics, and coding

PURL: https://gxy.io/GTN:P00012
Comment: What is a Learning Pathway?
We recommend you follow the tutorials in the order presented on this page. They have been selected to fit together and build up your knowledge step by step. If a lesson has both slides and a tutorial, we recommend you start with the slides, then proceed with the tutorial.

This Learning Pathway collects the results of Intellectual Output 1 in the Gallantries Project.

In total, this module will form a course of around 10 days (± 2 days, depending on the exact analysis stories we identify). Some of these introductory submodules (~15%) will build on existing training material available in the GTN or Carpentries.

Success Criteria:

Year 1: Coding in Python

Intro to Coding in Python. Covers variables, functions, and data structures [SC1.1,2]

Time estimation: 8 hours

Learning Objectives
  • Learn the fundamentals of programming in Python
  • Use the scientific libraries pandas and numpy to explore tabular datasets
  • Calculate basic statistics about datasets and columns
Lesson Slides Hands-on Recordings
Introduction to Python
Advanced Python
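
As a rough illustration of the last objective above (calculating basic statistics about a column), the sketch below uses only the Python standard library so that it runs without extra installs; the lessons themselves work with pandas and numpy. The dataset and the helper name `column_stats` are hypothetical and invented for this example.

```python
import csv
import io
import statistics

# Hypothetical tabular data, invented purely for illustration.
RAW = """sample,reads,gc_percent
A,1200,41.5
B,900,39.0
C,1500,44.5
"""


def column_stats(text, column):
    """Return (mean, median, min, max) of one numeric column of CSV text."""
    rows = csv.DictReader(io.StringIO(text))
    values = [float(row[column]) for row in rows]
    return (
        statistics.mean(values),
        statistics.median(values),
        min(values),
        max(values),
    )


reads_stats = column_stats(RAW, "reads")
print(reads_stats)
```

With pandas, the same summary would typically come from loading the table into a DataFrame and calling its built-in summary methods; the standard-library version is shown here only to keep the sketch self-contained.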

Year 1: Coding in Python Modular (Avans)

Intro to Coding in Python. Covers variables, functions, and data structures [SC1.1,2]

In collaboration with Avans Hogeschool, an associated partner, we produced the following lessons.

Time estimation: 9 hours 40 minutes

Learning Objectives
  • Understand the fundamentals of object assignment and math in Python, and write simple statements and execute calculations, in order to summarize the results of calculations and classify valid and invalid statements.
  • Translate some known math functions (e.g. euclidean distance, root algorithm) into Python to transfer concepts from mathematics lessons directly into code.
  • Understand the structure of a "function" in order to be able to construct their own functions and predict which functions will not work.
  • Explain key differences between integers and floating point numbers.
  • Explain key differences between numbers and character strings.
  • Use built-in functions to convert between integers, floating point numbers, and strings.
  • Explain why programs need collections of values.
  • Write programs that create flat lists, index them, slice them, and modify them through assignment and method calls.
  • Write conditional statements including `if`, `elif`, and `else` branches.
  • Correctly evaluate expressions containing `and` and `or`.
  • Explain what for loops are normally used for.
  • Trace the execution of a simple (unnested) loop and correctly state the values of variables in each iteration.
  • Write for loops that use the Accumulator pattern to aggregate values.
  • Catch an exception
  • Raise your own exception
  • Read data from a file
  • Write new data to a file
  • Use `with` to ensure the file is closed properly.
  • Use the CSV module to parse comma and tab separated datasets.
  • Recap all previous modules.
  • Use exercises to ensure that all previous knowledge is sufficiently covered.
  • Use glob to collect a list of files
  • Learn about the potential pitfalls of glob
  • Learn how sys.argv works
  • Write a simple command line program that sums some numbers
  • Use argparse to make it nicer.
  • Run a command in a subprocess.
  • Learn about `check_call` and `check_output` and when to use each of these.
  • Read its output.
  • Set up a Python virtual environment for our software project using `venv` and `pip`.
  • Run our software from the command line.
  • Set up a Conda environment for our software project using `conda`.
  • Run our software from the command line.
Lesson Slides Hands-on Recordings
Python - Math
Python - Functions
Python - Basic Types & Type Conversion
Python - Lists & Strings & Dictionaries
Python - Flow Control
Python - Loops
Python - Try & Except
Python - Files & CSV
Python - Introductory Graduation
Python - Globbing
Python - Argparse
Python - Subprocess
Virtual Environments For Software Development
Conda Environments For Software Development
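
Several of the objectives above (the accumulator pattern, catching and raising exceptions, and a small command line program that sums numbers with argparse) can be combined in one short sketch. This is a minimal illustration and not the code used in the lessons; the function names `total`, `parse_number`, `build_parser`, and `main` are assumptions made for this example.

```python
import argparse


def total(numbers):
    """Sum numbers with the accumulator pattern (an explicit loop, not sum())."""
    running = 0.0
    for n in numbers:
        running += n
    return running


def parse_number(text):
    """Convert a command line string to float, reporting bad input clearly."""
    try:
        return float(text)
    except ValueError:
        # Re-raise as the exception type argparse turns into a usage error.
        raise argparse.ArgumentTypeError(f"not a number: {text!r}")


def build_parser():
    parser = argparse.ArgumentParser(
        description="Sum numbers given on the command line."
    )
    parser.add_argument(
        "numbers", nargs="+", type=parse_number, help="values to add up"
    )
    return parser


def main(argv=None):
    # argv=None makes argparse read sys.argv[1:], as a real script would.
    args = build_parser().parse_args(argv)
    result = total(args.numbers)
    print(result)
    return result


# Demonstration with an explicit argument list instead of sys.argv.
demo_result = main(["1", "2.5", "3"])
```

Passing the argument list explicitly keeps the example testable; in a script saved to a file, calling `main()` with no arguments would read the values typed after the script name.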

Year 1: Coding in R

Intro to Coding in R. Covers variables, functions, and data structures [SC1.1,2]

Time estimation: 6 hours

Learning Objectives
  • Know the advantages of analyzing data using R within Galaxy.
  • Compose an R script file containing comments, commands, objects, and functions.
  • Be able to work with objects (i.e. applying mathematical and logical operators, subsetting, retrieving values, etc).
  • Be able to load and explore the shape and contents of a tabular dataset using base R functions.
  • Understand factors and how they can be used to store and work with categorical data.
  • Apply common `dplyr` functions to manipulate data in R.
  • Employ the ‘pipe’ operator to link together a sequence of functions.
  • Read data with the built-in `read.csv`
  • Read data with the tidyverse's `read_csv` (from the readr package)
  • Use dplyr and tidyverse functions to clean up data.
Lesson Slides Hands-on Recordings
R basics in Galaxy
Advanced R in Galaxy
R
dplyr & tidyverse for data processing

Year 1: Intro to Command Line

This submodule will cover the basics of the shell (variables, for loops), needed for data handling [SC1.1,2,6]

Time estimation: 8 hours

Learning Objectives
  • Explain how the shell relates to the keyboard, the screen, the operating system, and users' programs.
  • Explain when and why command-line interfaces should be used instead of graphical interfaces.
  • Explain the similarities and differences between a file and a directory.
  • Translate an absolute path into a relative path and vice versa.
  • Construct absolute and relative paths that identify specific files and directories.
  • Use options and arguments to change the behaviour of a shell command.
  • Demonstrate the use of tab completion and explain its advantages.
  • Create a directory hierarchy that matches a given diagram.
  • Create files in that hierarchy using an editor or by copying and renaming existing files.
  • Delete, copy and move specified files and/or directories.
  • Redirect a command's output to a file.
  • Process a file instead of keyboard input using redirection.
  • Construct command pipelines with two or more stages.
  • Explain what usually happens if a program or pipeline isn't given any input to process.
  • Explain Unix's 'small pieces, loosely joined' philosophy.
  • Write a loop that applies one or more commands separately to each file in a set of files.
  • Trace the values taken on by a loop variable during execution of the loop.
  • Explain the difference between a variable's name and its value.
  • Explain why spaces and some punctuation characters shouldn't be used in file names.
  • Demonstrate how to see what commands have recently been executed.
  • Re-run recently executed commands without retyping them.
  • Use `grep` to select lines from text files that match simple patterns.
  • Use `find` to find files and directories whose names match simple patterns.
  • Use the output of one command as the command-line argument(s) to another command.
  • Explain what is meant by 'text' and 'binary' files, and why many common tools don't handle the latter well.
  • Explore the bash dungeon and fight monsters
  • Reinforce the learning of CLI basics such as how to change directories, move around, find things, and create symlinks
  • Write a Snakefile that does a simple QC and mapping workflow
Lesson Slides Hands-on Recordings
CLI basics
Advanced CLI in Galaxy
CLI Educational Game - Bashcrawl
Make & Snakemake

Year 1: Intro to Git and GitHub

This submodule will cover the basics of research software development and sharing (committing, branching, forking, GitHub, etc.) [SC1.1,2,6]

Time estimation: 2 hours 25 minutes

Learning Objectives
  • Understand the benefits of an automated version control system.
  • Understand the basics of how automated version control systems work.
  • Configure `git` the first time it is used on a computer.
  • Understand the meaning of the `--global` configuration flag.
  • Create a local Git repository.
  • Describe the purpose of the `.git` directory.
  • Go through the modify-add-commit cycle for one or more files.
  • Explain where information is stored at each stage of that cycle.
  • Distinguish between descriptive and non-descriptive commit messages.
  • Explain what the HEAD of a repository is and how to use it.
  • Identify and use Git commit numbers.
  • Compare various versions of tracked files.
  • Restore old versions of files.
  • Create a repository
  • Commit a file
  • Make some changes
  • Use the log to view the diff
  • Undo a bad change
  • Fork a repository on GitHub
  • Clone a remote repository locally
  • Create a branch
  • Commit changes
  • Push changes to a remote repository
  • Create a pull request
  • Update a pull request
  • Edit a file via GitHub interface
  • Create a pull request
  • Update a pull request
Lesson Slides Hands-on Recordings
Version Control with Git
Basics of using Git from the Command Line
Contributing with GitHub via command-line
Contributing with GitHub via its interface

Year 2: Introduction to Genomics

This submodule covers the biological background, as well as the technological concepts involved in genome sequencing, and their effects on downstream data analysis. [SC1.3,4,6]

Year 2: Quality Control

This submodule will cover the evaluation of the quality of datasets, and how to improve quality by a cyclic process of cleaning, trimming and filtering datasets and re-evaluating the quality. [SC1.3-5]

Time estimation: 1 hour 30 minutes

Learning Objectives
  • Assess short reads FASTQ quality using FASTQE 🧬😎 and FastQC
  • Assess long reads FASTQ quality using Nanoplot and PycoQC
  • Perform quality correction with Cutadapt (short reads)
  • Summarise quality metrics with MultiQC
  • Process single-end and paired-end data
Lesson Slides Hands-on Recordings
Quality Control
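
To make the idea of read quality concrete, the sketch below decodes the quality line of a FASTQ record into Phred scores and trims low-quality bases from the end of the read. It assumes the common Phred+33 encoding, and the trimming rule is a deliberately naive illustration, not the algorithm used by Cutadapt or any other tool named above. The record and the function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line):
    """Average Phred score over one read."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores)


def trim_low_quality_tail(sequence, quality_line, threshold=20):
    """Naively drop trailing bases whose Phred score is below the threshold."""
    scores = phred_scores(quality_line)
    end = len(sequence)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return sequence[:end], quality_line[:end]


# Hypothetical four-line FASTQ record, invented for illustration.
record = ["@read1", "GATTACA", "+", "IIIIFF#"]
sequence, quality = record[1], record[3]

scores = phred_scores(quality)
trimmed = trim_low_quality_tail(sequence, quality)
print(scores)
print(trimmed)
```

Re-running `mean_quality` on the trimmed quality string shows the cyclic pattern described above: assess, trim, then assess again.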

Year 2: Mapping

This submodule will cover the comparison of genome sequencing samples to a reference genome. The concept of reference data is relevant in many data analyses across the life sciences, as is connecting to online databases and incorporating their data into an analysis. [SC1.3,4]

Time estimation: 1 hour

Learning Objectives
  • Run a tool to map reads to a reference genome
  • Explain what a BAM file is and what it contains
  • Use a genome browser to understand your data
Lesson Slides Hands-on Recordings
Mapping

Year 3: Variant Analysis

This submodule will cover the topic of variant calling: after mapping sequences to the reference genome, the regions that differ from the reference genome (variants) must be determined and evaluated for impact. As any two individuals will by definition show many differences, distinguishing between healthy variation and potential disease-causing variants is one of the main challenges in variant calling. [SC1.3-5]

Time estimation: 50 minutes

Learning Objectives
  • Understand the steps involved in variant calling.
  • Describe the types of data formats encountered during variant calling.
  • Use command line tools to perform variant calling.
Lesson Slides Hands-on Recordings
Variant Calling Workflow

Year 3: Transcriptomics

DNA only describes the potential of the genome; which genes are actually active within the cell, and thus impacting the health and function of the organism, is determined via transcriptomics (RNA sequencing). By integrating data from these two levels of analysis (DNA and RNA), a clearer picture of the state of the cell can be obtained. [SC1.3-5]

Time estimation: 1 hour 30 minutes

Learning Objectives
  • Learn the basics of processing RNA sequences
  • Check the quality and trim the sequences with bash
  • Use the command-line STAR aligner to map the RNA sequences
  • Estimate the number of reads per gene
Lesson Slides Hands-on Recordings
RNA-seq Alignment with STAR

Editorial Board

This material is reviewed by our Editorial Board:

Fotis E. Psomopoulos, Saskia Hiltemann, Helena Rasche

Funding

These individuals or organisations provided funding support for the development of this resource