Introduction to sequencing with Python (part four)

Under Development!

This tutorial is not in its final state. The content may change substantially in the coming months. Because of this status, it is not listed on the topic pages.

Author(s)	Anton Nekrutenko
Reviewers

Overview
Questions:

How to manipulate files in Python

How to read and write FASTA

How to read and write FASTQ

How to read and write SAM

Objectives:

Understand manipulation of files in Python

Time estimation: 1 hour

Supporting Materials:

Jupyter Notebook

Published: Feb 13, 2024

Last modification: Feb 20, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00411

Revision: 2

Best viewed in a Jupyter Notebook

This tutorial is best viewed in a Jupyter notebook! You can load this notebook in one of the following ways:

Run on the GTN with JupyterLite (in-browser computations)

Click to Launch JupyterLite

Launching the notebook in Jupyter in Galaxy

Instructions to Launch JupyterLab

Open a Terminal in JupyterLab with File -> New -> Terminal

Run wget https://training.galaxyproject.org/training-material/topics/data-science/tutorials/gnmx-lecture5/data-science-gnmx-lecture5.ipynb

Select the notebook that appears in the list of files on the left.

Downloading the notebook

Right-click one of these links: Jupyter Notebook (With Solutions), Jupyter Notebook (Without Solutions)

Choose Save Link As...

Reading and writing files in Python

First, let's download the file we will be using into the notebook's working directory:

!wget https://raw.githubusercontent.com/nekrut/BMMB554/master/2023/data/l9/mt_cds.fa

In Python, you can handle files using the built-in open function. The open function creates a file object, which you can use to read, write, or modify the file.

Here’s an example of how to open a file for reading:

f = open("mt_cds.fa", "r")

In this example, the open function takes two arguments: the name of the file, and the mode in which you want to open the file. The r mode indicates that you want to open the file for reading.

After you’ve opened the file, you can read its contents using the read method:

contents = f.read()
print(contents)

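You can also read the file line by line using the readline method. Because read above already consumed the whole file, the file position is now at the end, and readline would return an empty string. Rewind to the beginning with seek(0) first:

f.seek(0)
line = f.readline()
print(line)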

When you’re done reading the file, you should close it using the close method:

f.close()

You can also use the with statement to automatically close the file when you’re done:

with open("mt_cds.fa", "r") as f:
    contents = f.read()
    print(contents)
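For large files, loading everything with read can use a lot of memory. A file object can also be used directly in a for loop, which yields one line at a time. The sketch below counts all lines and the lines starting with ">", which in a FASTA file such as mt_cds.fa mark the start of each sequence. The helper name count_lines and the small file demo.fa are made up for illustration, so that the example runs on its own.

```python
def count_lines(path):
    """Return (total_lines, header_lines) for a text file.

    Lines starting with '>' are counted as headers (FASTA convention).
    """
    total = 0
    headers = 0
    with open(path, "r") as f:
        for line in f:  # the file object yields one line at a time
            total += 1
            if line.startswith(">"):
                headers += 1
    return total, headers


# Build a tiny demo file so the example is self-contained
# (demo.fa is a hypothetical name, not part of the tutorial data).
with open("demo.fa", "w") as f:
    f.write(">seq1\nACGT\nACGT\n>seq2\nGGCC\n")

print(count_lines("demo.fa"))  # (5, 2)
```

The same function can be pointed at mt_cds.fa once it has been downloaded.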

You can also write to files using the write method (note the "w" mode). The write method does not add a line ending, so include "\n" yourself:

f = open("sample.txt", "w")
f.write("This is a new line.\n")
f.close()

If you open an existing file in write mode, its contents will be overwritten. If you want to append to an existing file instead, you can use the "a" mode:

f = open("sample.txt", "a")
f.write("This is another line.\n")
f.close()

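When you're writing to a file, it's important to close it when you're done. Python buffers writes in memory, so data may not reach the disk until the file is closed (or flushed). The with statement, shown above for reading, closes the file automatically, even if an error occurs.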
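As a minimal sketch, the write and append steps can be combined with the with statement so that each file is closed automatically. The file name notes.txt is hypothetical and used only here.

```python
# notes.txt is a hypothetical file name used only for this sketch.
with open("notes.txt", "w") as f:  # "w" creates the file or truncates it
    f.write("first line\n")

with open("notes.txt", "a") as f:  # "a" adds to the end of the file
    f.write("second line\n")

# Read the file back to confirm that both lines were saved
with open("notes.txt", "r") as f:
    lines = f.read().splitlines()

print(lines)  # ['first line', 'second line']
```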

In addition to reading and writing text files, you can also use Python to handle binary files, such as images or audio files.

Let’s download an image:

!wget https://imgs.xkcd.com/comics/file_extensions.png

Here’s an example of how to read an image file:

with open("file_extensions.png", "rb") as f:
    contents = f.read()

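Note that when working with binary files, you must add "b" to the mode: "rb" for reading and "wb" for writing. In binary mode, read returns a bytes object rather than a string, and no text decoding or newline translation is performed.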
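One practical use of binary mode is checking a file's type from its first bytes instead of trusting its extension. Every PNG file begins with the same 8-byte signature, so a rough check might look like the sketch below. The function name looks_like_png and the two small test files are made up for illustration.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def looks_like_png(path):
    """Return True if the file starts with the 8-byte PNG signature."""
    with open(path, "rb") as f:
        first_bytes = f.read(8)  # read at most 8 bytes, returned as bytes
    return first_bytes == PNG_SIGNATURE


# Two small hypothetical files so the example runs on its own
with open("fake.png", "wb") as f:
    f.write(PNG_SIGNATURE + b"rest of the file")
with open("not_png.bin", "wb") as f:
    f.write(b"hello")

print(looks_like_png("fake.png"))     # True
print(looks_like_png("not_png.bin"))  # False
```

After downloading the image above, looks_like_png("file_extensions.png") can be used in the same way.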

There are many more features and methods related to file handling in Python, but the basics covered here should be enough to get you started.

FASTA

FASTA is a file format that is commonly used to store biological sequences, such as DNA or protein sequences. In Python, you can read a FASTA file by opening the file, reading the lines one by one, and processing the data as needed.

Here’s an example of how you might read a FASTA file in Python:

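sequences = {}  # maps header -> full sequence
with open("mt_cds.fa", "r") as file:
    header = ""
    sequence = ""
    for line in file:
        line = line.rstrip()
        if line.startswith(">"):
            # A new record begins: save the previous one, if any
            if header != "":
                sequences[header] = sequence
                sequence = ""
            header = line[1:]  # drop the leading ">"
        else:
            # Sequence data may span several lines
            sequence += line
    # Save the last record, which is not followed by another header
    if header != "":
        sequences[header] = sequence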

The code above uses a with statement to open the file and read the lines one by one. If a line starts with a ">", it is a header: the sequence collected so far is stored in the dictionary under the previous header, and the new header (without the ">") becomes the current one. If the line does not start with a ">", it is part of the current sequence and is appended to it. After the loop ends, the last sequence is stored as well, since no further header follows it.
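Writing FASTA is the reverse operation. The sketch below takes a dictionary shaped like sequences above (header mapped to sequence) and writes it back out, wrapping sequence lines at a fixed width. The function name write_fasta, the width of 60, and the file name out.fa are assumptions chosen for illustration; the format itself does not mandate a line length.

```python
def write_fasta(sequences, path, width=60):
    """Write a {header: sequence} dictionary to a FASTA file.

    Sequence lines are wrapped at `width` characters (60 is a common
    choice, not a requirement of the format).
    """
    with open(path, "w") as out:
        for header, sequence in sequences.items():
            out.write(f">{header}\n")
            # Emit the sequence in chunks of `width` characters
            for start in range(0, len(sequence), width):
                out.write(sequence[start:start + width] + "\n")


# Made-up sequences for demonstration only
example = {"seq1": "ACGT" * 20, "seq2": "GGCC"}
write_fasta(example, "out.fa")

with open("out.fa") as f:
    print(f.read())
```

Here seq1 is 80 bases long, so it is written as one line of 60 bases followed by one line of 20.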

FASTQ

Let’s download a sample FASTQ file:

!wget https://raw.githubusercontent.com/nekrut/BMMB554/master/2023/data/l9/reads.fq

FASTQ is a file format that is commonly used to store high-throughput sequencing data. It consists of a series of records, each of which spans four lines: a header (starting with "@"), a sequence, a quality score header (starting with "+"), and a quality score string with one character per base. In Python, you can read a FASTQ file by opening the file, reading the lines four at a time, and processing the data as needed. This approach assumes that each sequence and each quality string sits on a single line.

Here’s an example of how you might read a FASTQ file in Python:

def read_fastq(file_path):
    records = []
    with open(file_path, "r") as f:
        while True:
            header = f.readline().strip()
            if header == "":  # empty string: end of file (or a blank line)
                break
            sequence = f.readline().strip()
            quality_header = f.readline().strip()
            quality = f.readline().strip()
            # Keep all four fields so callers can unpack four values
            records.append((header, sequence, quality_header, quality))
    return records

In this example, the read_fastq function takes a file path as an argument, and returns a list of records, where each record is a tuple of four strings: the header, the sequence, the quality score header, and the quality score string. The function uses a while loop to read the lines four at a time until the end of the file is reached.

You can use this function to read a FASTQ file like this:

records = read_fastq("reads.fq")
for header, sequence, quality_header, quality in records:
    print(header)
    print(sequence)
    print(quality_header)
    print(quality)

This will print the headers, sequences, quality score headers, and quality scores in the FASTQ file. You can modify the read_fastq function to process the data in any way you need.
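One common processing step is turning the quality string into numeric scores. The sketch below assumes the Phred+33 encoding, in which each character's ASCII code minus 33 gives the quality score. That is an assumption about the input file and should be checked for your data. The function names phred_scores and mean_quality are made up for illustration.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores.

    Assumes Phred+33 encoding by default; pass offset=64 for files
    that use the older Phred+64 encoding.
    """
    return [ord(char) - offset for char in quality]


def mean_quality(quality, offset=33):
    """Return the average Phred score of a quality string (0.0 if empty)."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


print(phred_scores("II5!"))  # [40, 40, 20, 0]
print(mean_quality("II5!"))  # 25.0
```

Applied to the output of read_fastq, mean_quality(quality) could be used, for example, to skip reads whose average score falls below a chosen threshold.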

SAM

Let’s download an example SAM file:

!wget https://raw.githubusercontent.com/nekrut/BMMB554/master/2023/data/l9/sam_example.sam

SAM (Sequence Alignment/Map) is a tab-delimited text format that is used to store alignments of sequencing reads to reference sequences. In Python, you can read a SAM file by opening the file, reading the lines one by one, and processing the data as needed.

Here’s an example of how you might read a SAM file in Python:

def read_sam(file_path):
    records = []
    with open(file_path, "r") as f:
        for line in f:
            if line.startswith("@"):
                continue
            fields = line.strip().split("\t")
            records.append(fields)
    return records

In this example, the read_sam function takes a file path as an argument, and returns a list of records, where each record is a list of fields. The function uses a with statement to open the file and read the lines one by one. If a line starts with an "@", it is assumed to be a header and is ignored. If the line does not start with an "@", it is assumed to be a record, and the fields are extracted by splitting the line on tabs.

You can use this function to read a SAM file like this:

records = read_sam("sam_example.sam")
for fields in records:
    print(fields)

This will print the fields in the SAM file. You can modify the read_sam function to process the data in any way you need. For example, you might want to extract specific fields, such as the reference name, the start position, and the cigar string.
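As a sketch of such field extraction: each alignment line has 11 mandatory columns in a fixed order, with the reference name (RNAME) in column 3, the 1-based start position (POS) in column 4, and the CIGAR string in column 6. The helper names parse_alignment and parse_cigar, and the example alignment line, are made up for illustration.

```python
import re

# Names of the 11 mandatory SAM columns, in order
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_alignment(fields):
    """Turn a list of SAM fields into a dictionary keyed by column name."""
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])  # these columns hold integers
    record["TAGS"] = fields[11:]  # optional TAG:TYPE:VALUE fields, kept as strings
    return record


def parse_cigar(cigar):
    """Split a CIGAR string such as '8M2I4M' into (length, operation) pairs."""
    if cigar == "*":  # '*' means the CIGAR is unavailable
        return []
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# A made-up alignment line for illustration only
line = "\t".join(["read1", "0", "chrM", "100", "60", "8M2I4M", "*", "0", "0",
                  "ACGTACGTACGTAC", "I" * 14, "NM:i:2"])
record = parse_alignment(line.split("\t"))
print(record["RNAME"], record["POS"], parse_cigar(record["CIGAR"]))
# chrM 100 [(8, 'M'), (2, 'I'), (4, 'M')]
```

Each list returned by read_sam can be passed to parse_alignment in the same way.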

You've Finished the Tutorial

Key points

Python can be used to read and write files

Citing this Tutorial

Anton Nekrutenko, Introduction to sequencing with Python (part four) (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/data-science/tutorials/gnmx-lecture5/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


                   
