Frequently Asked Questions

Format


FASTQ format

Although it looks complicated (and maybe it is), the FASTQ format is easy to understand with a little decoding. Each read, representing a fragment of DNA, is encoded by 4 lines:

Line | Description
1    | Always begins with @, followed by information about the read
2    | The actual nucleic acid sequence
3    | Always begins with + and sometimes repeats the information from line 1
4    | A string of characters representing the quality score of each base in the sequence; must have the same number of characters as line 2

So for example, the first sequence in our file is:

@03dd2268-71ef-4635-8bce-a42a0439ba9a runid=8711537cc800b6622b9d76d9483ecb373c6544e5 read=252 ch=179 start_time=2019-12-08T11:54:28Z flow_cell_id=FAL10820 protocol_group_id=la_trappe sample_id=08_12_2019
AGTAAGTAGCGAACCGGTTTCGTTTGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTCGTGCGGAAGGCGCTTCACCCAGGGCCTCTCATGCTTTGTCTTCCTGTTTATTCAGGATCGCCCAAAGCGAGAATCATACCACTAGACCACACGCCCGAATTATTGTTGCGTTAATAAGAAAAGCAAATATTTAAGATAGGAAGTGATTAAAGGGAATCTTCTACCAACAATATCCATTCAAATTCAGGCA
+
$'())#$$%#$%%'-$&$%'%#$%('+;<>>>18.?ACLJM7E:CFIMK<=@0/.4<9<&$007:,3<IIN<3%+&$(+#$%'$#$.2@401/5=49IEE=CH.20355>-@AC@:B?7;=C4419)*$$46211075.$%..#,529,''=CFF@:<?9B522.(&%%(9:3E99<BIL?:>RB--**5,3(/.-8B>F@@=?,9'36;:87+/19BAD@=8*''&''7752'$%&,5)AM<99$%;EE;BD:=9<@=9+%$

This means that the fragment named @03dd2268-71ef-4635-8bce-a42a0439ba9a (the ID given in line 1) corresponds to:

  • the DNA sequence AGTAAGTAGCGAACCGGTTTCGTTTGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTCGTGCGGAAGGCGCTTCACCCAGGGCCTCTCATGCTTTGTCTTCCTGTTTATTCAGGATCGCCCAAAGCGAGAATCATACCACTAGACCACACGCCCGAATTATTGTTGCGTTAATAAGAAAAGCAAATATTTAAGATAGGAAGTGATTAAAGGGAATCTTCTACCAACAATATCCATTCAAATTCAGGCA (line 2)
  • this sequence was read with the per-base quality string $'())#$$%#$%%'-$&$%'%#$%('+;<>>>18.?ACLJM7E:CFIMK<=@0/.4<9<&$007:,3<IIN<3%+&$(+#$%'$#$.2@401/5=49IEE=CH.20355>-@AC@:B?7;=C4419)*$$46211075.$%..#,529,''=CFF@:<?9B522.(&%%(9:3E99<BIL?:>RB--**5,3(/.-8B>F@@=?,9'36;:87+/19BAD@=8*''&''7752'$%&,5)AM<99$%;EE;BD:=9<@=9+%$ (line 4).
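The four-line layout described above can be sketched in Python. This is a minimal illustration, not a replacement for an established FASTQ library: the names `FastqRecord` and `read_fastq` are hypothetical, the short record at the bottom is made up for the example, and the sketch assumes each sequence and quality string sits on a single line.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str       # first word of line 1, without the leading "@"
    description: str   # remainder of line 1 (may be empty)
    sequence: str      # line 2
    quality: str       # line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines.

    Assumes unwrapped FASTQ; reading stops at end of file or at the
    first empty header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")

        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("lines 2 and 4 must have the same length")

        read_id, _, description = header[1:].partition(" ")
        yield FastqRecord(read_id, description, sequence, quality)


# A made-up record, used only to exercise the parser.
example = io.StringIO("@read1 sample_id=demo\nACGT\n+\nII+$\n")
records = list(read_fastq(example))
print(records[0].read_id, records[0].sequence, records[0].quality)
```

The length check mirrors the rule that line 4 must contain exactly as many characters as line 2.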

But what does this quality score mean?

The quality score for each sequence is a string of characters, one for each base of the nucleotide sequence, used to characterize the probability of misidentification of each base. The score is encoded using the ASCII character table (with some historical differences):

Figure: Encoding of the quality score with ASCII characters for different Phred encodings. The ASCII code sequence is shown at the top, with symbols for codes 33 to 64, then uppercase letters, more symbols, and lowercase letters. Sanger maps from 33 to 73, while Solexa is shifted, starting at 59 and going to 104. Illumina 1.3 starts at 64 and goes to 104; Illumina 1.5 is shifted three scores to the right but still ends at 104. Illumina 1.8+ returns to the Sanger range, extended by a single score.

So there is an ASCII character associated with each nucleotide, representing its Phred quality score, which encodes the probability of an incorrect base call:

Phred Quality Score | Probability of incorrect base call | Base call accuracy
10                  | 1 in 10                            | 90%
20                  | 1 in 100                           | 99%
30                  | 1 in 1,000                         | 99.9%
40                  | 1 in 10,000                        | 99.99%
50                  | 1 in 100,000                       | 99.999%
60                  | 1 in 1,000,000                     | 99.9999%
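To make the character-to-score mapping concrete, here is a small Python sketch that decodes the start of the example read's quality string. It assumes the Sanger-style offset of 33 (Phred+33); older Solexa and Illumina files would need a different offset. The function name `phred_scores` is chosen for this illustration only.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to Phred scores.

    The default offset of 33 is an assumption (Sanger / Illumina 1.8+).
    """
    return [ord(char) - offset for char in quality]


# First ten characters of line 4 of the example read.
quality_prefix = "$'())#$$%#"
scores = phred_scores(quality_prefix)
print(scores)  # [3, 6, 7, 8, 8, 2, 3, 3, 4, 2]
```

All of these scores are below 10, so by the table above each of these first ten bases has more than a 1 in 10 chance of being wrong.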

Quality Scores

But what does this quality score mean?

The quality score for each sequence is a string of characters, one for each base of the nucleotide sequence, used to characterize the probability of misidentification of each base. The score is encoded using the ASCII character table (with some historical differences):

To save space, the sequencer records a single ASCII character for each score from 0 to 42. For example, in the Sanger (Phred+33) encoding, 10 corresponds to “+” and 40 corresponds to “I”. FastQC knows how to translate this. This is often called “Phred” scoring.
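As a rough check of the examples above, the sketch below turns a score into its ASCII character. The helper `score_to_char` and the constant `SANGER_OFFSET` are names invented for this illustration; the offset of 33 is the Sanger convention, and 64 is shown only for comparison with older Illumina encodings.

```python
SANGER_OFFSET = 33  # Phred+33 (assumed default for this sketch)


def score_to_char(score: int, offset: int = SANGER_OFFSET) -> str:
    """Return the ASCII character that encodes a Phred score."""
    if score < 0:
        raise ValueError("Phred scores handled here must be non-negative")
    return chr(score + offset)


print(score_to_char(10))             # +
print(score_to_char(40))             # I
print(score_to_char(40, offset=64))  # h

# Characters used for the scores 0 to 42 under Phred+33.
all_chars = "".join(score_to_char(q) for q in range(43))
print(all_chars[0], all_chars[-1])   # ! K
```

The same score maps to different characters under different offsets, which is why the encoding of a file needs to be known before its quality string is interpreted.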

Figure: Encoding of the quality score with ASCII characters for different Phred encodings. The ASCII code sequence is shown at the top, with symbols for codes 33 to 64, then uppercase letters, more symbols, and lowercase letters. Sanger maps from 33 to 73, while Solexa is shifted, starting at 59 and going to 104. Illumina 1.3 starts at 64 and goes to 104; Illumina 1.5 is shifted three scores to the right but still ends at 104. Illumina 1.8+ returns to the Sanger range, extended by a single score.

So there is an ASCII character associated with each nucleotide, representing its Phred quality score, which encodes the probability of an incorrect base call:

Phred Quality Score | Probability of incorrect base call | Base call accuracy
10                  | 1 in 10                            | 90%
20                  | 1 in 100                           | 99%
30                  | 1 in 1,000                         | 99.9%
40                  | 1 in 10,000                        | 99.99%
50                  | 1 in 100,000                       | 99.999%
60                  | 1 in 1,000,000                     | 99.9999%

What does 0-42 represent? These numbers, when plugged into a formula, tell us the probability of an error for that base. This is the formula, where Q is our quality score (0-42) and P is the probability of an error:

Q = -10 log10(P)

Using this formula, we can calculate that a quality score of 40 means an error probability of only 0.0001, or 1 in 10,000!
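For readers who want to see the arithmetic, the formula can be rearranged to give P directly from Q. The worked steps below only restate the formula above; nothing else is assumed.

```latex
\begin{align*}
Q &= -10 \log_{10}(P) \\
\log_{10}(P) &= -\frac{Q}{10} \\
P &= 10^{-Q/10} \\[4pt]
Q = 40 &\;\Rightarrow\; P = 10^{-40/10} = 10^{-4} = 0.0001 \\
Q = 10 &\;\Rightarrow\; P = 10^{-10/10} = 10^{-1} = 0.1
\end{align*}
```

Each increase of 10 in Q therefore divides the error probability by 10, which matches the rows of the table.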


Visualisation


Using IGV with Galaxy

You can send data from your Galaxy history to IGV for viewing as follows:

  1. Install IGV on your computer (IGV download page)
  2. Start IGV
  3. In recent versions of IGV, you will have to enable the port:
    • In IGV, go to View > Preferences > Advanced
    • Check the box Enable Port
  4. In Galaxy, expand the dataset you would like to view in IGV
    • Make sure you have set a reference genome/database correctly (dbkey) (instructions)
    • Under display in IGV, click on local
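If clicking on local does nothing, one thing worth checking is whether IGV is actually listening on its port. The sketch below is only an illustration and is not part of Galaxy or IGV: it assumes IGV runs on the same machine and uses port 60151, which is IGV's usual default; check the value shown under View > Preferences > Advanced if yours differs. The function name `port_is_open` is made up for this example.

```python
import socket

IGV_HOST = "127.0.0.1"
IGV_PORT = 60151  # assumed default; confirm in IGV under Preferences > Advanced


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if port_is_open(IGV_HOST, IGV_PORT):
    print("IGV appears to be listening; the Galaxy link should work.")
else:
    print("No listener found; start IGV and enable the port.")
```

A successful connection only shows that something is listening on that port, so it is a quick sanity check rather than proof that IGV is configured correctly.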


