
Galaxy: sequence patterns

Use this MOC to choose corpus-grounded Galaxy operations on sequence records (FASTA): interconvert, reformat, merge, compute length, extract/mask by region.

Revised: 2026-06-10 · Rev 1


The runtime-facing map for Galaxy sequence-record choices: operations that read, reshape, subset, or interconvert FASTA (nucleotide or protein) records, as opposed to opaque-column galaxy-tabular-patterns, coordinate-feature galaxy-interval-patterns, or container-shaped galaxy-collection-patterns. Use it before loading raw survey notes: iwc-sequence-operations-survey is the evidence backing; these pages are the actionable references.

Sequence records are arguably the most fundamental bioinformatics data shape, but in IWC the record-manipulation cluster (as opposed to the domain analysis that consumes sequence) is only moderate in size. Its center of gravity is the FASTA ↔ tabular seam: the corpus reaches for a table whenever it needs to edit records, especially headers. Reach for the relabel recipe first when you need header surgery across many records.

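Interconvert (the dominant seam)

  • interconvert FASTA and tabular: move sequence records between FASTA and a (header, sequence) table so tabular tools can edit them; fasta2tab one way, tab2fasta back.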

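Reformat & combine

  • reformat FASTA line width: rewrap FASTA records to a fixed sequence-line width so downstream tools and viewers get canonical 60/70/80-column output (cshl_fasta_formatter).
  • merge FASTA and filter unique: concatenate several FASTA files into one and drop duplicate records by sequence identity in a single step (fasta_merge_files_and_filter_unique_sequences).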

Compute

  • sequence-compute-length — emit a (id, length) table for downstream tabular thresholding (fasta_compute_length). Per-record length only, not assembly statistics.
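To make the output shape concrete, here is a minimal Python sketch of a per-record length table. It is not the implementation of fasta_compute_length; the function name `fasta_lengths` is hypothetical, and taking the id to be the first whitespace-delimited token of the header is an assumption of this sketch.

```python
from typing import Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (id, length) for each FASTA record found in `lines`."""
    record_id = None
    length = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, length  # flush the previous record
            # Assumption: the id is the first whitespace-delimited header token.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            length = 0
        elif record_id is not None:
            length += len(line)  # sequence may wrap over several lines
    if record_id is not None:
        yield record_id, length  # flush the last record


if __name__ == "__main__":
    demo = [">seq1 sample A", "ACGTAC", "GT", ">seq2", "MKV"]
    for rid, n in fasta_lengths(demo):
        print(f"{rid}\t{n}")  # tab-separated, ready for tabular thresholding
```

The result is one row per record, which is what a downstream tabular filter or sort would consume; nothing here summarizes across records.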

Extract & mask by region (interval/annotation bridge)

  • sequence-extract-by-region — turn coordinates into sequence: extract at BED intervals (bedtools getfasta), mask regions by BED (bedtools maskfasta), or extract transcript/CDS FASTA from a GFF (gffread).
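The coordinate-to-sequence idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how bedtools is implemented: the genome is held in memory as a dict, intervals follow the BED convention (0-based start, exclusive end), output names use a `chrom:start-end` label, and masking is a hard mask with `N`. The function names `extract_regions` and `mask_regions` are hypothetical.

```python
from typing import Dict, List, Tuple

# (chrom, start, end) in BED convention: 0-based start, exclusive end.
Interval = Tuple[str, int, int]


def extract_regions(genome: Dict[str, str], intervals: List[Interval]) -> Dict[str, str]:
    """Return {"chrom:start-end": subsequence} for each interval."""
    out = {}
    for chrom, start, end in intervals:
        # Python slicing is also 0-based and half-open, so BED maps directly.
        out[f"{chrom}:{start}-{end}"] = genome[chrom][start:end]
    return out


def mask_regions(genome: Dict[str, str], intervals: List[Interval],
                 mask_char: str = "N") -> Dict[str, str]:
    """Return a copy of `genome` with every interval overwritten by `mask_char`."""
    masked = {name: list(seq) for name, seq in genome.items()}
    for chrom, start, end in intervals:
        chars = masked[chrom]
        # Clamp to the sequence bounds so an overhanging interval cannot fail.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return {name: "".join(chars) for name, chars in masked.items()}


if __name__ == "__main__":
    genome = {"chr1": "ACGTACGTAC"}
    bed = [("chr1", 2, 6)]
    print(extract_regions(genome, bed))
    print(mask_regions(genome, bed))
```

Extraction emits new, shorter records, while masking keeps every record at its original length; that difference is usually what decides between the two operations.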

Recipes

  • relabel-fasta-headers-via-tabular — the high-value construct: edit FASTA headers you cannot easily regex in place — fasta2tab → find/replace on column 1 → tab2fasta.
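The three-step shape of the recipe can be mimicked in plain Python to see what each stage does to a record. This is only a sketch of the data flow, not the Galaxy tools themselves; the helper names `fasta_to_rows`, `relabel`, and `rows_to_fasta`, the sample headers, and the regular expression are all hypothetical.

```python
import re
from typing import Iterable, List, Tuple

Row = Tuple[str, str]  # (header, sequence): column 1 and column 2 of the table


def fasta_to_rows(lines: Iterable[str]) -> List[Row]:
    """Step 1: flatten FASTA into one (header, sequence) row per record."""
    rows: List[Row] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                rows.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # join wrapped sequence lines
    if header is not None:
        rows.append((header, "".join(chunks)))
    return rows


def relabel(rows: List[Row], pattern: str, replacement: str) -> List[Row]:
    """Step 2: regex find/replace on column 1 only; column 2 is untouched."""
    return [(re.sub(pattern, replacement, header), seq) for header, seq in rows]


def rows_to_fasta(rows: List[Row]) -> str:
    """Step 3: write the rows back out as FASTA (unwrapped sequence lines)."""
    return "".join(f">{header}\n{seq}\n" for header, seq in rows)


if __name__ == "__main__":
    fasta = [">contig_1 len=8", "ACGT", "ACGT", ">contig_2 len=3", "TTT"]
    rows = fasta_to_rows(fasta)
    print(rows_to_fasta(relabel(rows, r"^contig_(\d+).*$", r"sampleA_scaffold\1")))
```

Because the edit touches only column 1, the sequence column cannot be altered by the find/replace, which is the main reason to route header edits through a table.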

Bridges

  • sequence ↔ tabular — the dominant seam. fasta2tab/tab2fasta interconvert; fasta_compute_length emits a length table; the relabel recipe lives entirely here. The line: tabular treats the record as opaque columns (header = col 1, sequence = col 2); sequence operations understand FASTA structure. See galaxy-tabular-patterns.
  • sequence ↔ interval: getfasta (intervals → sequence) and maskfasta (intervals + sequence → masked sequence) consume coordinate features and emit sequence. Under the output-shape rule, galaxy-interval-patterns owns the BED that feeds them; this MOC owns the extraction. See sequence-extract-by-region and galaxy-interval-patterns.
  • alignment / annotation → sequence: samtools_fastx (BAM → FASTQ/FASTA) and gffread (GFF + genome → transcript FASTA) bridge from upstream domains into sequence records. bamtobed (BAM → BED) is the interval-side analogue owned by galaxy-interval-patterns.

Thin — tracked, not yet paged

Corpus-present but single-source; documented so the omission is deliberate. A page follows when a second independent exemplar appears (the same hold-if-thin discipline as the interval MOC):

  • subset by id list: seqtk_subseq (iuc; also does ranges) and filter_by_fasta_ids (galaxyp; discarded-complement + regex anchoring) each appear once. When paged, lead with seqtk_subseq and footnote filter_by_fasta_ids.
  • filter by length: fasta_filter_by_length (one VGP-decontamination use).
  • translate (nt → protein): seqkit_translate (one metagenomic-gene-catalogue use).
  • FASTQ → FASTA: fastq_to_fasta_python (mgnify reads QC); record-format interconversion at the reads-domain edge.
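For orientation only, the first two operations above reduce to simple predicates over (header, sequence) records. The Python sketch below shows that logic; it does not reproduce the behaviour or options of seqtk_subseq, filter_by_fasta_ids, or fasta_filter_by_length. The function names are hypothetical, and matching on the first whitespace-delimited header token is an assumption.

```python
from typing import List, Optional, Set, Tuple

Record = Tuple[str, str]  # (header, sequence)


def record_id(header: str) -> str:
    """Assumption: the id is the first whitespace-delimited token of the header."""
    parts = header.split()
    return parts[0] if parts else ""


def subset_by_ids(records: List[Record], wanted: Set[str]) -> List[Record]:
    """Keep records whose id is in `wanted`, preserving input order."""
    return [rec for rec in records if record_id(rec[0]) in wanted]


def filter_by_length(records: List[Record], min_len: int = 0,
                     max_len: Optional[int] = None) -> List[Record]:
    """Keep records with min_len <= len(sequence) <= max_len (no upper bound if None)."""
    kept = []
    for header, seq in records:
        if len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        kept.append((header, seq))
    return kept


if __name__ == "__main__":
    records = [("a1 desc", "ACGT"), ("b2", "AC"), ("c3 x", "ACGTACGT")]
    print(subset_by_ids(records, {"a1", "c3"}))
    print(filter_by_length(records, min_len=3, max_len=4))
```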

Gaps (no corpus exemplar, no page)

Per corpus-first, zero IWC uptake → no page; documented here so the absence is explicit:

  • reverse-complement (standalone), sequence sort, composition / GC compute, standalone dedup (seqkit_rmdup), EMBOSS seqret/transeq, fasta_nucleotide_changer. Common in GTN training material and the Tool Shed, but no IWC workflow reaches for them.

These are tracked as IWC-input-blocked candidates; a page follows only when an IWC workflow uses the operation.

Out of scope

Domain analysis that consumes or emits sequence but does not manipulate records: assembly (gfastats — the corpus’s largest FASTA-touching tool, all VGP), metagenomic binning, proteomics search-DB build, alignment, annotation, variant calling, search, profiling. These route through tool discovery, not patterns. See iwc-sequence-operations-survey §6.

See also

Incoming References (7)

  • Sequence: relabel FASTA headers via tabular (related pattern). Edit FASTA headers you cannot easily regex in place: fasta2tab, rewrite column 1 with find/replace, then tab2fasta back. The high-value sequence recipe.
  • Sequence: compute record lengths (related pattern). Emit an (id, length) table from a FASTA so downstream tabular steps can filter, sort, or threshold records by length; fasta_compute_length.
  • Sequence: extract or mask by region (related pattern). Turn coordinates into sequence: extract FASTA at BED intervals (getfasta), mask regions by BED (maskfasta), or extract transcripts from a GFF (gffread).
  • Sequence: interconvert FASTA and tabular (related pattern). Move sequence records between FASTA and a (header, sequence) table so tabular tools can edit them; fasta2tab one way, tab2fasta back.
  • Sequence: merge FASTA and filter unique (related pattern). Concatenate several FASTA files into one and drop duplicate records by sequence identity in a single step; fasta_merge_files_and_filter_unique_sequences.
  • Sequence: reformat FASTA line width (related pattern). Rewrap FASTA records to a fixed sequence-line width so downstream tools and viewers get canonical 60/70/80-column output; cshl_fasta_formatter.
  • IWC Sequence Operations Survey (related note). IWC survey of record-level FASTA manipulation (interconversion, reformat, merge/dedup, subset, extract-at-intervals); sizes a galaxy-sequence-patterns MOC.