
IWC Sequence Operations Survey

IWC survey of record-level FASTA manipulation (interconversion, reformat, merge/dedup, subset, extract-at-intervals); sizes a galaxy-sequence-patterns MOC.

Revised 2026-06-10 · Rev 1 · component

IWC sequence-record operations survey

Backs #272 — a candidate galaxy-sequence-patterns MOC, fourth sibling to galaxy-collection-patterns, galaxy-tabular-patterns, and galaxy-interval-patterns. Scope is record-level sequence (FASTA / nucleotide-or-protein record) manipulation — operations that read, reshape, subset, or interconvert sequence records — not the domain tools that analyze sequence (alignment, assembly, annotation, variant calling, search, profiling). Same discipline as #268: scope by data-shape family + manipulation algebra, not domain. The issue framed a hold-if-thin gate; the survey’s first job is to size the cluster honestly.

Source corpus: $IWC_FORMAT2/ — 120 .gxwf.yml files. Citations are $IWC_FORMAT2/path plus step label or tool name; line numbers only for stable parameter snippets.

TL;DR — the sizing finding

Sequence-record manipulation is real and slightly broader than interval algebra, but moderate — not the rich cluster the issue’s headline counts imply. The issue’s table ranks operations by tool_id-occurrence count (fasta_formatter 20, getfasta 16, merge 14, …). Those counts reproduce exactly, but they are inflated by the same subworkflow embedding the interval survey caught: distinct workflow counts are far lower.

| Operation (corpus-observed) | tool | tool_id occ | distinct wf |
| --- | --- | --- | --- |
| FASTA reformat (line width) | devteam/fasta_formatter (cshl_fasta_formatter) | 20 | 4 |
| Extract sequence at intervals | iuc/bedtools (bedtools_getfastabed) | 16 | 3 |
| Merge FASTA + filter unique sequences | galaxyp/fasta_merge_files_and_filter_unique_sequences | 14 | 4 |
| FASTA → tabular | devteam/fasta_to_tabular (fasta2tab) | 13 | 4 |
| tabular → FASTA | devteam/tabular_to_fasta (tab2fasta) | 11 | 3 |
| Sequence length compute | devteam/fasta_compute_length | 7 | 3 |
| Mask sequence by intervals | iuc/bedtools (bedtools_maskfastabed) | 5 | 2 |
| Extract transcript/CDS FASTA from annotation | devteam/gffread | 8 | 4 |
| Filter by length | devteam/fasta_filter_by_length | 2 | 1 |
| Translate (nt → protein) | iuc/seqkit_translate | 2 | 1 |
| Subset by id list | iuc/seqtk (seqtk_subseq) | 2 | 1 |
| Subset/filter by id list | galaxyp/filter_by_fasta_ids | 3 | 1 |
| FASTQ → FASTA | devteam/fastqtofasta (fastq_to_fasta_python) | 10 | 3 |
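
The gap between the two count columns can be illustrated with a small sketch. It assumes a hypothetical in-memory corpus in which each workflow is a dict of `steps`, and a step carries either a `tool_id` or an embedded subworkflow under `run`. The function names and the toy corpus are illustrative only; this is not the tooling the survey used.

```python
from collections import Counter

def iter_tool_ids(workflow):
    """Yield every tool_id in a workflow, descending into embedded subworkflows."""
    for step in workflow.get("steps", []):
        if "tool_id" in step:
            yield step["tool_id"]
        if isinstance(step.get("run"), dict):
            yield from iter_tool_ids(step["run"])

def count_tool_usage(corpus):
    """Return (occurrences, distinct_workflows) Counters keyed by tool_id.

    corpus maps a workflow name to its parsed dict (hypothetical shape).
    """
    occurrences = Counter()
    distinct = Counter()
    for wf in corpus.values():
        ids = list(iter_tool_ids(wf))
        occurrences.update(ids)      # every step counts, embedded copies included
        distinct.update(set(ids))    # each workflow counts at most once per tool
    return occurrences, distinct

# Toy corpus: "complete" embeds "qc" as a subworkflow, so qc's steps are counted twice.
qc = {"steps": [{"tool_id": "cshl_fasta_formatter"},
                {"tool_id": "cshl_fasta_formatter"}]}
corpus = {
    "qc": qc,
    "complete": {"steps": [{"run": qc}, {"tool_id": "cshl_fasta_formatter"}]},
}
occ, wf = count_tool_usage(corpus)
print(occ["cshl_fasta_formatter"], wf["cshl_fasta_formatter"])
```

In the toy corpus the formatter shows five occurrences but only two workflows, the same shape of inflation the table reports.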

Three findings sharpen the gate:

  1. The healthy core is interconversion + reformat, not the interval-bridge tools. getfasta/maskfasta are the operations the #268 scope-edge analysis surfaced, but in distinct-workflow terms they are mid-pack (3 / 2). The operations that actually recur are fasta↔tabular interconversion (fasta2tab/tab2fasta, 4 / 3), reformat (fasta_formatter, 4), and merge+dedup (4). The single most reusable construct is not any one operation but a multi-step recipe: relabel FASTA headers by detouring through tabular (§3).
  2. The 20-occurrence headline (fasta_formatter) is ~2–3 authoring contexts. All four fasta_formatter workflows are the mgnify amplicon family (mgnify-amplicon-pipeline-v5-{quality-control-paired-end,quality-control-single-end,rrna-prediction,complete}); …-complete embeds the other three as subworkflows. Same inflation hits fasta2tab/tab2fasta (the pathogen-identification family) and getfasta. Distinct-context counts run roughly half the workflow counts above.
  3. The biggest FASTA-touching tool in the corpus is out of scope and must be named, or the inventory looks absurd. bgruening/gfastats has 113 tool_id occurrences across 10 VGP-assembly workflows — it dwarfs everything here. But it is an assembly tool (FASTA↔GFA conversion + assembly statistics: “Convert purged fasta to gfa”, “gfastats gfa hap1”, $IWC_FORMAT2/VGP-assembly-v2/...), domain-out per the issue. Counting it would make “sequence manipulation” read as a VGP-assembly concern, which it is not. Held out; recorded here so the hole is deliberate.

Net for the gate (the decision belongs to /iwc-survey-act, not here): a sequence MOC is defensible and modestly broader than interval, but should lead with the interconversion seam and its relabel recipe, keep the thin operations (translate, filter-by-length, subset-by-id) as footnoted ingredients, and treat getfasta/maskfasta as shared bridges with #268, not the headline.

1. What exists — operation inventory

Distinct sequence-record operations, by the move they make. Counts are evidence, not the headline; distinct-workflow counts lead (§TL;DR).

FASTA ↔ tabular interconversion — the strongest seam (4 / 3 workflows)

The corpus reaches for the tabular form whenever it needs to edit sequence records (headers especially), because FASTA headers are awkward to regex in place but trivial as column 1 of a table.

  • fasta2tab (devteam) — FASTA → tabular, header in col 1, sequence in col 2. descr_columns: "1", keep_first: "0" ($IWC_FORMAT2/microbiome/pathogen-identification/gene-based-pathogen-identification/Gene-based-Pathogen-Identification.gxwf.yml:286, step “sample_specific_contigs_tabular_file_preparation”). Also $IWC_FORMAT2/microbiome/metagenomic-genes-catalogue/metagenomic-genes-catalogue.gxwf.yml, $IWC_FORMAT2/microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml, $IWC_FORMAT2/proteomics/clinicalmp/clinicalmp-discovery/iwc-clinicalmp-discovery-workflow.gxwf.yml.
  • tab2fasta (devteam) — tabular → FASTA, the inverse. Co-occurs with fasta2tab in three of the four interconversion workflows (the pathogen-identification + metagenomic-genes family); clinicalmp-discovery uses fasta2tab one-way only (FASTA into a downstream tabular join).
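
As a rough illustration of the record shape this seam relies on, the sketch below converts FASTA text to (header, sequence) rows and back in plain Python. It is not the devteam tools' implementation; the function names are hypothetical, and keeping the whole header as one column is how this sketch reads the `descr_columns: "1"` / `keep_first: "0"` setting (an assumption).

```python
def fasta_to_tabular(fasta_text):
    """Return [(header, sequence)] rows; wrapped sequence lines are joined."""
    rows = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                rows.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        rows.append((header, "".join(chunks)))
    return rows

def tabular_to_fasta(rows):
    """Inverse move: one record per row, sequence on a single line."""
    return "".join(f">{header}\n{seq}\n" for header, seq in rows)

rows = fasta_to_tabular(">s1 desc\nACGT\nAC\n>s2\nGG\n")
print(rows)
print(tabular_to_fasta(rows), end="")
```

The roundtrip is lossy only in line wrapping: the sequence comes back on one line, which is why a separate reformat step exists.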

FASTA reformat — line-width rewrap (4 workflows, ~2 contexts)

  • cshl_fasta_formatter (devteam) — rewrap to fixed line width. width: "60", single connected input ($IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-quality-control-paired-end/mgnify-amplicon-pipeline-v5-quality-control-paired-end.gxwf.yml:661-664, step “Paired-end post quality control FASTA files”). The width rewrap is the only parameter exercised in the corpus — case folding and header-only output are catalog capabilities with zero uptake here.
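
The width rewrap amounts to joining each record's sequence and re-slicing it at a fixed column. A minimal sketch, assuming plain FASTA text input; `rewrap_fasta` is a hypothetical name and this is not the cshl_fasta_formatter code.

```python
def rewrap_fasta(fasta_text, width=60):
    """Rewrap every record's sequence to fixed-width lines; headers pass through."""
    out = []
    seq = []

    def flush():
        # Emit the buffered sequence in width-sized slices, then reset the buffer.
        joined = "".join(seq)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()

    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()
            out.append(line)
        elif line:
            seq.append(line)
    flush()
    return "\n".join(out) + ("\n" if out else "")

print(rewrap_fasta(">a\nACGTACGT\nAC\n", width=4), end="")
```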

Merge + dedup (4 workflows)

  • fasta_merge_files_and_filter_unique_sequences (galaxyp) — concatenate a set of FASTAs and drop duplicate records in one step. The corpus pins uniqueness_criterion: sequence (dedup by sequence content, not header) with accession_parser: ^>([^ ]+).*$ and batchmode.processmode: merge ($IWC_FORMAT2/microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:1204). The dedup-aware merge is the differentiator from a plain cat of FASTA files.
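
To make the difference from a plain concatenation concrete, here is a minimal sketch of dedup-by-sequence merging. It only illustrates the idea described above: the function names are hypothetical, exact string matching and "first header seen wins" are assumptions of the sketch, and the galaxyp tool's actual tie-breaking is not verified here.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def merge_unique_by_sequence(fasta_texts):
    """Concatenate inputs in order, keeping only the first record per sequence."""
    seen = set()
    merged = []
    for text in fasta_texts:
        for header, seq in parse_fasta(text):
            if seq in seen:      # same sequence, possibly under another header
                continue
            seen.add(seq)
            merged.append((header, seq))
    return merged

a = ">p1\nMKV\n>p2\nAAA\n"
b = ">other_name\nMKV\n>p3\nCCC\n"
print(merge_unique_by_sequence([a, b]))
```

A plain concatenation of `a` and `b` would keep four records; keying on the sequence drops `other_name` because its sequence already appeared under `p1`.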

Sequence length → tabular (3 workflows)

  • fasta_compute_length (devteam) — emit a tabular (id, length) table. keep_first: "0" (all), keep_first_word: false ($IWC_FORMAT2/VGP-assembly-v2/Purge-duplicate-contigs-VGP6/Purge-duplicate-contigs-VGP6.gxwf.yml:1655). Output is tabular, so this is also a sequence→tabular bridge (§5), distinct in purpose from fasta2tab (which carries the sequence too).

Extract / mask by intervals — the #268 shared seam (3 / 2 workflows)

  • bedtools_getfastabed (iuc) — extract FASTA at BED intervals. fasta_source_selector: history, nameOnly/split/strand/tab: false ($IWC_FORMAT2/microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:576).
  • bedtools_maskfastabed (iuc) — mask FASTA regions named by a BED. mc: N (mask character), soft: false (hard mask, lowercase when soft) ($IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-complete/mgnify-amplicon-pipeline-v5-complete.gxwf.yml:3770).

These two consume intervals and produce sequence. #268 deliberately held them out of interval algebra on the output-shape rule (produces sequence records → sequence, not interval); see iwc-interval-operations-survey §5. They are the canonical cross-MOC bridge, not the sequence core.
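
The interval-in, sequence-out shape of both moves can be sketched as follows. This assumes 0-based, half-open BED coordinates and an in-memory `{name: sequence}` genome; the function names are hypothetical and the code is an illustration, not bedtools itself.

```python
def extract_at_intervals(genome, intervals):
    """Return [(name, subsequence)] for each (chrom, start, end) interval."""
    return [(f"{chrom}:{start}-{end}", genome[chrom][start:end])
            for chrom, start, end in intervals]

def mask_at_intervals(genome, intervals, soft=False, mask_char="N"):
    """Return a masked copy of the genome.

    Hard mask (soft=False) overwrites bases with mask_char;
    soft mask (soft=True) lowercases them instead.
    """
    masked = {name: list(seq) for name, seq in genome.items()}
    for chrom, start, end in intervals:
        chars = masked[chrom]
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else mask_char
    return {name: "".join(chars) for name, chars in masked.items()}

genome = {"chr1": "ACGTACGT"}
bed = [("chr1", 2, 5)]
print(extract_at_intervals(genome, bed))
print(mask_at_intervals(genome, bed))
print(mask_at_intervals(genome, bed, soft=True))
```

Both functions return sequence records rather than intervals, which is the output-shape rule that places them on the sequence side of the bridge.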

Extract transcript/CDS FASTA from annotation (4 workflows; pulled in per scope decision)

  • gffread (devteam) — read a GFF/GTF + genome FASTA, emit transcript/CDS FASTA. $IWC_FORMAT2/genome_annotation/annotation-maker/Genome_annotation_with_maker_short.gxwf.yml:406; also annotation-braker3, annotation-helixer, lncRNAs-annotation. The operation is record-level sequence extraction (annotation→sequence, parallel to getfasta’s interval→sequence), so it is in scope per the #272 scope-edge decision — but every corpus instance lives in a genome_annotation pipeline, so it is annotation-domain-flavored. Treat as an in-scope operation with a Bridges note (§5).

Subset by id list — a two-tool redundancy (1 + 1 workflows)

  • seqtk_subseq (iuc) — keep records whose names appear in a connected name list. source.type: name, name_list connected, l: "0" ($IWC_FORMAT2/virology/influenza-isolates-consensus-and-subtyping/influenza-consensus-and-subtyping.gxwf.yml:505).
  • filter_by_fasta_ids (galaxyp) — same job, different tool: header_criteria_select: id_list, identifiers connected, id_regex.find: beginning, dedup: false, plus an output_discarded complement and a sequence_criteria branch ($IWC_FORMAT2/microbiome/metagenomic-genes-catalogue/metagenomic-genes-catalogue.gxwf.yml:934, step “Filter FASTA to keep CDS corresponding to ARGs”). Decision-point in §2.
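
The shared job of the two tools, plus the discarded-complement output that only one of them offers, can be sketched like this. Matching on the first whitespace-delimited word of the header is an assumption of the sketch, and `subset_by_ids` is a hypothetical name rather than either tool's API.

```python
def subset_by_ids(records, wanted_ids):
    """Split (header, sequence) records into (kept, discarded) by record id.

    The record id is taken as the first whitespace-delimited word of the header.
    """
    wanted = set(wanted_ids)
    kept, discarded = [], []
    for header, seq in records:
        record_id = header.split(None, 1)[0] if header.strip() else ""
        target = kept if record_id in wanted else discarded
        target.append((header, seq))
    return kept, discarded

records = [("geneA product=x", "ATG"), ("geneB", "CCC")]
kept, discarded = subset_by_ids(records, ["geneA"])
print(kept)
print(discarded)
```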

Filter by length / translate / FASTQ→FASTA (thin — 1, 1, 3 workflows)

  • fasta_filter_by_length (devteam) — min_length: "0", connected max_length ($IWC_FORMAT2/VGP-assembly-v2/Assembly-decontamination-VGP9/Assembly-decontamination-VGP9.gxwf.yml:515, filtering to short mitochondrial scaffolds).
  • seqkit_translate (iuc) — nucleotide → protein, frame: "1", transl_table: "1" ($IWC_FORMAT2/microbiome/metagenomic-genes-catalogue/metagenomic-genes-catalogue.gxwf.yml:1332).
  • fastq_to_fasta_python (devteam) — FASTQ → FASTA, no parameters; pure record-format interconversion ($IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-quality-control-paired-end/mgnify-amplicon-pipeline-v5-quality-control-paired-end.gxwf.yml:566). In scope as interconversion per the #272 decision, though it sits at the reads-domain edge.

Absent — catalog capabilities with zero corpus uptake

Reverse-complement (standalone — reverse_complement: none appears only as a parameter inside HyPhy codon tools, $IWC_FORMAT2/comparative_genomics/hyphy/hyphy-core.gxwf.yml:157, not a sequence operation), sequence sort, GC/composition compute, standalone dedup (seqkit_rmdup), EMBOSS seqret/transeq, fasta_nucleotide_changer, fastx_collapser, fasta_clipping_histogram. Per corpus-first these are documented gaps, not candidate patterns; none is anti-pattern evidence, just no exemplar.

2. Redundancy / decision-points

Where the corpus shows more than one tool for one job — the boundaries a MOC must adjudicate.

  1. Subset FASTA by id list: seqtk_subseq (iuc) vs filter_by_fasta_ids (galaxyp). Both keep records named by a connected id list. filter_by_fasta_ids adds a discarded-complement output, regex anchoring (id_regex.find: beginning), an in-line dedup, and a sequence_criteria branch; seqtk_subseq is leaner and also does region/range subsetting when given a BED of regions instead of a name list (its l: parameter sets the output line length, not a flank). The two split cleanly by domain (virology reaches for seqtk; proteomics/microbiome for filter_by_fasta_ids), mirroring the bedtools-vs-gops and tp_grep-vs-Grep1 redundancies the sibling surveys resolved. → Q1.
  2. Sequence length / stats — fasta_compute_length (devteam) vs fasta_stats (iuc) vs gfastats (bgruening). fasta_compute_length emits a clean (id, length) table — the record-level move. fasta_stats ($IWC_FORMAT2/genome-assembly/assembly-with-flye/Genome-assembly-with-Flye.gxwf.yml, “Fasta Statistics”) and gfastats are aggregate-statistics / assembly tools. The length-table operation is in scope; aggregate-stats is domain. The page must not absorb assembly stats. → Q2.
  3. Merge FASTA — dedup-aware merge vs plain concatenation. fasta_merge_files_and_filter_unique_sequences merges and dedups by sequence; four workflows reach for it specifically when duplicate records must go (proteomics search DBs, pathogen aggregation). A plain tp_cat/text concat would merge without dedup. The page should make the dedup criterion (uniqueness_criterion: sequence) the reason-to-use, not the merge alone. → Q3.
  4. Interconversion direction vs purpose. fasta2tab appears both as the front half of a roundtrip (→ edit → tab2fasta) and one-way (FASTA into a tabular join, clinicalmp-discovery). One operation page covering both directions, with the roundtrip as a separate recipe? → Q4.

3. Recurring idioms

Single-tool parameter idioms (with citations)

  • fasta_formatter only ever rewraps width in the corpus (width: "60", …quality-control-paired-end…:664). Case folding / header-only modes are unused.
  • fasta_merge_files_and_filter_unique_sequences dedups by sequence, not header (uniqueness_criterion: sequence, …PathoGFAIR…:1204) — the same record can carry different headers across inputs; sequence-identity dedup is the point.
  • fasta2tab splits to exactly two columns (descr_columns: "1", keep_first: "0") so the header is col 1 and the full sequence is col 2 — the shape the relabel recipe (below) depends on.
  • maskfasta picks hard vs soft via soft: (soft: false + mc: N = replace with N; soft: true = lowercase), …mgnify…complete…:3770.

Multi-step recipe — relabel FASTA headers via the tabular detour (the high-value unit)

Invisible to grep; the most reusable sequence construct in the corpus. fasta2tab → tp_find_and_replace on column 1 → tab2fasta: convert records to a (header, sequence) table, rewrite the header column with text processing, convert back. Confirmed tight in $IWC_FORMAT2/microbiome/pathogen-identification/gene-based-pathogen-identification/Gene-based-Pathogen-Identification.gxwf.yml (fasta2tab step 9 :272 → tp_find_and_replace on column: "1" step 12 :347-375 → tab2fasta step 15 :474); the find/replace injects a per-sample id into each header via a connected replace_pattern. The same fasta2tab → tab2fasta envelope recurs in metagenomic-genes-catalogue and PathoGFAIR (interconversion pair co-present, with tabular text-processing between). This is a sequence-record-specific cousin of regex-relabel-via-tabular (which relabels collection identifiers, not record headers), distinct enough to warrant its own page, cross-linked to the tabular relabel patterns. Keep.
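
The three-step recipe can be sketched end to end in a few lines. This is a plain-Python illustration under stated assumptions, not the workflow's actual steps: `relabel_headers` is a hypothetical name, and the find/replace patterns shown (prefixing a sample id) are invented for the example, since the corpus supplies the replacement through a connected input.

```python
import re

def relabel_headers(fasta_text, find_pattern, replace_pattern):
    """Relabel FASTA headers by detouring through a (header, sequence) table."""
    # Step 1 (fasta2tab analogue): FASTA -> rows of [header, sequence].
    rows = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            rows.append([line[1:], ""])
        elif rows and line.strip():
            rows[-1][1] += line.strip()
    # Step 2 (find/replace analogue): regex applied to column 1 only.
    for row in rows:
        row[0] = re.sub(find_pattern, replace_pattern, row[0])
    # Step 3 (tab2fasta analogue): rows -> FASTA.
    return "".join(f">{header}\n{seq}\n" for header, seq in rows)

relabelled = relabel_headers(">contig_1 len=8\nACGT\nACGT\n", r"^(\S+)", r"sampleA_\1")
print(relabelled, end="")
```

Restricting the regex to column 1 is the point of the detour: the sequence column can never be touched by a pattern that happens to match bases.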

4. Candidate pattern boundaries

Operation-/recipe-anchored names per docs/PATTERNS.md. Because the interconversion seam and its recipe carry the value, the keep-set is interconversion-led with a recipe at the top.

Recipe (keep — highest value):

  • relabel-fasta-headers-via-tabular (fasta2tab → tp_find_and_replace on column 1 → tab2fasta). Evidence: confirmed tight in Gene-based-Pathogen-Identification; the same envelope recurs in metagenomic-genes-catalogue and PathoGFAIR (§3). Keep.

Operations (keep ≥2-workflow ones as standalone; thin ones become ingredients/footnotes):

  • sequence-fasta-tabular-interconvert (fasta2tab + tab2fasta, both directions) — Evidence: 4 / 3 workflows. Keep — the strongest standalone operation; it underpins the recipe and the sequence↔tabular bridge. Q4 decides one page vs two.
  • sequence-reformat-line-width (fasta_formatter width rewrap) — Evidence: 4 workflows (≈2 contexts). Keep, scoped tightly to the width-rewrap move (the only one used).
  • sequence-merge-and-dedup (fasta_merge_files_and_filter_unique_sequences) — Evidence: 4 workflows. Keep; lead with the dedup-by-sequence criterion (Q3).
  • sequence-compute-length (fasta_compute_length → id/length table) — Evidence: 3 workflows. Keep, fenced off from aggregate assembly stats (Q2).
  • sequence-extract-at-intervals (getfasta) — Evidence: 3 workflows. Keep, but author as a bridge page shared with #268 (output-shape rule), not a standalone interval-ignorant op.
  • sequence-subset-by-id (seqtk_subseq / filter_by_fasta_ids) — Evidence: 2 workflows across two tools. Keep, flag the redundancy (Q1); lead-tool TBD.
  • sequence-mask-by-intervals (maskfasta) — Evidence: 2 workflows. Keep-or-merge into the extract-at-intervals bridge page as its sibling move. → Q5.
  • sequence-extract-from-annotation (gffread) — Evidence: 4 workflows, all genome_annotation. Keep, flag domain-embedding; or fold into the extract bridge family. → Q5.

Drop as standalone (thin / single-source — document as ingredients or footnotes):

  • filter-by-length (fasta_filter_by_length, 1 wf), translate (seqkit_translate, 1 wf), FASTQ→FASTA (fastq_to_fasta_python, reads-edge). Single-source; mention inside the MOC’s operation list, no standalone pages unless a second exemplar appears. → Q6.

Gaps (document, no page): reverse-complement, sequence sort, composition/GC, standalone dedup, EMBOSS seqret/transeq — §1 “Absent”.

5. Bridges

  • sequence ↔ tabular. The dominant seam. fasta2tab/tab2fasta interconvert; fasta_compute_length emits an (id, length) table; the relabel recipe lives entirely on this seam. The line is: tabular treats the record as opaque columns (header = col 1, sequence = col 2); sequence operations understand FASTA record structure. Cross-link galaxy-tabular-patterns and iwc-tabular-operations-survey.
  • sequence ↔ interval. getfasta (intervals → sequence) and maskfasta (intervals + sequence → masked sequence) are the inverse-facing seam to #268, which already documented these as out-of-interval-scope on the output-shape rule (see iwc-interval-operations-survey §5). Each MOC’s ## Bridges should point at the other; the sequence MOC owns the operations, #268 owns the BED that feeds them. Cross-link galaxy-interval-patterns.
  • alignment → sequence (scope edge). samtools_fastx (BAM → FASTQ/FASTA, 2 workflows: host-contamination-removal, nanopore pre-processing) and bedtools_bamtobed (BAM → BED, owned by #268) convert alignment records to sequence/interval. The sequence MOC notes samtools_fastx as an input bridge; it is not a sequence-record manipulation, so no page.
  • annotation → sequence (scope edge). gffread extracts transcript/CDS FASTA from GFF + genome. Pulled in as an operation per the #272 decision, but flagged here as annotation-domain-resident; the Bridges note links it to the annotation pipelines that produce its GFF input.

6. Out of scope — the assembly elephant and domain consumers

Named so the held-out set is deliberate, not an oversight:

  • gfastats (bgruening, 113 occ / 10 VGP workflows) — assembly FASTA↔GFA conversion + assembly statistics. Domain (assembly). The corpus’s largest FASTA-touching tool; out.
  • Fasta_to_Contig2Bin, concoct_cut_up_fasta, concoct_extract_fasta_bins — metagenomic binning. Domain.
  • peptideshaker fasta_cli — proteomics search-DB build. Domain.
  • fastqc (7 wf), assembly/alignment/annotation/variant/search tools — consume or emit sequence but do not manipulate records. Domain, per the #272 scope rule.

7. Beyond IWC (not surveyed)

Unlike the interval survey, this cluster did not need a GTN cross-reference to justify graduation — the IWC corpus carries enough record-level operations on its own. The absent capabilities in §1 (reverse-complement, sequence sort, composition) are common in GTN training material and the Tool Shed, but per corpus-first they stay gaps until an IWC exemplar appears. No GTN mining was performed this run; flag if /iwc-survey-act wants the absent-everywhere vs IWC-absent distinction the interval survey drew for closest.

8. Open questions

  1. Q1 — subset-by-id lead tool. seqtk_subseq (iuc, leaner, also does ranges) vs filter_by_fasta_ids (galaxyp, discarded-complement + regex anchoring)? Both single-workflow; pick a recommended tool and footnote the other, or present co-equal pending a tiebreak? Evidence §2.1.
  2. Q2 — length vs stats boundary. Confirm sequence-compute-length covers only the (id, length) table and explicitly excludes fasta_stats/gfastats aggregate assembly stats. Evidence §2.2.
  3. Q3 — merge page framing. Lead sequence-merge-and-dedup with the dedup-by-sequence criterion (vs plain concat)? Or split “merge” from “dedup”? Evidence §2.3.
  4. Q4 — interconversion page shape. One sequence-fasta-tabular-interconvert page covering both directions, with relabel-fasta-headers-via-tabular as a separate recipe; or fold the one-way fasta2tab→join case in too? Evidence §1, §3.
  5. Q5 — extract/mask bridge granularity. Is there one sequence-extract-at-intervals bridge page that also holds maskfasta (mask-by-intervals) and gffread (extract-at-annotation) as sibling moves, or three pages? Lean: one bridge page with three moves, given each is thin. Evidence §4, §5. (Resolved in the #272 PR: one page, authored as sequence-extract-by-region with three moves.)
  6. Q6 — does the MOC graduate now, or hold? The gate. Moderate corpus, interconversion-led, recipe-carried, with several thin single-source operations and the motivating interval-bridge tools shared with #268. Graduate a modest interconversion-led MOC, or hold the thin operations (translate, filter-by-length, subset-by-id) until second exemplars appear? Decision owned by /iwc-survey-act. Evidence: TL;DR + §4.
  7. Q7 — MOC naming. Issue specifies slug galaxy-sequence-patterns, title “Galaxy: sequence patterns”, topic tag topic/sequence-transform. (Resolved in the #272 PR: naming confirmed as specified; topic/sequence-transform registered in meta_tags.yml.)
