IWC sequence-record operations survey
Backs #272 — a candidate galaxy-sequence-patterns MOC, fourth sibling to galaxy-collection-patterns, galaxy-tabular-patterns, and galaxy-interval-patterns. Scope is record-level sequence (FASTA / nucleotide-or-protein record) manipulation — operations that read, reshape, subset, or interconvert sequence records — not the domain tools that analyze sequence (alignment, assembly, annotation, variant calling, search, profiling). Same discipline as #268: scope by data-shape family + manipulation algebra, not domain. The issue framed a hold-if-thin gate; the survey’s first job is to size the cluster honestly.
Source corpus: $IWC_FORMAT2/ — 120 .gxwf.yml files. Citations are $IWC_FORMAT2/path plus step label or tool name; line numbers only for stable parameter snippets.
TL;DR — the sizing finding
Sequence-record manipulation is real and slightly broader than interval algebra, but moderate — not the rich cluster the issue’s headline counts imply. The issue’s table ranks operations by tool_id-occurrence count (fasta_formatter 20, getfasta 16, merge 14, …). Those counts reproduce exactly, but they are inflated by the same subworkflow embedding the interval survey caught: distinct workflow counts are far lower.
| Operation (corpus-observed) | tool | tool_id occ | distinct wf |
|---|---|---|---|
| FASTA reformat (line width) | devteam/fasta_formatter (cshl_fasta_formatter) | 20 | 4 |
| Extract sequence at intervals | iuc/bedtools (bedtools_getfastabed) | 16 | 3 |
| Merge FASTA + filter unique sequences | galaxyp/fasta_merge_files_and_filter_unique_sequences | 14 | 4 |
| FASTA → tabular | devteam/fasta_to_tabular (fasta2tab) | 13 | 4 |
| tabular → FASTA | devteam/tabular_to_fasta (tab2fasta) | 11 | 3 |
| Sequence length compute | devteam/fasta_compute_length | 7 | 3 |
| Mask sequence by intervals | iuc/bedtools (bedtools_maskfastabed) | 5 | 2 |
| Extract transcript/CDS FASTA from annotation | devteam/gffread | 8 | 4 |
| Filter by length | devteam/fasta_filter_by_length | 2 | 1 |
| Translate (nt → protein) | iuc/seqkit_translate | 2 | 1 |
| Subset by id list | iuc/seqtk (seqtk_subseq) | 2 | 1 |
| Subset/filter by id list | galaxyp/filter_by_fasta_ids | 3 | 1 |
| FASTQ → FASTA | devteam/fastqtofasta (fastq_to_fasta_python) | 10 | 3 |
Three findings sharpen the gate:
- The healthy core is interconversion + reformat, not the interval-bridge tools.
getfasta/maskfastaare the operations the #268 scope-edge analysis surfaced, but in distinct-workflow terms they are mid-pack (3 / 2). The operations that actually recur are fasta↔tabular interconversion (fasta2tab/tab2fasta, 4 / 3) and reformat (fasta_formatter, 4) and merge+dedup (4). The single most reusable construct is a multi-step recipe — relabel FASTA headers by detouring through tabular — not any one operation (§3). - The 20-occurrence headline (
fasta_formatter) is ~2–3 authoring contexts. All fourfasta_formatterworkflows are the mgnify amplicon family (mgnify-amplicon-pipeline-v5-{quality-control-paired-end,quality-control-single-end,rrna-prediction,complete});…-completeembeds the other three as subworkflows. Same inflation hitsfasta2tab/tab2fasta(the pathogen-identification family) andgetfasta. Distinct-context counts run roughly half the workflow counts above. - The biggest FASTA-touching tool in the corpus is out of scope and must be named, or the inventory looks absurd.
bgruening/gfastatshas 113 tool_id occurrences across 10 VGP-assembly workflows — it dwarfs everything here. But it is an assembly tool (FASTA↔GFA conversion + assembly statistics: “Convert purged fasta to gfa”, “gfastats gfa hap1”,$IWC_FORMAT2/VGP-assembly-v2/...), domain-out per the issue. Counting it would make “sequence manipulation” read as a VGP-assembly concern, which it is not. Held out; recorded here so the hole is deliberate.
Net for the gate (the decision belongs to /iwc-survey-act, not here): a sequence MOC is defensible and modestly broader than interval, but should lead with the interconversion seam and its relabel recipe, keep the thin operations (translate, filter-by-length, subset-by-id) as footnoted ingredients, and treat getfasta/maskfasta as shared bridges with #268, not the headline.
1. What exists — operation inventory
Distinct sequence-record operations, by the move they make. Counts are evidence, not the headline; distinct-workflow counts lead (§TL;DR).
FASTA ↔ tabular interconversion — the strongest seam (4 / 3 workflows)
The corpus reaches for the tabular form whenever it needs to edit sequence records (headers especially), because FASTA headers are awkward to regex in place but trivial as column 1 of a table.
fasta2tab(devteam) — FASTA → tabular, header in col 1, sequence in col 2.descr_columns: "1",keep_first: "0"($IWC_FORMAT2/microbiome/pathogen-identification/gene-based-pathogen-identification/Gene-based-Pathogen-Identification.gxwf.yml:286, step “sample_specific_contigs_tabular_file_preparation”). Also$IWC_FORMAT2/microbiome/metagenomic-genes-catalogue/metagenomic-genes-catalogue.gxwf.yml,$IWC_FORMAT2/microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml,$IWC_FORMAT2/proteomics/clinicalmp/clinicalmp-discovery/iwc-clinicalmp-discovery-workflow.gxwf.yml.tab2fasta(devteam) — tabular → FASTA, the inverse. Co-occurs withfasta2tabin three of the four interconversion workflows (the pathogen-identification + metagenomic-genes family);clinicalmp-discoveryusesfasta2tabone-way only (FASTA into a downstream tabular join).
FASTA reformat — line-width rewrap (4 workflows, ~2 contexts)
cshl_fasta_formatter(devteam) — rewrap to fixed line width.width: "60", single connected input ($IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-quality-control-paired-end/mgnify-amplicon-pipeline-v5-quality-control-paired-end.gxwf.yml:661-664, step “Paired-end post quality control FASTA files”). The width rewrap is the only parameter exercised in the corpus — case folding and header-only output are catalog capabilities with zero uptake here.
Merge + dedup (4 workflows)
fasta_merge_files_and_filter_unique_sequences(galaxyp) — concatenate a set of FASTAs and drop duplicate records in one step. The corpus pinsuniqueness_criterion: sequence(dedup by sequence content, not header) withaccession_parser: ^>([^ ]+).*$andbatchmode.processmode: merge($IWC_FORMAT2/microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:1204). The dedup-aware merge is the differentiator from a plaincatof FASTA files.
Sequence length → tabular (3 workflows)
fasta_compute_length(devteam) — emit a tabular (id, length) table.keep_first: "0"(all),keep_first_word: false($IWC_FORMAT2/VGP-assembly-v2/Purge-duplicate-contigs-VGP6/Purge-duplicate-contigs-VGP6.gxwf.yml:1655). Output is tabular, so this is also a sequence→tabular bridge (§5), distinct in purpose fromfasta2tab(which carries the sequence too).
Extract / mask by intervals — the #268 shared seam (3 / 2 workflows)
bedtools_getfastabed(iuc) — extract FASTA at BED intervals.fasta_source_selector: history,nameOnly/split/strand/tab: false($IWC_FORMAT2/microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:576).bedtools_maskfastabed(iuc) — mask FASTA regions named by a BED.mc: N(mask character),soft: false(hard mask, lowercase when soft) ($IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-complete/mgnify-amplicon-pipeline-v5-complete.gxwf.yml:3770).
These two consume intervals and produce sequence. #268 deliberately held them out of interval algebra on the output-shape rule (produces sequence records → sequence, not interval); see iwc-interval-operations-survey §5. They are the canonical cross-MOC bridge, not the sequence core.
Extract transcript/CDS FASTA from annotation (4 workflows; pulled in per scope decision)
gffread(devteam) — read a GFF/GTF + genome FASTA, emit transcript/CDS FASTA.$IWC_FORMAT2/genome_annotation/annotation-maker/Genome_annotation_with_maker_short.gxwf.yml:406; also annotation-braker3, annotation-helixer, lncRNAs-annotation. The operation is record-level sequence extraction (annotation→sequence, parallel to getfasta’s interval→sequence), so it is in scope per the #272 scope-edge decision — but every corpus instance lives in agenome_annotationpipeline, so it is annotation-domain-flavored. Treat as an in-scope operation with a Bridges note (§5).
Subset by id list — a two-tool redundancy (1 + 1 workflows)
seqtk_subseq(iuc) — keep records whose names appear in a connected name list.source.type: name,name_listconnected,l: "0"($IWC_FORMAT2/virology/influenza-isolates-consensus-and-subtyping/influenza-consensus-and-subtyping.gxwf.yml:505).filter_by_fasta_ids(galaxyp) — same job, different tool:header_criteria_select: id_list,identifiersconnected,id_regex.find: beginning,dedup: false, plus anoutput_discardedcomplement and asequence_criteriabranch ($IWC_FORMAT2/microbiome/metagenomic-genes-catalogue/metagenomic-genes-catalogue.gxwf.yml:934, step “Filter FASTA to keep CDS corresponding to ARGs”). Decision-point in §2.
Filter by length / translate / FASTQ→FASTA (thin — 1, 1, 3 workflows)
fasta_filter_by_length(devteam) —min_length: "0", connectedmax_length($IWC_FORMAT2/VGP-assembly-v2/Assembly-decontamination-VGP9/Assembly-decontamination-VGP9.gxwf.yml:515, filtering to short mitochondrial scaffolds).seqkit_translate(iuc) — nucleotide → protein,frame: "1",transl_table: "1"($IWC_FORMAT2/microbiome/metagenomic-genes-catalogue/metagenomic-genes-catalogue.gxwf.yml:1332).fastq_to_fasta_python(devteam) — FASTQ → FASTA, no parameters; pure record-format interconversion ($IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-quality-control-paired-end/mgnify-amplicon-pipeline-v5-quality-control-paired-end.gxwf.yml:566). In scope as interconversion per the #272 decision, though it sits at the reads-domain edge.
Absent — catalog capabilities with zero corpus uptake
Reverse-complement (standalone — reverse_complement: none appears only as a parameter inside HyPhy codon tools, $IWC_FORMAT2/comparative_genomics/hyphy/hyphy-core.gxwf.yml:157, not a sequence operation), sequence sort, GC/composition compute, standalone dedup (seqkit_rmdup), EMBOSS seqret/transeq, fasta_nucleotide_changer, fastx_collapser, fasta_clipping_histogram. Per corpus-first these are documented gaps, not candidate patterns; none is anti-pattern evidence, just no exemplar.
2. Redundancy / decision-points
Where the corpus shows more than one tool for one job — the boundaries a MOC must adjudicate.
- Subset FASTA by id list —
seqtk_subseq(iuc) vsfilter_by_fasta_ids(galaxyp). Both keep records named by a connected id list.filter_by_fasta_idsadds a discarded-complement output, regex anchoring (id_regex.find: beginning), an in-linededup, and asequence_criteriabranch;seqtk_subseqis leaner and also does region/range subsetting (l:flank). Split cleanly by domain (virology reaches for seqtk; proteomics/microbiome for filter_by_fasta_ids), mirroring thebedtools-vs-gopsandtp_grep-vs-Grep1redundancies the sibling surveys resolved. → Q1. - Sequence length / stats —
fasta_compute_length(devteam) vsfasta_stats(iuc) vsgfastats(bgruening).fasta_compute_lengthemits a clean (id, length) table — the record-level move.fasta_stats($IWC_FORMAT2/genome-assembly/assembly-with-flye/Genome-assembly-with-Flye.gxwf.yml, “Fasta Statistics”) andgfastatsare aggregate-statistics / assembly tools. The length-table operation is in scope; aggregate-stats is domain. The page must not absorb assembly stats. → Q2. - Merge FASTA — dedup-aware merge vs plain concatenation.
fasta_merge_files_and_filter_unique_sequencesmerges and dedups by sequence; four workflows reach for it specifically when duplicate records must go (proteomics search DBs, pathogen aggregation). A plaintp_cat/text concat would merge without dedup. The page should make the dedup criterion (uniqueness_criterion: sequence) the reason-to-use, not the merge alone. → Q3. - Interconversion direction vs purpose.
fasta2tabappears both as the front half of a roundtrip (→ edit →tab2fasta) and one-way (FASTA into a tabular join,clinicalmp-discovery). One operation page covering both directions, with the roundtrip as a separate recipe? → Q4.
3. Recurring idioms
Single-tool parameter idioms (with citations)
fasta_formatteronly ever rewraps width in the corpus (width: "60",…quality-control-paired-end…:664). Case folding / header-only modes are unused.fasta_merge_files_and_filter_unique_sequencesdedups by sequence, not header (uniqueness_criterion: sequence,…PathoGFAIR…:1204) — the same record can carry different headers across inputs; sequence-identity dedup is the point.fasta2tabsplits to exactly two columns (descr_columns: "1",keep_first: "0") so the header is col 1 and the full sequence is col 2 — the shape the relabel recipe (below) depends on.maskfastapicks hard vs soft viasoft:(soft: false+mc: N= replace with N;soft: true= lowercase),…mgnify…complete…:3770.
Multi-step recipe — relabel FASTA headers via the tabular detour (the high-value unit)
Invisible to grep; the most reusable sequence construct in the corpus. fasta2tab → tp_find_and_replace on column 1 → tab2fasta — convert records to a (header, sequence) table, rewrite the header column with text processing, convert back. Confirmed tight in $IWC_FORMAT2/microbiome/pathogen-identification/gene-based-pathogen-identification/Gene-based-Pathogen-Identification.gxwf.yml (fasta2tab step 9 :272 → tp_find_and_replace on column: "1" step 12 :347-375 → tab2fasta step 15 :474); the find/replace injects a per-sample id into each header via a connected replace_pattern. The same fasta2tab…tab2fasta envelope recurs in metagenomic-genes-catalogue and PathoGFAIR (interconversion pair co-present, with tabular text-processing between). This is a sequence-record-specific cousin of regex-relabel-via-tabular (which relabels collection identifiers, not record headers) — distinct enough to warrant its own page, cross-linked to the tabular relabel patterns. Keep.
4. Candidate pattern boundaries
Operation-/recipe-anchored names per docs/PATTERNS.md. Because the interconversion seam and its recipe carry the value, the keep-set is interconversion-led with a recipe at the top.
Recipe (keep — highest value):
recipe: relabel-fasta-headers-via-tabular— §3 recipe. Evidence: 3 workflows, one tight-confirmed wiring. Keep. The flagship sequence construct; cross-link regex-relabel-via-tabular and galaxy-tabular-patterns.
Operations (keep ≥2-workflow ones as standalone; thin ones become ingredients/footnotes):
sequence-fasta-tabular-interconvert(fasta2tab+tab2fasta, both directions) — Evidence: 4 / 3 workflows. Keep — the strongest standalone operation; it underpins the recipe and the sequence↔tabular bridge. Q4 decides one page vs two.sequence-reformat-line-width(fasta_formatterwidth rewrap) — Evidence: 4 workflows (≈2 contexts). Keep, scoped tightly to the width-rewrap move (the only one used).sequence-merge-and-dedup(fasta_merge_files_and_filter_unique_sequences) — Evidence: 4 workflows. Keep; lead with the dedup-by-sequence criterion (Q3).sequence-compute-length(fasta_compute_length→ id/length table) — Evidence: 3 workflows. Keep, fenced off from aggregate assembly stats (Q2).sequence-extract-at-intervals(getfasta) — Evidence: 3 workflows. Keep, but author as a bridge page shared with #268 (output-shape rule), not a standalone interval-ignorant op.sequence-subset-by-id(seqtk_subseq/filter_by_fasta_ids) — Evidence: 2 workflows across two tools. Keep, flag the redundancy (Q1); lead-tool TBD.sequence-mask-by-intervals(maskfasta) — Evidence: 2 workflows. Keep-or-merge into the extract-at-intervals bridge page as its sibling move. → Q5.sequence-extract-from-annotation(gffread) — Evidence: 4 workflows, all genome_annotation. Keep, flag domain-embedding; or fold into the extract bridge family. → Q5.
Drop as standalone (thin / single-source — document as ingredients or footnotes):
- filter-by-length (
fasta_filter_by_length, 1 wf), translate (seqkit_translate, 1 wf), FASTQ→FASTA (fastq_to_fasta_python, reads-edge). Single-source; mention inside the MOC’s operation list, no standalone pages unless a second exemplar appears. → Q6.
Gaps (document, no page): reverse-complement, sequence sort, composition/GC, standalone dedup, EMBOSS seqret/transeq — §1 “Absent”.
5. Bridges (cross-shape seams the MOC should cross-link)
- sequence ↔ tabular. The dominant seam.
fasta2tab/tab2fastainterconvert;fasta_compute_lengthemits a (id, length) table; the relabel recipe lives entirely on this seam. The line is: tabular treats the record as opaque columns (header = col 1, sequence = col 2); sequence operations understand FASTA record structure. Cross-link galaxy-tabular-patterns and iwc-tabular-operations-survey. - sequence ↔ interval.
getfasta(intervals → sequence) andmaskfasta(intervals + sequence → masked sequence) are the inverse-facing seam to #268, which already documented these as out-of-interval-scope on the output-shape rule (see iwc-interval-operations-survey §5). Each MOC’s## Bridgesshould point at the other; the sequence MOC owns the operations, #268 owns the BED that feeds them. Cross-link galaxy-interval-patterns. - alignment → sequence (scope edge).
samtools_fastx(BAM → FASTQ/FASTA, 2 workflows: host-contamination-removal, nanopore pre-processing) andbedtools_bamtobed(BAM → BED, owned by #268) convert alignment records to sequence/interval. The sequence MOC notessamtools_fastxas an input bridge; it is not a sequence-record manipulation, so no page. - annotation → sequence (scope edge).
gffreadextracts transcript/CDS FASTA from GFF + genome. Pulled in as an operation per the #272 decision, but flagged here as annotation-domain-resident; the Bridges note links it to the annotation pipelines that produce its GFF input.
6. Out of scope — the assembly elephant and domain consumers
Named so the held-out set is deliberate, not an oversight:
gfastats(bgruening, 113 occ / 10 VGP workflows) — assembly FASTA↔GFA conversion + assembly statistics. Domain (assembly). The corpus’s largest FASTA-touching tool; out.Fasta_to_Contig2Bin,concoct_cut_up_fasta,concoct_extract_fasta_bins— metagenomic binning. Domain.peptideshaker fasta_cli— proteomics search-DB build. Domain.fastqc(7 wf), assembly/alignment/annotation/variant/search tools — consume or emit sequence but do not manipulate records. Domain, per the #272 scope rule.
7. Beyond IWC (not surveyed)
Unlike the interval survey, this cluster did not need a GTN cross-reference to justify graduation — the IWC corpus carries enough record-level operations on its own. The absent capabilities in §1 (reverse-complement, sequence sort, composition) are common in GTN training material and the Tool Shed, but per corpus-first they stay gaps until an IWC exemplar appears. No GTN mining was performed this run; flag if /iwc-survey-act wants the absent-everywhere vs IWC-absent distinction the interval survey drew for closest.
8. Open questions
- Q1 — subset-by-id lead tool.
seqtk_subseq(iuc, leaner, also does ranges) vsfilter_by_fasta_ids(galaxyp, discarded-complement + regex anchoring)? Both single-workflow; pick a recommended tool and footnote the other, or present co-equal pending a tiebreak? Evidence §2.1. - Q2 — length vs stats boundary. Confirm
sequence-compute-lengthcovers only the (id, length) table and explicitly excludesfasta_stats/gfastatsaggregate assembly stats. Evidence §2.2. - Q3 — merge page framing. Lead
sequence-merge-and-dedupwith the dedup-by-sequence criterion (vs plain concat)? Or split “merge” from “dedup”? Evidence §2.3. - Q4 — interconversion page shape. One
sequence-fasta-tabular-interconvertpage covering both directions, withrelabel-fasta-headers-via-tabularas a separate recipe; or fold the one-wayfasta2tab→joincase in too? Evidence §1, §3. - Q5 — extract/mask bridge granularity. Is there one
sequence-extract-at-intervalsbridge page that also holdsmaskfasta(mask-by-intervals) andgffread(extract-at-annotation) as sibling moves, or three pages? Lean: one bridge page with three moves, given each is thin. Evidence §4, §5. (Resolved in the #272 PR: one page, authored assequence-extract-by-regionwith three moves.) - Q6 — does the MOC graduate now, or hold? The gate. Moderate corpus, interconversion-led, recipe-carried, with several thin single-source operations and the motivating interval-bridge tools shared with #268. Graduate a modest interconversion-led MOC, or hold the thin operations (translate, filter-by-length, subset-by-id) until second exemplars appear? Decision owned by
/iwc-survey-act. Evidence: TL;DR + §4. - Q7 — MOC naming. Issue specifies slug
galaxy-sequence-patterns, title “Galaxy: sequence patterns”, topic tagtopic/sequence-transform. (Resolved in the #272 PR: naming confirmed as specified;topic/sequence-transformregistered inmeta_tags.yml.)