IWC genomic interval operations survey
Backs #268 — a candidate galaxy-interval-patterns MOC, third sibling to galaxy-collection-patterns and galaxy-tabular-patterns. Scope is coordinate-aware feature algebra (operations that understand chrom/start/end/strand), not the domain tools that produce or consume intervals. The issue framed this explicitly as a hold-if-thin gate: the survey’s first job is to size the corpus honestly and say plainly whether interval algebra is rich enough to graduate a MOC.
Source corpus: $IWC_FORMAT2/ — 120 .gxwf.yml files. Citations are $IWC_FORMAT2/path:line. Distinct-step counts below exclude the trailing unique_tools: manifest block (which double-counts) and note where subworkflow embedding inflates a raw grep.
TL;DR — the sizing finding
Interval algebra is real but moderate, and narrower than its two siblings. It clusters in two domains — epigenetics peak-consensus (bedtools) and SARS-CoV-2 consensus/masking (the legacy “Operate on Genomic Intervals” gops_* family) — plus scattered coverage-track generation. The highest-value units are multi-step recipes, not single operations; individually the operations are thin (slop and subtract appear once each).
Two findings sharpen the gate:
- The operation that motivated #268 —
closest/ proximity (the bedtools-closest“nearest feature + distance” move) — has zero interval-algebra uptake. A grep forclosesthits exactly one workflow, and it is an output label (“successful VAPOR runs - closest references”,$IWC_FORMAT2/virology/influenza-isolates-consensus-and-subtyping/influenza-consensus-and-subtyping.gxwf.yml:30), not a tool. There is nobedtools_closestbed, no legacy “fetch closest non-overlapping feature”, and nowindowbedanywhere in the corpus. Proximity as a task (nearest TSS/gene + distance) is also entirely absent from IWC — no peak annotators (ChIPseeker/HOMER/GREAT/UROPA), nocomputeMatrix. Per corpus-first, proximity gets a gap note, not a pattern page. The hole that prompted the issue cannot be filled from IWC. The natural tool —bedtools_closestbed(-dfor distance) — exists in the sameiuc/bedtoolssuite the corpus already uses nine subcommands from, so this is a corpus gap, not a Galaxy-capability gap: with no recurring exemplar there is no common pattern to capture, and an implementing agent reaches for the tool directly. (The GTNcomputeMatrixidiom noted in §GTN cross-reference is a narrower, ChIP/ATAC-specific proximity, not a general substitute.) - Several catalog operations are also absent:
complement,makewindows,map,window(-wneighborhood join),annotate, and bedtoolsjaccard— zero each. (An earlier loose grep matched “jaccard” 11×; all are QIIME2 beta-diversity distance matrices —$IWC_FORMAT2/amplicon/qiime2/qiime2-III-VI-downsteam/QIIME2-VI-diversity-metrics-and-estimations.gxwf.yml:126— not the bedtools interval-similarity tool. Struck.)
So the recommendation this survey enables (the decision belongs to /iwc-survey-act, not here): a MOC is defensible but should lead with recipes and stay small; do not pad it with speculative operation pages for absent operations.
1. What exists — operation inventory
Distinct interval-algebra steps, by operation. Counts are evidence, not the headline.
Overlap / intersect — the strongest single operation (5 distinct workflows)
bedtools_intersectbed(iuc) — keep/annotate features by overlap with a second feature set.$IWC_FORMAT2/epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:629(“get merged peaks overlapping at least x replicates”) —overlap_mode: [-wa, -wb],reduce_or_iterate_selector: iterate. Same step in the two sibling workflows:.../consensus-peaks-chip-pe.gxwf.yml:581,.../consensus-peaks-chip-sr.gxwf.yml:579.$IWC_FORMAT2/sars-cov-2-variant-calling/sars-cov-2-ont-artic-variant-calling/ont-artic-variation.gxwf.yml:648(“Adjusted variant calls within primer binding sites”) — intersect variants against a primer-binding-site BED,overlap_mode: [-wa](report A only),once: "true".
bedtools_multiintersectbed(iuc) — N-way overlap across a collection of feature sets at once.$IWC_FORMAT2/epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:219(“compute multi intersect”) —tag_select: tag, input is a collection of per-replicate narrowPeak files,cluster: false,filler: "0". Siblings:.../consensus-peaks-chip-pe.gxwf.yml:195,.../consensus-peaks-chip-sr.gxwf.yml:194.
vcfvcfintersect(devteam) — VCF-native intersection, stays in VCF space rather than converting to BED.$IWC_FORMAT2/sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-variant-calling/pe-artic-variation.gxwf.yml:891—isect_union: -i,invert: true(keep variants absent from the second VCF), requires a reference FASTA.
Coverage / quantification — two distinct sub-operations sharing a name
bedtools_genomecoveragebed(iuc) — whole-genome coverage as a bedgraph. Two purposes in the corpus:- Mask input:
$IWC_FORMAT2/sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:160(“Depth of coverage”) —report_select: bg,zero_regions: true,split: true; feeds a depth-threshold filter (below). - Scaled coverage track:
$IWC_FORMAT2/transcriptomics/rnaseq-pe/rnaseq-pe.gxwf.yml:614(“Scaled Coverage both strands combined”) —report_select: bg, a connectedscaleparameter (1/million-reads normalization). Same in$IWC_FORMAT2/transcriptomics/rnaseq-sr/rnaseq-sr.gxwf.yml:582.
- Mask input:
bedtools_coveragebed(iuc) — count/measure reads falling in given regions (region-restricted, not genome-wide).$IWC_FORMAT2/epigenetics/atacseq/atacseq.gxwf.yml:733(“Compute coverage on summits +/-500kb”) — inputA = merged windows, inputB = deduplicated-reads BAM,reduce_or_iterate_selector: iterate,hist: false,mean: false.
Merge overlapping features (~3 workflows)
bedtools_mergebed(iuc):$IWC_FORMAT2/epigenetics/atacseq/atacseq.gxwf.yml:672(“Merge summits +/-500kb”) —distance: "0",c_and_o_argument_repeat: [];$IWC_FORMAT2/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation.gxwf.yml:3512.gops_merge_1(devteam, “Operate on Genomic Intervals”):$IWC_FORMAT2/sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:445(“Combined low-coverage regions and filter-failed sites”) —returntype: true.
Set algebra — concat / subtract (legacy gops_* only)
gops_concat_1(devteam) — concatenate interval sets.$IWC_FORMAT2/sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:420(sameformat: false);$IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:299(“SSU BED”, combining per-feature-type rRNA prediction BEDs,sameformat: true). The mgnify raw grep count (17) is inflated — the same rRNA-prediction subworkflow is embedded several times insidemgnify-amplicon-pipeline-v5-complete.gxwf.yml(:2178,:2205,:2261, …); distinct authoring contexts are ~2.gops_subtract_1(devteam) — interval difference. Single occurrence:$IWC_FORMAT2/sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:467(“Masking regions”) —min: "1",returntype: -p.
Window / flank (slop) — single occurrence
bedtools_slopbed(iuc):$IWC_FORMAT2/epigenetics/atacseq/atacseq.gxwf.yml:547(“get summits +/-500kb”) —addition: {addition_select: b, b: "500"}(extend 500 bp both sides),genome_file_opts: loc. Label-drift gotcha: the step is labelled ”+/-500kb” and renamed “1kb around each summit”, butb: "500"is 500 bp (a 1 kb total window). Loose labels are common; trust the parameter, not the label.
Absent — catalog capabilities with zero corpus uptake
closest/proximity, complement, makewindows, map (interval→value mapping), window (-w neighborhood join), annotate, jaccard. Per corpus-first these are documented gaps, not candidate patterns. None is anti-pattern evidence — just no exemplar.
2. Redundancy / decision-points
Where the corpus shows more than one tool competing for the same job — the boundaries the MOC must adjudicate.
- Modern
bedtools_*(iuc) vs legacygops_*“Operate on Genomic Intervals” (devteam). The corpus splits cleanly by domain, not by capability: epigenetics (atacseq,consensus-peaks) and VGP reach forbedtools_*; the SARS-CoV-2consensus-from-variationmasking pipeline usesgops_concat+gops_merge+gops_subtract. This mirrors theGrep1-vs-tp_grep_toolredundancy the tabular survey resolved (see iwc-tabular-operations-survey §7 /docs/PATTERNS.md“Legacy tools as footnotes”). Caveat against a clean 1:1 mapping:bedtoolshasmergebed(matchesgops_merge) but the corpus has nobedtools_subtractbed, andgops_concatis barely interval-aware (it is closer to plain concatenation than to set union). So a “recommend bedtools, footnote gops” rule is directionally right formerge, butsubtract/concatneed their own call. → Q1. - VCF-native vs BED-native intersection. Intersecting variants with regions is done two ways: convert variants to a BED-shaped feature set and use
bedtools_intersectbed(ont-artic-variation.gxwf.yml:648), or stay in VCF space withvcfvcfintersect(pe-artic-variation.gxwf.yml:891). The decision turns on whether downstream needs VCF semantics (INFO/FORMAT preserved) or just coordinates. → Q2. - “Coverage” is two operations.
genomecoveragebed(genome-wide bedgraph, for masking input or a normalized track) andcoveragebed(reads-in-given-regions count) share a name and a tool family but answer different questions. A single “coverage” pattern page would have to hold both modes explicitly. → Q4.
3. Recurring idioms
Single-tool parameter idioms (with citations)
bedtools_intersectbedalways maps over a collection viareduce_or_iterate_selector: iterate(consensus-peaks-atac-cutandrun.gxwf.yml:660,ont-artic-variation.gxwf.yml:675).reduceis never used in the corpus.overlap_modeis the real lever:[-wa, -wb]to report both sides (:657),[-wa]to report only the A feature (ont-artic-variation.gxwf.yml:673).genomecoveragebedfor a track pinsreport_select: bgwith a connectedscaleparameter; for masking it addszero_regions: trueso zero-depth spans become explicit rows the downstream filter can catch (consensus-from-variation.gxwf.yml:181-185).bedtools_slopbedneeds a genome file (genome_file_opts_selector: loc) to clamp extensions at chromosome ends — coordinate-awareness the tabular equivalent (tabular-synthesize-bed-from-3col) cannot provide.
Multi-step recipes (the high-value units)
These are invisible to grep and are where interval algebra actually earns a MOC.
- Consensus regions by multi-intersect-then-threshold —
multiintersectacross a collection of per-replicate peak sets →Filter1on the replicate-count column →intersectthe merged-sample peaks back against the thresholded consensus regions. Three sibling workflows, identical shape:consensus-peaks-atac-cutandrun.gxwf.yml:219→:318(filter)→:629;consensus-peaks-chip-pe.gxwf.yml:195→…→:581;consensus-peaks-chip-sr.gxwf.yml:194→…→:579. The replicate-count threshold is itself computed (table_computemin +wc_gnu,consensus-peaks-atac-cutandrun.gxwf.yml:264-316). - Build a masking BED by interval set algebra —
genomecoverage(bedgraph) →Filter1 c4 < threshold(“Low-coverage regions”,consensus-from-variation.gxwf.yml:275) → build variant-site BEDs via coordinate arithmetic (below) →gops_concat(:420) →gops_merge(:445) →gops_subtract(:467). Net:mask = (low-coverage ∪ filter-failed-sites) − called-variant-sites. This is the flagship interval-algebra recipe and the nearest corpus analogue to the kind of region computation #268 wanted — but it is set algebra, not proximity. - Windowed coverage around features —
slop(fixed ±N bp window around summits) →merge(collapse overlapping windows) →coverage(reads per window):atacseq.gxwf.yml:547→:672→:733.
4. Candidate pattern boundaries
Operation-anchored names per docs/PATTERNS.md. Because the operations are thin and the recipes carry the value, the keep-set leans recipe-heavy — the inverse of the tabular survey.
Recipes (keep — highest value):
recipe: interval-consensus-by-multi-intersect— the §3 recipe #1. Evidence: three sibling workflows. Keep. This is the single most reusable interval construct in the corpus.recipe: interval-mask-by-set-algebra— the §3 recipe #2 (concat → merge → subtract to compute a region mask). Evidence:consensus-from-variation. Single workflow, but it is the canonical “compute regions from regions” move and exercises three set operations at once. Keep, flag single-source.recipe: windowed-coverage-around-features— the §3 recipe #3 (slop → merge → coverage). Evidence:atacseq. Single workflow. Keep-or-merge — could fold into a coverage operation page as its worked example. → Q5.
Operations (keep the ones with ≥2 distinct workflows; the rest become recipe ingredients, not standalone pages):
interval-overlap-filter(intersect / multi-intersect) —bedtools_intersectbed+bedtools_multiintersectbed, with theiterate+overlap_modeidioms and the vcf-native alternative as a footnote. Evidence: 5 workflows. Keep — the one solid standalone operation page.interval-coverage—genomecoveragebed(track / mask input) +coveragebed(reads-in-regions), two modes. Evidence: 4 workflows. Keep, but the page must split the two modes (Q4).interval-merge-overlapping—bedtools_mergebed/gops_merge. Evidence: 3 workflows. Keep (also a set-algebra-recipe ingredient).interval-window-flank(slop) — single occurrence; the label-drift gotcha is the only real teaching content. Drop as standalone; document inside the windowed-coverage recipe.interval-set-subtract/interval-concat— single / inflated occurrences, legacygops_*only. Drop as standalone; document inside the mask-by-set-algebra recipe, which is where they actually appear.
Gaps (document, no page): closest/proximity, complement, makewindows, map, window, annotate, jaccard — §1 “Absent”.
5. Bridges (cross-shape seams the MOC should cross-link)
- interval ↔ collection. Every
bedtools_intersectbed/coveragebedstep usesreduce_or_iterate_selector: iterateto map over a per-sample collection (consensus-peaks-atac-cutandrun.gxwf.yml:660). The interval MOC’s## Bridgesshould point at galaxy-collection-patterns for the map-over mechanics. - interval ↔ tabular. Two directions observed: (a) coordinate construction —
column_makerbuilds BED start/end from VCF POS with indel-length arithmeticint(c2) - (len(c3) == 1)/int(c2) + ((len(c3) - 1) or 1)thenchange_datatype: bed(consensus-from-variation.gxwf.yml:367-377), a coordinate-aware cousin of tabular-synthesize-bed-from-3col; (b) coordinate filtering —Filter1 c4 < Non a bedgraph treats the interval file as a plain table (consensus-from-variation.gxwf.yml:275). The line #268 drew (tabular = opaque columns; interval = coordinate-aware) is exactly this seam. Cross-link galaxy-tabular-patterns. - interval ↔ alignment / sequence (scope edge — out of scope here).
bedtools_getfastabed(16),bedtools_maskfastabed(5), andbedtools_bamtobed(8) consume intervals but produce sequence or convert format, not interval-algebra output. They belong to a prospectivegalaxy-sequence-patternsMOC (a healthy corpus cluster —fasta_formatter20,fasta_to_tabular/tab2fasta13/11,fasta_merge_files_and_filter_unique_sequences14,fasta_compute_length7, …), tracked as the #268 follow-up. Documented here only to fix the boundary; no candidate patterns.
GTN cross-reference (training corpus, not IWC)
Non-corpus signal, kept separate from the IWC claims above. These citations are into the GTN training-material repo (galaxyproject/training-material), not the IWC corpus — they do not satisfy corpus-first and graduate no pattern page. They are recorded because they informed two decisions: promoting slop to its own page, and characterising which gaps are merely IWC-absent vs absent-everywhere. Distinct-tutorial counts (raw reference counts collapse to few tutorials):
slop—topics/transcriptomics/tutorials/clipseq/tutorial.md:478(SlopBed→Extract Genomic DNA): extend crosslink sites to a window before fetching sequence. Same “extend to a neighborhood” shape as the IWCatacsequse — a genuine independent second example, the basis for theinterval-window-flankpage.complement— one tutorial,topics/assembly/tutorials/ecoli_comparison/tutorial.md:965(ComplementBed, to find unaligned/unique regions). Zero IWC. A first example in the wrong corpus, not a second — elevates complement from silent gap to “GTN-attested, IWC-absent,” not to a page.- coordinate-aware
sort— three tutorials (topics/assembly/tutorials/ecoli_comparison,topics/epigenetics/tutorials/formation_of_super-structures_on_xi/tutorial.md:639,topics/single-cell/tutorials/scatac-preprocessing-tenx/tutorial.md:256) usebedtools_sortbed. IWC sorts intervals with tabularsort1instead — a real interval↔tabular seam. Not in the IWC gap list above because IWC does the operation, just not coordinate-aware; recorded here as a GTN-surfaced candidate. Zero IWCbedtools_sortbed. closest/proximity — no interval-algebra form in GTN (nobedtools_closestbed, no fetch-closest, nowindowbed). But proximity as a task is present in GTN, only ever as a domain step:deeptools computeMatrixin reference-point mode (signal aggregated relative to nearest TSS) is taught across ~6 epigenetics tutorials (topics/epigenetics/tutorials/{atac-seq,cut_and_run,tal1-binding-site-identification,estrogen-receptor-binding-site-identification,formation_of_super-structures_on_xi,methylation-seq}). That is proximity-adjacent (reference-point signal, not “nearest feature + distance”) and a domain/visualization tool — out of scope per #268. (Apparent ChIPseeker/UROPA usage was a grep artifact:uropamatched “Europa” URLs;chipseekerappears only in an admin tool-install snippet, not an analysis tutorial.) Net:bedtools_closestbedis the natural interval-algebra tool for nearest-feature-plus-distance and exists in-suite, but has zero uptake in either corpus — so no pattern page (corpus-first), a corpus gap rather than a capability gap. The GTNcomputeMatrixidiom is a different, narrower proximity (signal aligned at TSS), not a general substitute for it.
6. Open questions
- Q1 —
bedtools_*vsgops_*recommendation. Adopt the tabular “modern tool leads, legacy footnoted” rule (recommendbedtools_mergebed, footnotegops_merge)? It applies cleanly tomerge, butgops_subtracthas no corpusbedtools_subtractbedandgops_concatis barely interval-aware. Per-operation call needed; evidence in §2.1. - Q2 — VCF-native vs BED-native intersect. Which does the overlap page lead with, and does
vcfvcfintersectget a “stay in VCF space when INFO/FORMAT must survive” sidebar? Evidence: §2.2. - Q3 — does the MOC graduate now, or hold? The gate. Real but moderate corpus, recipe-led, with the motivating operation (
closest) absent and several catalog operations absent. Graduate a small recipe-led MOC, or hold until a second proximity/complement exemplar appears? This is the decision/iwc-survey-actowns; the survey’s job was to surface that the corpus is moderate-not-rich and proximity is empty. Evidence: TL;DR + §4. - Q4 — coverage page shape. One
interval-coveragepage with two explicit modes (genome-wide track vs reads-in-regions), or two pages? Evidence: §2.3. - Q5 — recipe vs operation granularity. Because operations are thin and recipes carry the value, should
slop/subtract/concatexist only as ingredients documented inside the three recipe pages, with just three standalone operation pages (overlap-filter,coverage,merge)? Lean: yes. Evidence: §4. - Q6 — naming of the MOC topic tag. Issue specifies
topic/interval-transformand title “Galaxy: genomic interval patterns”. Confirm before the MOC and tag are authored (the tag must be registered inmeta_tags.ymlfirst).