Interval: build a mask by set algebra

Compute regions from regions: concatenate candidate intervals, merge into non-overlapping spans, then subtract the set to keep. The gops_* set-algebra recipe.

Revised: 2026-06-10 (Rev 1)

Use this recipe when the output you need is a set of regions computed from other regions (a mask, a keep-list, a complement-like set), built by combining and differencing interval sets rather than overlapping them. The SARS-CoV-2 consensus-from-variation workflow uses it to compute the genome positions to hard-mask before calling a consensus sequence. It is the flagship interval-algebra recipe in the corpus and the closest corpus analogue to the “compute a region set” task that motivated this MOC, though it works by set algebra (union and difference), not proximity.

It is built entirely from the legacy “Operate on Genomic Intervals” gops_* family, which is co-equal with bedtools here (and the only corpus tool for subtract/concat).

The algebra

The target is: mask = (low-coverage ∪ filter-failed-sites) − called-variant-sites. Once the input sets are built (step 1), three set operations follow in order: concatenate, merge, subtract.

1. Build the candidate interval sets

Two inputs arrive as BED-shaped intervals:

  • Low-coverage regions — from interval-coverage Mode A: genomecoveragebed (bedgraph, zero_regions: true) → Filter1 on c4 < threshold (“Low-coverage regions”). The depth threshold is composed at runtime (compose_text_param building c4 < N). This is a tabular filter on coordinate data; see tabular-filter-by-column-value.

  • Variant-site BEDs — the called and filter-failed variants are turned into BED intervals with column_maker, computing start/end from the VCF POS with indel-length arithmetic:

    ops:
      expressions:
        - { cond: "int(c2) - (len(c3) == 1)", add_column: { mode: R, pos: "2" } }
        - { cond: "int(c2) + ((len(c3) - 1) or 1)", add_column: { mode: R, pos: "3" } }

    then change_datatype: bed. This is a coordinate-aware cousin of tabular-synthesize-bed-from-3col — it accounts for the variant’s reference-allele span, not just a fixed 3-column copy.
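
The two column_maker expressions can be traced by hand with a minimal Python sketch. It assumes, as the text above implies, that c2 holds the VCF POS and c3 holds the reference allele; the function name, the contig name, and the example variants are illustrative, not taken from the workflow.

```python
def variant_to_bed(chrom: str, pos: int, ref: str) -> tuple[str, int, int]:
    """Mirror the two column_maker expressions (assumes c2 = POS, c3 = reference allele)."""
    # int(c2) - (len(c3) == 1): the comparison is a bool, which Python treats as 0 or 1
    start = pos - (len(ref) == 1)
    # int(c2) + ((len(c3) - 1) or 1): falls back to 1 when the reference allele is one base
    end = pos + ((len(ref) - 1) or 1)
    return chrom, start, end


# Illustrative inputs: one single-base reference allele, one three-base reference allele.
single_base = variant_to_bed("NC_045512.2", 100, "A")
multi_base = variant_to_bed("NC_045512.2", 100, "ACG")

print(single_base)  # ('NC_045512.2', 99, 101)
print(multi_base)   # ('NC_045512.2', 100, 102)
```

The interval for the three-base reference allele starts after the anchor base and spans the remaining two bases, which a fixed 3-column copy of POS would not capture.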

2. Concatenate (union, with overlaps)

gops_concat_1 (“Concatenate two datasets into one dataset”) joins the low-coverage regions and the filter-failed sites into one interval set (“Cocatenated low-coverage regions and filter-failed sites” — the workflow’s own typo preserved). sameformat: false lets it concatenate interval files that differ in column count.

3. Merge (collapse the union)

gops_merge_1 collapses the concatenated set into non-overlapping spans (“Combined low-coverage regions and filter-failed sites”, returntype: true). See interval-merge-overlapping. After this the union is clean.
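
What concatenate and merge each contribute can be sketched in plain Python. This is a model of the two operations on half-open (chrom, start, end) tuples, not the gops_* implementation; the function names and example coordinates are hypothetical, and the choice to join book-ended intervals is an assumption of this sketch.

```python
Interval = tuple[str, int, int]  # (chrom, start, end), half-open like BED


def concat(a: list[Interval], b: list[Interval]) -> list[Interval]:
    """Stack two interval lists; overlaps between them are left intact."""
    return list(a) + list(b)


def merge(intervals: list[Interval]) -> list[Interval]:
    """Collapse overlapping or book-ended intervals into non-overlapping spans."""
    merged: list[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical inputs for illustration only.
low_coverage = [("chr", 0, 50), ("chr", 200, 260)]
filter_failed = [("chr", 40, 60), ("chr", 259, 261)]

stacked = concat(low_coverage, filter_failed)
combined = merge(stacked)

print(len(stacked))  # 4 rows, two of which overlap a low-coverage region
print(combined)      # [('chr', 0, 60), ('chr', 200, 261)]
```

Only after the merge does each masked position belong to exactly one span.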

4. Subtract (difference out what to keep)

gops_subtract_1 removes the called-variant sites from the merged union, leaving the positions to mask (“Masking regions”). Corpus state: min: "1" (minimum 1 bp overlap to subtract), returntype: -p (return the pieces remaining).

tool_id: toolshed.g2.bx.psu.edu/repos/devteam/subtract/gops_subtract_1/1.0.0
tool_state:
  input1: { __class__: ConnectedValue }   # merged union
  input2: { __class__: ConnectedValue }   # called-variant sites (to keep)
  min: "1"
  returntype: -p
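
The "pieces remaining" behaviour can be modelled with a short Python sketch. It assumes half-open (chrom, start, end) tuples and non-overlapping input spans (which the merge step provides); the function name and coordinates are hypothetical, and the code is a model of the difference operation, not the gops_subtract_1 implementation.

```python
Interval = tuple[str, int, int]  # (chrom, start, end), half-open like BED


def subtract(spans: list[Interval], removals: list[Interval]) -> list[Interval]:
    """Return the pieces of each span that remain after cutting out overlapping removals."""
    pieces: list[Interval] = []
    for chrom, start, end in spans:
        cursor = start  # left edge of the part of this span not yet emitted or removed
        for r_chrom, r_start, r_end in sorted(removals):
            # Skip removals that share no base with the remaining part of the span.
            if r_chrom != chrom or r_end <= cursor or r_start >= end:
                continue
            if r_start > cursor:
                pieces.append((chrom, cursor, r_start))
            cursor = max(cursor, r_end)
        if cursor < end:
            pieces.append((chrom, cursor, end))
    return pieces


# Hypothetical inputs for illustration only.
merged_union = [("chr", 0, 60), ("chr", 200, 261)]
called_variants = [("chr", 20, 21), ("chr", 300, 301)]

masking_regions = subtract(merged_union, called_variants)
print(masking_regions)  # [('chr', 0, 20), ('chr', 21, 60), ('chr', 200, 261)]
```

The first span is split around the called variant, while the second is returned whole because no called variant overlaps it.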

Why this shape

You cannot get “regions that are low-coverage-or-failed but not called” from a single overlap. Union (concat+merge) builds the candidate mask; difference (subtract) carves out the variants you are deliberately keeping. Each step is one set operation; the order matters (merge before subtract, so the subtraction operates on clean spans).
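
The target expression can be checked position by position with Python sets. This is a sanity-check sketch on hypothetical single-contig coordinates, not part of the workflow: it expands each half-open interval into its individual positions and applies union and difference directly.

```python
def positions(intervals: list[tuple[int, int]]) -> set[int]:
    """Expand half-open (start, end) intervals into the set of positions they cover."""
    return {p for start, end in intervals for p in range(start, end)}


# Hypothetical coordinates for illustration only.
low_coverage = [(0, 50), (200, 260)]
filter_failed = [(40, 60)]
called_variants = [(20, 21), (45, 46)]

# mask = (low-coverage ∪ filter-failed-sites) − called-variant-sites
mask = (positions(low_coverage) | positions(filter_failed)) - positions(called_variants)

print(20 in mask)  # False: a called variant inside a low-coverage region is kept
print(55 in mask)  # True: a filter-failed position outside the low-coverage regions is masked
print(len(mask))   # 118
```

Positions 40 to 49 are covered by both candidate sets but counted once, which is the job the merge step does on intervals.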

Pitfalls

  • zero_regions: true upstream is load-bearing. If the coverage step omits zero-depth rows, fully-uncovered spans never enter the union and the mask misses them. See interval-coverage.
  • Concat is not merge. gops_concat stacks intervals (overlaps intact); without the following gops_merge, the “union” is not actually unioned and the subtract operates on overlapping garbage.
  • Indel span arithmetic. The len(c3) term in the BED construction accounts for multi-base reference alleles; copying a fixed 3-column BED (tabular-synthesize-bed-from-3col) would mis-size indel intervals by one or more bp.
  • gops_* is co-equal, not deprecated. Don’t “modernize” this recipe to bedtools — bedtools has no corpus subtract/concat, and the recipe is attested end-to-end in gops_*. See iwc-interval-operations-survey.
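
The first pitfall can be illustrated with a small Python sketch of the depth filter. The bedgraph rows, the threshold of 5, and the function name are hypothetical; the sketch only assumes that the fourth column (c4) carries the depth, as in the c4 < threshold filter described above.

```python
BedgraphRow = tuple[str, int, int, int]  # (chrom, start, end, depth)


def low_coverage(rows: list[BedgraphRow], threshold: int) -> list[tuple[str, int, int]]:
    """Keep the coordinates of bedgraph rows whose depth (c4) is below the threshold."""
    return [(chrom, start, end) for chrom, start, end, depth in rows if depth < threshold]


# Hypothetical coverage track that includes a zero-depth row.
with_zero_rows = [("chr", 0, 10, 0), ("chr", 10, 40, 3), ("chr", 40, 100, 57)]
# The same track as it would look if zero-depth rows were never reported.
without_zero_rows = [row for row in with_zero_rows if row[3] != 0]

print(low_coverage(with_zero_rows, 5))     # [('chr', 0, 10), ('chr', 10, 40)]
print(low_coverage(without_zero_rows, 5))  # [('chr', 10, 40)]
```

In the second call the fully uncovered span never reaches the filter, so it cannot enter the union or the mask.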

See also

IWC Exemplars (1 anchor)

sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation (high)

Full concat -> merge -> subtract set-algebra computing the consensus masking regions.

  • Cocatenated low-coverage regions and filter-failed sites
  • Combined low-coverage regions and filter-failed sites
  • Masking regions

Incoming References (5)

  • Galaxy: genomic interval patterns (related pattern): Use this MOC to choose corpus-grounded Galaxy genomic interval operations and recipes on coordinate features.
  • Interval: compute coverage (related pattern): Two coverage modes: genome-wide depth as a bedgraph (genomecoveragebed) and reads counted in given regions (coveragebed). Same family, different question.
  • Interval: merge overlapping features (related pattern): Collapse overlapping or book-ended intervals within one set into single spans; bedtools mergebed or the gops_merge Operate-on-Genomic-Intervals tool.
  • Interval: filter or annotate by overlap (related pattern): Keep, drop, or annotate coordinate features by overlap with a second feature set; bedtools intersect (BED) or vcfvcfintersect (VCF), mapped over a collection.
  • IWC Interval Operations Survey (related note): IWC corpus survey of coordinate-aware genomic interval operations; sizing and candidate boundaries for a galaxy-interval-patterns MOC, with hold-if-thin gate.