find-test-data

Search IWC fixtures and public sources for test data matching a data-flow shape.

draft mold

Mold health

error
  • Source layout

    1 non-index Markdown file with frontmatter.

  • Axis fields

    generic fields are coherent.

  • Eval plan

    Abstract oracle: fixture-independent property checks any run must satisfy.

    eval.md declares properties and check type.

  • Scenarios

    Concrete cases: fixtures bound to expected values, run against the eval properties.

    scenarios.md declares cases bound to fixtures.

  • Typed refs

    2 typed references; 0 resolver issues.

  • On-demand triggers

    All on-demand references describe triggers.

  • Evidence checks

    Hypothesis references include verification.

axis
generic
name
find-test-data
contract

Reference Loading

Typed Mold references describe what casting consumes and when the generated skill should load each artifact.

Research: iwc-test-data-conventions

Background synthesis loaded by explicit progressive-disclosure metadata.

Purpose
Match IWC test-data conventions — Zenodo/remote-URL first, SHA-1 integrity, per-input collection layout — when selecting or describing candidate fixtures.
Trigger
When choosing where test data should come from or describing the shape a candidate file must have.

Cast artifacts

  • Claude skill find-test-data: Search IWC fixtures and public sources for test data matching a data-flow shape.


Artifact handoffs

/ pipeline contract

Produces

Consumes

find-test-data

Resolve concrete test data for the workflow’s inputs. Read the interface brief for each input’s Galaxy shape and datatype, and the source summary for the data the source itself names — then search IWC fixtures and public sources for data that matches. Emit test-data-refs.json: one entry per input, each carrying a URL or path plus the expected shape, ready for implement-galaxy-workflow-test to stage.
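As an illustration of what one test-data-refs.json entry might carry, here is a minimal Python sketch. The page does not specify the schema, so the field names (input_label, shape, datatype, resolved, location, reason) and the top-level "refs" key are assumptions, and the example.org URL is a stand-in for a real source rather than an actual dataset.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class DataRef:
    """One entry per workflow input. All field names are hypothetical."""

    input_label: str                # addressable workflow input label
    shape: str                      # File / list / paired / list:paired / record
    datatype: str                   # Galaxy datatype, e.g. "tabular"
    resolved: bool                  # False marks an honest gap
    location: Optional[str] = None  # URL or path; stays None when unresolved
    reason: Optional[str] = None    # why an unresolved input has no source


def to_refs_json(refs: List[DataRef]) -> str:
    """Serialize entries under an assumed top-level "refs" key."""
    return json.dumps({"refs": [asdict(r) for r in refs]}, indent=2)


refs = [
    DataRef("counts", "File", "tabular", True,
            location="https://example.org/counts.tabular"),  # stand-in URL
    DataRef("sample sheet", "File", "csv", False,
            reason="no real source of the right shape found"),
]
print(to_refs_json(refs))
```

Keeping location empty on the unresolved entry mirrors the idea that a gap is reported as a gap, with a reason, instead of being filled with a path.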

This Mold is the first leg of the harness’s test-data-resolution branch. It resolves what it can and reports gaps; the harness routes any unresolved input to the user-supplied fallthrough. Deciding to ask the user is a harness concern, not this Mold’s — its job is an honest, source-backed match.

Sequence

  1. Enumerate inputs and their required shape. From the interface brief, list each workflow input: label, Galaxy collection shape (File / list / paired / list:paired / record), and datatype. This is the target shape every match must satisfy.
  2. Mine the source summary for named data. The interface and data-flow briefs are design artifacts — they deliberately drop dataset provenance. The source summary (freeform-summary / summary-nextflow / summary-cwl) is where the source names its data: sample-data locations, accessions, public-data candidates, fallback bundles, and sizing guidance (“one chromosome”, “precomputed count matrix”, “small subset”). Pull every candidate dataset and every data-sizing instruction the source gives.
  3. Match each named candidate against the required shape — and don’t stop at a shape mismatch. Check each candidate against step 1’s target shape and datatype. A candidate that is the wrong shape (e.g. raw signal / reads named when the input is a count matrix) is not a resolution — but it is also not the end of the search. When the source’s named candidates don’t fit, follow the source’s own guidance to the right-shape public artifact: if the source says the input is a precomputed count matrix, find the canonical public count matrix for that study/domain (GEO/ENCODE/ArrayExpress series, a published supplementary table) rather than reporting “no data.” “Named candidates are the wrong shape” ≠ “no data exists.”
  4. Search IWC fixtures and public sources. Prefer existing IWC test data for the same domain — it already conforms to iwc-test-data-conventions (remote URL, recorded hash, known collection layout); a near-neighbour IWC -tests.yml is the strongest source. Otherwise resolve the right-shape public dataset found in step 3, sized for a fast test run.
  5. “Small” is a documented subset of a real source, not a fabricated stand-in. When the source asks for a small fixture (one chromosome, selected loci, a few samples) and only a full real dataset exists, that input is resolved: record the real source URL plus the data-import-boundary prep needed to reach the small shape (row-subset by key, column/sample split into the collection’s element identifiers). The prep is a note on the ref, not an analysis step and not an excuse to mark the input unresolved. Resolve the data; leave analysis parameters (factors, thresholds, reference levels, top-N) to the design Molds — they are not this Mold’s to decide.
  6. Emit refs. Write one test-data-refs.json entry per input: the URL/path, the expected Galaxy shape, datatype, element identifiers when it is a collection, integrity hash when known, and any subset/split prep. Per galaxy-workflow-testability-design, make sure each entry maps to an addressable input label.
  7. Report genuine gaps only. Mark resolved: false with a reason only when steps 2–5 turn up no real source of the right shape, not merely because the source’s first-named candidate was the wrong shape or because a real source needs a documented subset. These honest gaps are what the harness hands to the user-supplied fallthrough.
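The matching logic in the sequence above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the Mold's implementation: RequiredInput, Candidate, resolve_input, and the entry keys are hypothetical names, the candidate list is assumed to be ordered by preference (IWC fixtures first, then public sources), and the example.org URLs are stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RequiredInput:
    """Target shape from the interface brief (step 1)."""
    label: str
    shape: str      # File / list / paired / list:paired / record
    datatype: str


@dataclass
class Candidate:
    """A real dataset found in the source summary, IWC, or a public source."""
    location: str
    shape: str
    datatype: str
    prep: Optional[str] = None  # documented subset/split note (step 5)


def resolve_input(required: RequiredInput,
                  candidates: List[Candidate]) -> Dict[str, object]:
    """Return a ref entry; a shape mismatch skips a candidate, not the search."""
    for cand in candidates:
        if (cand.shape, cand.datatype) != (required.shape, required.datatype):
            continue  # wrong shape is not a resolution; keep looking
        entry: Dict[str, object] = {
            "input_label": required.label,
            "resolved": True,
            "location": cand.location,
            "shape": cand.shape,
            "datatype": cand.datatype,
        }
        if cand.prep:
            entry["prep"] = cand.prep  # a note on the ref, not an analysis step
        return entry
    # Genuine gap: no candidate of the right shape was found (step 7).
    return {
        "input_label": required.label,
        "resolved": False,
        "reason": "no real source of the required shape and datatype found",
    }


required = RequiredInput("count matrix", "File", "tabular")
candidates = [
    # Named by the source but the wrong shape for this input.
    Candidate("https://example.org/raw_reads.fastq.gz", "paired", "fastqsanger.gz"),
    # Full real dataset; the small fixture is a documented subset of it.
    Candidate("https://example.org/full_counts.tabular", "File", "tabular",
              prep="row-subset by key to one chromosome"),
]
entry = resolve_input(required, candidates)
print(entry["resolved"], entry["location"])
```

In this sketch a candidate that needs a subset still resolves the input, and the unresolved entry carries no location at all, so nothing downstream can mistake it for a usable path.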

No fabrication

Never invent a URL, accession, or path to make an input look resolved, and never emit a placeholder path (sampleA.tabular, test-data/…) for an input you could not resolve — an unresolved input stays resolved: false all the way through to the test, never papered over with a made-up path downstream. A wrong-but-plausible fixture reference is worse than an honest gap: it survives static checks and fails only at run time, far from this Mold. Every emitted ref must point at data that exists (a real source, optionally plus a reproducible subset/split); everything else is a reported gap.
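A consistency check along these lines could catch the most mechanical violations before the refs are handed on. It is a hedged sketch: lint_refs, sha1_matches, and the entry keys (input_label, resolved, location, reason) are hypothetical, and a static check like this cannot prove that a URL points at real data, so it complements the rule above without replacing it.

```python
import hashlib
from typing import Dict, List


def lint_refs(refs: List[Dict[str, object]]) -> List[str]:
    """Flag entries whose resolved flag and location disagree."""
    problems: List[str] = []
    for ref in refs:
        label = ref.get("input_label", "<unlabelled>")
        location = ref.get("location")
        if ref.get("resolved"):
            if not location:
                problems.append(f"{label}: marked resolved but has no location")
        else:
            if location:
                # An unresolved input must not carry a made-up path.
                problems.append(
                    f"{label}: unresolved input carries a path ({location})")
            if not ref.get("reason"):
                problems.append(f"{label}: unresolved input has no reason")
    return problems


def sha1_matches(data: bytes, expected_hex: str) -> bool:
    """Compare downloaded bytes against a recorded SHA-1 hex digest."""
    return hashlib.sha1(data).hexdigest() == expected_hex.lower()


refs = [
    {"input_label": "counts", "resolved": False,
     "location": "sampleA.tabular", "reason": "no source found"},
    {"input_label": "design", "resolved": False, "reason": "no source found"},
]
for problem in lint_refs(refs):
    print(problem)
```

The SHA-1 helper applies only once the data has actually been fetched; it checks integrity against a recorded hash and says nothing about whether the reference was honest in the first place.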

Incoming References (3)

  • INTERVIEW → GALAXY (phase of pipeline): Interview-driven path to a Galaxy gxformat2 workflow through the shared freeform-summary handoff.
  • PAPER → CWL (phase of pipeline): Direct path from a paper to a CWL Workflow + CommandLineTool set.
  • PAPER → GALAXY (phase of pipeline): Direct path from a paper to a Galaxy gxformat2 workflow. No CWL intermediate.