find-test-data
Resolve concrete test data for the workflow’s inputs. Read the interface brief for each input’s Galaxy shape and datatype, and the source summary for the data the source itself names — then search IWC fixtures and public sources for data that matches. Emit test-data-refs.json: one entry per input, each carrying a URL or path plus the expected shape, ready for implement-galaxy-workflow-test to stage.
This Mold is the first leg of the harness’s test-data-resolution branch. It resolves what it can and reports gaps; the harness routes any unresolved input to the user-supplied fallthrough. Deciding to ask the user is a harness concern, not this Mold’s — its job is an honest, source-backed match.
Sequence
1. Enumerate inputs and their required shape. From the interface brief, list each workflow input: label, Galaxy collection shape (File / list / paired / list:paired / record), and datatype. This is the target shape every match must satisfy.
2. Mine the source summary for named data. The interface and data-flow briefs are design artifacts; they deliberately drop dataset provenance. The source summary (freeform-summary / summary-nextflow / summary-cwl) is where the source names its data: sample-data locations, accessions, public-data candidates, fallback bundles, and sizing guidance (“one chromosome”, “precomputed count matrix”, “small subset”). Pull every candidate dataset and every data-sizing instruction the source gives.
3. Match each named candidate against the required shape, and don’t stop at a shape mismatch. Check each candidate against step 1’s target shape and datatype. A candidate that is the wrong shape (e.g. raw signal / reads named when the input is a count matrix) is not a resolution, but it is also not the end of the search. When the source’s named candidates don’t fit, follow the source’s own guidance to the right-shape public artifact: if the source says the input is a precomputed count matrix, find the canonical public count matrix for that study/domain (GEO/ENCODE/ArrayExpress series, a published supplementary table) rather than reporting “no data.” “Named candidates are the wrong shape” ≠ “no data exists.”
4. Search IWC fixtures and public sources. Prefer existing IWC test data for the same domain: it already conforms to iwc-test-data-conventions (remote URL, recorded hash, known collection layout), and a near-neighbour IWC -tests.yml is the strongest source. Otherwise resolve the right-shape public dataset found in step 3, sized for a fast test run.
5. “Small” is a documented subset of a real source, not a fabricated stand-in. When the source asks for a small fixture (one chromosome, selected loci, a few samples) and only a full real dataset exists, that input is resolved: record the real source URL plus the data-import-boundary prep needed to reach the small shape (row-subset by key, column/sample split into the collection’s element identifiers). The prep is a note on the ref, not an analysis step and not an excuse to mark the input unresolved. Resolve the data. Leave analysis parameters (factors, thresholds, reference levels, top-N) to the design Molds; they are not this Mold’s to decide.
6. Emit refs. Write one test-data-refs.json entry per input: the URL/path, the expected Galaxy shape, datatype, element identifiers when it is a collection, integrity hash when known, and any subset/split prep. Per galaxy-workflow-testability-design, make sure each entry maps to an addressable input label.
7. Report genuine gaps only. Mark resolved: false with a reason only when steps 2–5 turn up no real source of the right shape, not merely because the source’s first-named candidate was the wrong shape or because a real source needs a documented subset. These honest gaps are what the harness hands to user-supplied.
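The matching and emit steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not this Mold's actual schema: only `resolved: false` comes from the text, while the other field names (`input_label`, `shape`, `datatype`, `url`, `prep`, `reason`), the `InputSpec` / `Candidate` types, the `resolve` helper, and the `example.org` URLs are hypothetical and used purely for illustration.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InputSpec:
    """One workflow input as read from the interface brief."""
    label: str
    shape: str      # File / list / paired / list:paired / record
    datatype: str


@dataclass
class Candidate:
    """A dataset found in the source summary, IWC fixtures, or a public source."""
    url: str
    shape: str
    datatype: str
    prep: Optional[str] = None  # documented subset/split note, if any


def resolve(spec: InputSpec, candidates: List[Candidate]) -> dict:
    """Return one ref entry: the first candidate matching shape and datatype,
    or an honest gap. A wrong-shape candidate is skipped, not treated as a match."""
    for cand in candidates:
        if cand.shape == spec.shape and cand.datatype == spec.datatype:
            return {
                "input_label": spec.label,
                "resolved": True,
                "url": cand.url,
                "shape": spec.shape,
                "datatype": spec.datatype,
                "prep": cand.prep,
            }
    return {
        "input_label": spec.label,
        "resolved": False,
        "shape": spec.shape,
        "datatype": spec.datatype,
        "reason": "no candidate of the required shape and datatype",
    }


# Illustrative input: the workflow wants a single count matrix.
spec = InputSpec(label="counts", shape="File", datatype="tabular")

# Illustrative candidates (placeholder URLs, not real data locations).
candidates = [
    # Named by the source, but raw reads are the wrong shape for this input.
    Candidate("https://example.org/raw_reads", "list:paired", "fastqsanger.gz"),
    # Right shape; the full matrix needs a documented subset to become small.
    Candidate("https://example.org/counts.tsv", "File", "tabular",
              prep="row-subset by key to one chromosome"),
]

ref = resolve(spec, candidates)
print(json.dumps(ref, indent=2))
```

The sketch only covers the comparison itself. Widening the candidate list after a shape mismatch, by following the source's guidance to a right-shape public artifact, is search work that happens before `resolve` is called.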
No fabrication
Never invent a URL, accession, or path to make an input look resolved, and never emit a placeholder path (sampleA.tabular, test-data/…) for an input you could not resolve — an unresolved input stays resolved: false all the way through to the test, never papered over with a made-up path downstream. A wrong-but-plausible fixture reference is worse than an honest gap: it survives static checks and fails only at run time, far from this Mold. Every emitted ref must point at data that exists (a real source, optionally plus a reproducible subset/split); everything else is a reported gap.
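One way to keep a gap from being papered over is a structural check on each emitted entry. The sketch below is an assumption, not part of this Mold: the `check_ref` helper and the `url`, `path`, and `reason` field names are hypothetical. It also cannot prove that the data exists; it only catches an unresolved entry that carries a location, or a resolved entry that lacks a usable one.

```python
from typing import List
from urllib.parse import urlparse


def check_ref(ref: dict) -> List[str]:
    """Return a list of problems with one ref entry (empty list means it passes)."""
    problems = []
    location = ref.get("url") or ref.get("path")

    if not ref.get("resolved"):
        # A gap must stay a gap: no URL or path, and a stated reason.
        if location:
            problems.append("unresolved input must not carry a URL or path")
        if not ref.get("reason"):
            problems.append("unresolved input needs a reason")
        return problems

    if not location:
        problems.append("resolved input needs a URL or path")
    elif ref.get("url"):
        parsed = urlparse(ref["url"])
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("URL is not an absolute http(s) URL")
    return problems


# Illustrative entries (hypothetical field names and placeholder values).
honest_gap = {"resolved": False, "reason": "no public count matrix found"}
papered_over = {"resolved": False, "path": "sampleA.tabular", "reason": "none found"}

print(check_ref(honest_gap))
print(check_ref(papered_over))
```

Confirming that a resolved URL actually points at real data still requires fetching it, and comparing the integrity hash when one is recorded.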