Agent Skill · cast

find-test-data

Search IWC fixtures and public sources for test data matching a data-flow shape.

Install with Claude Code

/plugin marketplace add galaxyproject/foundry
/plugin install foundry-skills@galaxy-workflow-foundry

Then invoke as:

/foundry-skills:find-test-data

Install with Codex

codex plugin marketplace add galaxyproject/foundry
codex plugin add foundry-skills@galaxy-workflow-foundry

Then select with /skills or invoke explicitly as:

$find-test-data

Skill Bundle

/ packaged cast

attached files: 2
upfront: 0
on demand: 2
cast rev: n/a
validated: 0

Produces: 1 artifact.

Consumes: 6 artifacts.

Artifact Contract

/ skill handoff

Produces

test-data-refs

Test data matched from IWC fixtures or public sources, expressed as URLs/paths plus expected shapes for downstream test authoring.

jsontest-data-refs.json

Raw artifact contract

{
  "id": "test-data-refs",
  "kind": "json",
  "default_filename": "test-data-refs.json",
  "description": "Test data matched from IWC fixtures or public sources, expressed as URLs/paths plus expected shapes for downstream test authoring."
}

Consumes

freeform-summary

Source summary from [[summarize-paper]] / [[interview-to-freeform-summary]]; mine its sample-data, public-data-candidate, accession, and data-sizing guidance — this is where the source's dataset evidence lives (the design briefs strip it).

Raw artifact contract

{
  "id": "freeform-summary",
  "description": "Source summary from [[summarize-paper]] / [[interview-to-freeform-summary]]; mine its sample-data, public-data-candidate, accession, and data-sizing guidance — this is where the source's dataset evidence lives (the design briefs strip it).",
  "producers": [
    "interview-to-freeform-summary",
    "summarize-paper"
  ]
}

summary-nextflow

Source summary from [[summarize-nextflow]]; mine its test_fixtures / sample-data evidence when running the NEXTFLOW → GALAXY pipeline.

Raw artifact contract

{
  "id": "summary-nextflow",
  "description": "Source summary from [[summarize-nextflow]]; mine its test_fixtures / sample-data evidence when running the NEXTFLOW → GALAXY pipeline.",
  "inherited_schema": "[[summary-nextflow]]",
  "producers": [
    "summarize-nextflow"
  ]
}

summary-cwl

Source summary from [[summarize-cwl]]; mine its test-data / sample-data evidence when running the CWL → GALAXY pipeline.

Raw artifact contract

{
  "id": "summary-cwl",
  "description": "Source summary from [[summarize-cwl]]; mine its test-data / sample-data evidence when running the CWL → GALAXY pipeline.",
  "inherited_schema": "[[summary-cwl]]",
  "producers": [
    "summarize-cwl"
  ]
}

freeform-galaxy-interface

Galaxy interface brief from [[freeform-summary-to-galaxy-interface]] pinning input labels, collection shapes, and datatypes for the PAPER / INTERVIEW → GALAXY pipelines.

Raw artifact contract

{
  "id": "freeform-galaxy-interface",
  "description": "Galaxy interface brief from [[freeform-summary-to-galaxy-interface]] pinning input labels, collection shapes, and datatypes for the PAPER / INTERVIEW → GALAXY pipelines.",
  "producers": [
    "freeform-summary-to-galaxy-interface"
  ]
}

nextflow-galaxy-interface

Galaxy interface brief from [[nextflow-summary-to-galaxy-interface]] pinning input labels, collection shapes, and datatypes for the NEXTFLOW → GALAXY pipeline.

Raw artifact contract

{
  "id": "nextflow-galaxy-interface",
  "description": "Galaxy interface brief from [[nextflow-summary-to-galaxy-interface]] pinning input labels, collection shapes, and datatypes for the NEXTFLOW → GALAXY pipeline.",
  "producers": [
    "nextflow-summary-to-galaxy-interface"
  ]
}

cwl-galaxy-interface

Galaxy interface brief from [[cwl-summary-to-galaxy-interface]] pinning input labels, collection shapes, and datatypes for the CWL → GALAXY pipeline.

Raw artifact contract

{
  "id": "cwl-galaxy-interface",
  "description": "Galaxy interface brief from [[cwl-summary-to-galaxy-interface]] pinning input labels, collection shapes, and datatypes for the CWL → GALAXY pipeline.",
  "producers": [
    "cwl-summary-to-galaxy-interface"
  ]
}

Attached Files

/ runtime references

Load on demand

research

galaxy-workflow-testability-design

packaged

Confirm the collection shapes and labels the resolved data must populate so downstream tests can address them.

Trigger: When mapping an input label in the data-flow brief to a concrete file or collection of files.

on-demand runtime verbatim corpus-observed deterministic 11.4 KB

bundle: references/notes/galaxy-workflow-testability-design.md
source: content/research/galaxy-workflow-testability-design.md

Preview md

---
type: research
subtype: component
tags:
  - research/component
  - target/galaxy
status: draft
created: 2026-05-03
revised: 2026-05-06
revision: 2
ai_generated: true
related_notes:
  - "[[iwc-workflow-testability-survey]]"
  - "[[iwc-test-data-conventions]]"
  - "[[planemo-asserts-idioms]]"
  - "[[iwc-shortcuts-anti-patterns]]"
  - "[[planemo-workflow-test-architecture]]"
  - "[[implement-galaxy-workflow-test]]"
  - "[[gxformat2-schema]]"
  - "[[gxformat2-workflow-inputs]]"
  - "[[galaxy-datatypes-conf]]"
summary: "Design guidance for Galaxy workflow inputs, outputs, and checkpoints that make IWC-style workflow tests possible."
---

# Galaxy workflow testability design

Use this note when authoring or translating a Galaxy workflow **before** the `-tests.yml` file exists. It covers workflow structure choices that make later IWC-style tests meaningful: labels, promoted checkpoints, collection identifiers, and fixture-compatible inputs.

This is not a `content/patterns/` page. It is cross-cutting design guidance for Molds that need testable Galaxy workflows. Assertion syntax lives in [[planemo-asserts-idioms]]. Test YAML fixture shapes live in [[iwc-test-data-conventions]]. Accepted shortcut vs smell calls live in [[iwc-shortcuts-anti-patterns]]. Corpus evidence trail lives in [[iwc-workflow-testability-survey]].

## 1. Treat labels as API

Workflow input and output labels are not cosmetic. Planemo and IWC tests address workflow inputs and outputs by label, and the survey found exact label matches for every asserted output across 114 matched workflow/test pairs. A generated workflow should therefore pick stable, descriptive labels before test authoring starts.

Rules:

- Label every output that may need a test assertion.
- Treat input/output renames as breaking changes
...

research

iwc-test-data-conventions

packaged

Match IWC test-data conventions — Zenodo/remote-URL first, SHA-1 integrity, per-input collection layout — when selecting or describing candidate fixtures.

Trigger: When choosing where test data should come from or describing the shape a candidate file must have.

on-demand runtime verbatim corpus-observed deterministic 21.1 KB

bundle: references/notes/iwc-test-data-conventions.md
source: content/research/iwc-test-data-conventions.md

Preview md

---
type: research
subtype: component
tags:
  - research/component
  - target/galaxy
status: draft
created: 2026-04-30
revised: 2026-05-03
revision: 3
ai_generated: true
related_notes:
  - "[[galaxy-workflow-testability-design]]"
  - "[[iwc-shortcuts-anti-patterns]]"
  - "[[planemo-asserts-idioms]]"
  - "[[implement-galaxy-workflow-test]]"
  - "[[tests-format]]"
  - "[[iwc-tabular-operations-survey]]"
summary: "How IWC workflows organize and reference test data — Zenodo-first, SHA-1 integrity, collection shapes, CVMFS gotchas."
---

# IWC test data conventions

Reference for an agent implementing or editing a `<workflow>-tests.yml` in IWC style. All evidence cited from `/Users/jxc755/projects/repositories/workflow-fixtures/iwc-src/workflows/` (raw IWC clone) and `workflows/README.md`. Authoritative spec: [planemo.readthedocs.io/en/latest/test_format.html](https://planemo.readthedocs.io/en/latest/test_format.html). The companion analysis at `/Users/jxc755/projects/repositories/galaxy-brain/vault/projects/workflow_state/skills/COMPONENT_GALAXY_WORKFLOW_TESTING.md` is the synthesized source for several normative claims here.

This note owns **test YAML fixture shapes**. For workflow-structure choices that make those fixtures possible before the test file exists, use [[galaxy-workflow-testability-design]].

## 1. Where does test data live? Remote vs in-repo

Two storage patterns. They mix freely inside one job.

**Remote `location:` (default for any non-trivial input).** Strongly preferred for anything bigger than a toy fixture. Order of preference, observed in the corpus:

- **Zenodo** — overwhelming default, persistent DOI-backed URL.
  - `read-preprocessing/short-read-qc-trimming/short-read-quality-control-and-trimming-tests.yml:13,17` — `https://zenodo.org/records/11484
...

SKILL.md


# find-test-data

Follow the procedure below and use the artifact/reference sections as the runtime contract.

## When To Use

- Search IWC fixtures and public sources for test data matching a data-flow shape.

## Inputs

- Read artifact `freeform-summary`. Produced by `interview-to-freeform-summary`, `summarize-paper`. Source summary from summarize-paper / interview-to-freeform-summary; mine its sample-data, public-data-candidate, accession, and data-sizing guidance — this is where the source's dataset evidence lives (the design briefs strip it).
- Read artifact `summary-nextflow`. Schema: summary-nextflow. Produced by `summarize-nextflow`. Source summary from summarize-nextflow; mine its test_fixtures / sample-data evidence when running the NEXTFLOW → GALAXY pipeline.
- Read artifact `summary-cwl`. Schema: summary-cwl. Produced by `summarize-cwl`. Source summary from summarize-cwl; mine its test-data / sample-data evidence when running the CWL → GALAXY pipeline.
- Read artifact `freeform-galaxy-interface`. Produced by `freeform-summary-to-galaxy-interface`. Galaxy interface brief from freeform-summary-to-galaxy-interface pinning input labels, collection shapes, and datatypes for the PAPER / INTERVIEW → GALAXY pipelines.
- Read artifact `nextflow-galaxy-interface`. Produced by `nextflow-summary-to-galaxy-interface`. Galaxy interface brief from nextflow-summary-to-galaxy-interface pinning input labels, collection shapes, and datatypes for the NEXTFLOW → GALAXY pipeline.
- Read artifact `cwl-galaxy-interface`. Produced by `cwl-summary-to-galaxy-interface`. Galaxy interface brief from cwl-summary-to-galaxy-interface pinning input labels, collection shapes, and datatypes for the CWL → GALAXY pipeline.

## Outputs

- Write artifact `test-data-refs` as `test-data-refs.json`. Format: `json`. Test data matched from IWC fixtures or public sources, expressed as URLs/paths plus expected shapes for downstream test authoring.

## Required Tools

- None declared. Procedure should not assume external CLIs are present.

## Load Upfront

- None declared.

## Load On Demand

- `references/notes/galaxy-workflow-testability-design.md`: Research note copied verbatim into the bundle. Confirm the collection shapes and labels the resolved data must populate so downstream tests can address them. Use when: mapping an input label in the data-flow brief to a concrete file or collection of files.
- `references/notes/iwc-test-data-conventions.md`: Research note copied verbatim into the bundle. Match IWC test-data conventions — Zenodo/remote-URL first, SHA-1 integrity, per-input collection layout — when selecting or describing candidate fixtures. Use when: choosing where test data should come from or describing the shape a candidate file must have.

## Validation

- None declared.

## Procedure

Resolve concrete test data for the workflow's inputs. Read the interface brief for each input's Galaxy shape and datatype, **and the source summary for the data the source itself names** — then search IWC fixtures and public sources for data that matches. Emit `test-data-refs.json`: one entry per input, each carrying a URL or path plus the expected shape, ready for implement-galaxy-workflow-test to stage.

This skill is the first leg of the harness's `test-data-resolution` branch. It resolves what it can and reports gaps; the harness routes any unresolved input to the `user-supplied` fallthrough. Deciding to ask the user is a harness concern, not this skill's — its job is an honest, source-backed match.

### Sequence

1. **Enumerate inputs and their required shape.** From the interface brief, list each workflow input: label, Galaxy collection shape (File / list / paired / list:paired / record), and datatype. This is the *target shape* every match must satisfy.
2. **Mine the source summary for named data.** The interface and data-flow briefs are design artifacts — they deliberately drop dataset provenance. The source summary (`freeform-summary` / `summary-nextflow` / `summary-cwl`) is where the source names its data: sample-data locations, accessions, public-data candidates, fallback bundles, and sizing guidance ("one chromosome", "precomputed count matrix", "small subset"). Pull every candidate dataset and every data-sizing instruction the source gives.
3. **Match each named candidate against the required shape — and don't stop at a shape mismatch.** Check each candidate against step 1's target shape and datatype. A candidate that is the wrong *shape* (e.g. raw signal / reads named when the input is a count matrix) is **not** a resolution — but it is also **not** the end of the search. When the source's named candidates don't fit, follow the source's own guidance to the right-shape public artifact: if the source says the input is a precomputed count matrix, find the canonical public count matrix for that study/domain (GEO/ENCODE/ArrayExpress series, a published supplementary table) rather than reporting "no data." "Named candidates are the wrong shape" ≠ "no data exists."
4. **Search IWC fixtures and public sources.** Prefer existing IWC test data for the same domain — it already conforms to iwc-test-data-conventions (remote URL, recorded hash, known collection layout); a near-neighbour IWC `-tests.yml` is the strongest source. Otherwise resolve the right-shape public dataset found in step 3, sized for a fast test run.
5. **"Small" is a documented subset of a real source, not a fabricated stand-in.** When the source asks for a small fixture (one chromosome, selected loci, a few samples) and only a full real dataset exists, that input is **resolved**: record the real source URL plus the data-import-boundary prep needed to reach the small shape (row-subset by key, column/sample split into the collection's element identifiers). The prep is a note on the ref, not an analysis step and not an excuse to mark the input unresolved. Resolve the *data*; leave analysis parameters (factors, thresholds, reference levels, top-N) to the design skills — they are not this skill's to decide.
6. **Emit refs.** Write one `test-data-refs.json` entry per input: the URL/path, the expected Galaxy shape, datatype, element identifiers when it is a collection, integrity hash when known, and any subset/split prep. Per galaxy-workflow-testability-design, make sure each entry maps to an addressable input label.
7. **Report genuine gaps only.** Mark `resolved: false` with a reason **only** when steps 2–5 turn up no real source of the right shape — not merely because the source's first-named candidate was the wrong shape or because a real source needs a documented subset. These honest gaps are what the harness hands to `user-supplied`.

### No fabrication

Never invent a URL, accession, or path to make an input look resolved, and never emit a placeholder path (`sampleA.tabular`, `test-data/…`) for an input you could not resolve — an unresolved input stays `resolved: false` all the way through to the test, never papered over with a made-up path downstream. A wrong-but-plausible fixture reference is worse than an honest gap: it survives static checks and fails only at run time, far from this skill. Every emitted ref must point at data that exists (a real source, optionally plus a reproducible subset/split); everything else is a reported gap.

## Runtime Notes

- Do not read Foundry source files at runtime; use only files packaged in this skill bundle and user-supplied artifacts.
- Preserve declared artifact filenames unless the user or harness supplies explicit paths.
- Carry unresolved assumptions into the output artifact instead of silently inventing missing source evidence.