Sequence: merge FASTA and filter unique

Operation

Combine several FASTA files into one and drop duplicate records in the same step, where “duplicate” is decided by sequence identity, not header. The corpus uses:

toolshed.g2.bx.psu.edu/repos/galaxyp/fasta_merge_files_and_filter_unique_sequences (“FASTA Merge Files and Filter Unique Sequences”).

The differentiator from a plain text concatenation (cat, tp_cat) is the dedup: the same sequence can arrive under different headers across inputs, and this tool collapses them to one record. That is the reason to reach for it; if you do not need dedup, a concatenation is simpler. Parameter names below are corpus-inferred from tool_state.
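To make the contrast with plain concatenation concrete, here is a minimal Python sketch of sequence-based deduplication. It is an illustration of the idea, not the tool's implementation: the function names are hypothetical, and it assumes exact, case-sensitive sequence matching with the first header seen being the one retained, neither of which is confirmed by the corpus.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique_by_sequence(fastas: Iterable[str]) -> List[Tuple[str, str]]:
    """Pool all inputs and keep one record per distinct sequence.

    Assumption: the first record seen for a sequence is the one kept.
    """
    seen = set()
    kept = []
    for text in fastas:
        for header, seq in parse_fasta(text):
            if seq in seen:
                continue  # same sequence already kept under another header
            seen.add(seq)
            kept.append((header, seq))
    return kept


# Two inputs carrying the same sequence (MKTAIL) under different headers.
file_a = ">sp|P1 first source\nMKT\nAIL\n>sp|P2\nGGG\n"
file_b = ">tr|Q9 second source\nMKTAIL\n>tr|Q8\nCCC\n"

merged = merge_unique_by_sequence([file_a, file_b])
concatenated = [rec for text in (file_a, file_b) for rec in parse_fasta(text)]

print(len(concatenated), "records after plain concatenation")
print(len(merged), "records after sequence-based dedup")
```

In this sketch the concatenation keeps four records while the deduplicated merge keeps three, because the MKTAIL sequence appears in both inputs under different headers.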

When to reach for it

Building a non-redundant reference — a proteomics search database, a merged contig/sequence set — from per-source FASTAs where the same sequence recurs (the corpus uses it for clinical-MP search DBs and pathogen-contig aggregation).
Pooling a collection of FASTAs into a single file with duplicates removed.

If you want to keep duplicates (e.g. preserving every record for counting), this is the wrong tool — concatenate instead.

Parameters

batchmode.processmode: merge — merge all inputs into one output (vs per-input processing).
batchmode.input_fastas: the FASTA set (connected; a collection or multiple inputs).
uniqueness_criterion: sequence in the corpus, meaning records are deduplicated by sequence content. (Deduplicating by header instead would keep records that share a sequence under different headers.)
accession_parser: a regex extracting the accession/id from each header for the retained record. Corpus value ^>([^ ]+).*$, which captures everything after > up to the first space character. The class [^ ] excludes only the space, so a tab does not end the token.
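The behaviour of the corpus regex can be checked outside Galaxy. The sketch below applies it with Python's re module. The tool's own regex engine may differ in edge cases, and the helper name and sample headers are made up for illustration.

```python
import re

# Corpus value of accession_parser, copied verbatim.
ACCESSION_PARSER = re.compile(r"^>([^ ]+).*$")


def parse_accession(header: str):
    """Return the captured accession, or None when the header does not match."""
    match = ACCESSION_PARSER.match(header)
    return match.group(1) if match else None


# Typical header: the capture stops at the first space.
typical = parse_accession(">sp|P12345|ALBU_HUMAN Serum albumin OS=Homo sapiens")

# Tab-separated header: [^ ] excludes only the space, so the tab is captured too.
tabbed = parse_accession(">contig_7\tlen=1200")

# A space directly after '>' leaves nothing for the group to capture: no match.
stray_space = parse_accession("> P12345 description")

print(repr(typical), repr(tabbed), repr(stray_space))
```

If headers in the inputs are tab-delimited, a class such as [^\s]+ would stop at any whitespace. That is a suggestion to test against real data, not a value seen in the corpus.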

Idiomatic shape

tool_id: toolshed.g2.bx.psu.edu/repos/galaxyp/fasta_merge_files_and_filter_unique_sequences/fasta_merge_files_and_filter_unique_sequences/1.2.0
tool_state:
  batchmode:
    processmode: merge
    input_fastas: { __class__: ConnectedValue }
  uniqueness_criterion: sequence
  accession_parser: ^>([^ ]+).*$

Pitfalls

Dedup is by sequence, not header — confirm that is what you want. Two genuinely distinct records that happen to share a sequence collapse to one; if headers carry meaning you need to keep, dedup elsewhere or preserve provenance first.
processmode: merge vs per-input. Merge pools everything into one file. If you wanted one deduped output per input, that is a different mode.
accession_parser decides how each header is read. A regex that does not match the actual header shape leaves records mis-parsed; verify it against the real > lines (the corpus’s ^>([^ ]+).*$ keeps the text up to the first space).
Not a concatenation substitute. If you do not need dedup, this tool’s parsing and uniqueness machinery is overhead — use a plain concat.
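Two of the pitfalls above can be checked before running the tool. The following is a rough pre-flight sketch, assuming the inputs fit in memory as text. The function name audit_fasta and the sample data are hypothetical. It lists headers the corpus regex fails to match and sequences that occur under more than one header, so provenance can be recorded before those records collapse.

```python
import re
from collections import defaultdict

# Corpus value of accession_parser, copied verbatim.
ACCESSION_PARSER = re.compile(r"^>([^ ]+).*$")


def audit_fasta(text: str):
    """Return (unmatched_headers, shared_sequences) for pooled FASTA text.

    shared_sequences maps each sequence found under more than one header
    to the list of those headers, in input order.
    """
    records = []  # each entry: [header, [sequence line, ...]]
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line, []])
        elif line and records:
            records[-1][1].append(line)

    unmatched = [h for h, _ in records if ACCESSION_PARSER.match(h) is None]

    headers_by_seq = defaultdict(list)
    for header, chunks in records:
        headers_by_seq["".join(chunks)].append(header)
    shared = {seq: hs for seq, hs in headers_by_seq.items() if len(hs) > 1}
    return unmatched, shared


pooled = (
    ">sp|P1 human\nMKTAIL\n"
    "> P2 stray space\nGGG\n"
    ">tr|Q9 mouse\nMKTAIL\n"
)
unmatched, shared = audit_fasta(pooled)
print("headers the parser does not match:", unmatched)
print("sequences with several headers:", shared)
```

An empty result on both counts suggests the regex fits the headers and that deduplication by sequence will not silently merge records whose headers differ.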
