Sequence: merge FASTA and filter unique

Concatenate several FASTA files into one and drop duplicate records by sequence identity in a single step, using fasta_merge_files_and_filter_unique_sequences.

Revised: 2026-06-10 (Rev 1)

Operation

Combine several FASTA files into one and drop duplicate records in the same step, where “duplicate” is decided by sequence identity, not header. The corpus uses:

toolshed.g2.bx.psu.edu/repos/galaxyp/fasta_merge_files_and_filter_unique_sequences (“FASTA Merge Files and Filter Unique Sequences”).

The differentiator from a plain text concatenation (cat, tp_cat) is the dedup: the same sequence can arrive under different headers across inputs, and this tool collapses them to one record. That is the reason to reach for it; if you do not need dedup, a concatenation is simpler. Parameter names below are corpus-inferred from tool_state.
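
To make the contrast with plain concatenation concrete, here is a minimal Python sketch of merge-and-dedup by sequence. It is not the Galaxy tool's implementation: the helper names `parse_fasta` and `merge_unique_by_sequence` are hypothetical, and two behaviours are assumptions made for illustration (the first record seen for a sequence is the one kept, and sequences are compared case-sensitively).

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header text without '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique_by_sequence(fastas: Iterable[str]) -> List[Record]:
    """Pool all inputs, keeping one record per distinct sequence.

    Assumption: the first record encountered for a sequence wins.
    """
    seen = set()
    merged: List[Record] = []
    for text in fastas:
        for header, seq in parse_fasta(text):
            if seq in seen:
                continue  # same sequence under another header: dropped
            seen.add(seq)
            merged.append((header, seq))
    return merged


# Two inputs carrying the same sequence under different headers.
a = ">p1 first source\nMKT\nAAL\n>p2\nGGS\n"
b = ">alt_name second source\nMKTAAL\n>p3\nTTR\n"

concatenated = [rec for text in (a, b) for rec in parse_fasta(text)]
unique = merge_unique_by_sequence([a, b])
```

In this example a plain concatenation keeps four records, while the by-sequence merge keeps three: `alt_name` is dropped because its sequence already arrived as `p1`.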

When to reach for it

  • Building a non-redundant reference — a proteomics search database, a merged contig/sequence set — from per-source FASTAs where the same sequence recurs (the corpus uses it for clinical-MP search DBs and pathogen-contig aggregation).
  • Pooling a collection of FASTAs into a single file with duplicates removed.

If you want to keep duplicates (e.g. preserving every record for counting), this is the wrong tool — concatenate instead.

Parameters

  • batchmode.processmode: merge — merge all inputs into one output (vs per-input processing).
  • batchmode.input_fastas: the FASTA set (connected; a collection or multiple inputs).
  • uniqueness_criterion: sequence in the corpus — dedup by sequence content. (Dedup-by-header would keep same-sequence/different-header records.)
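  • accession_parser: a regex whose first capture group extracts the accession/id from each header for the retained record. Corpus value ^>([^ ]+).*$ takes the first space-delimited token after >. The class [^ ] excludes only the space character, so a tab does not end the token.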
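
The corpus regex can be tried against sample headers before it is wired into a workflow. The sketch below uses Python's `re` module. The header strings are invented examples and `accession` is a hypothetical helper. That the tool applies the pattern with the same semantics as Python's `re` is an assumption.

```python
import re

# Corpus value of accession_parser, applied to the full header line.
ACCESSION_PARSER = re.compile(r"^>([^ ]+).*$")


def accession(header_line: str):
    """Return the captured accession, or None if the header does not match."""
    m = ACCESSION_PARSER.match(header_line)
    return m.group(1) if m else None


# Invented headers covering a few common shapes.
examples = [
    ">sp|P12345|EXAMPLE_HUMAN Example protein OS=Homo sapiens",
    ">contig_17 length=2048",
    ">id_only",
    "> leading_space",   # space right after '>': no match
    ">tab\tseparated",   # tab is not a space, so it stays in the token
]

parsed = [accession(h) for h in examples]
```

The last two cases are the ones worth checking on real data: a header with a space directly after `>` does not match at all, and a tab-separated header yields the whole line after `>` as its accession.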

Idiomatic shape

tool_id: toolshed.g2.bx.psu.edu/repos/galaxyp/fasta_merge_files_and_filter_unique_sequences/fasta_merge_files_and_filter_unique_sequences/1.2.0
tool_state:
  batchmode:
    processmode: merge
    input_fastas: { __class__: ConnectedValue }
  uniqueness_criterion: sequence
  accession_parser: ^>([^ ]+).*$

Pitfalls

  • Dedup is by sequence, not header — confirm that is what you want. Two genuinely distinct records that happen to share a sequence collapse to one; if headers carry meaning you need to keep, dedup elsewhere or preserve provenance first.
  • processmode: merge vs per-input. Merge pools everything into one file. If you wanted one deduped output per input, that is a different mode.
  • accession_parser decides which header survives. A regex that does not match a header shape leaves records mis-parsed; verify it against the actual > lines (the corpus’s ^>([^ ]+).*$ keeps the first token).
  • Not a concatenation substitute. If you do not need dedup, this tool’s parsing and uniqueness machinery is overhead — use a plain concat.
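
The first and third pitfalls can be checked before running the merge. The sketch below is a hypothetical pre-flight audit, not part of the tool: given (header line, sequence) pairs, it lists headers the accession_parser regex fails to match and groups headers that share a sequence, so provenance can be recorded before those records collapse. The function name `audit` and the sample records are assumptions for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def audit(
    records: Iterable[Tuple[str, str]],
    parser: str = r"^>([^ ]+).*$",
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (headers the regex misses, headers grouped by shared sequence)."""
    pattern = re.compile(parser)
    unmatched: List[str] = []
    by_sequence: Dict[str, List[str]] = defaultdict(list)
    for header, seq in records:
        if pattern.match(header) is None:
            unmatched.append(header)
        by_sequence[seq].append(header)
    # Only sequences seen under more than one header would collapse.
    collapsing = {s: hs for s, hs in by_sequence.items() if len(hs) > 1}
    return unmatched, collapsing


# Invented records: full header lines (with '>') paired with sequences.
records = [
    (">sp|P1|A_HUMAN protein A", "MKTAAL"),
    (">crap_001 contaminant", "MKTAAL"),
    ("> oddly spaced header", "GGS"),
    (">contig_9", "TTR"),
]

unmatched, collapsing = audit(records)
```

An empty `unmatched` list and an expected `collapsing` map are a reasonable signal that the regex and the by-sequence criterion fit the inputs. Anything surprising in either is cheaper to resolve here than after the merged file is built.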

See also

IWC exemplars (2 anchors)

microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation (high)

Merge mode over a set of per-sample FASTAs with uniqueness_criterion sequence and an accession_parser regex pulling the id from each header.

proteomics/clinicalmp/clinicalmp-database-generation/iwc-clinicalmp-database-generation (high)

Merge mode building a non-redundant proteomics search database from UniProt, microbial, and cRAP FASTAs.

  • Human UniProt Microbial Proteins cRAP for MetaNovo

Incoming References (3)

  • Galaxy: sequence patterns (related pattern): Use this MOC to choose corpus-grounded Galaxy operations on sequence records (FASTA): interconvert, reformat, merge, length, extract/mask by region.
  • Sequence: reformat FASTA line width (related pattern): Rewrap FASTA records to a fixed sequence-line width so downstream tools and viewers get canonical 60/70/80-column output; cshl_fasta_formatter.
  • IWC Sequence Operations Survey (related note): IWC survey of record-level FASTA manipulation (interconversion, reformat, merge/dedup, subset, extract-at-intervals); sizes a galaxy-sequence-patterns MOC.