Sequence: relabel FASTA headers via tabular

Use this recipe when the change you need is to the FASTA header line — inject a sample id, strip or rewrite a prefix, normalize accession formatting — and doing it on raw FASTA is awkward. The corpus’s move is to detour through tabular: open the records so the header is column 1, rewrite that column with ordinary text processing, and close them back to FASTA. It is the highest-value sequence construct in IWC, recurring across the pathogen-identification and metagenomic-gene-catalogue families.

This is a sequence-record cousin of regex-relabel-via-tabular — that pattern relabels collection identifiers; this one relabels the header inside each sequence record. Same instinct (let tabular tools do the string work), different target.

The shape

fasta2tab → tp_find_and_replace on column 1 → tab2fasta. Three steps, the first and last being sequence-fasta-tabular-interconvert.

1. Open records to a table

toolshed.g2.bx.psu.edu/repos/devteam/fasta_to_tabular/fasta2tab with descr_columns: "1", keep_first: "0" — header becomes column 1, full sequence column 2 (Gene-based-Pathogen-Identification, step “sample_specific_contigs_tabular_file_preparation”).

2. Rewrite the header column

toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace scoped to column: "1", so that only the header is touched and the sequence in column 2 is left intact. In the corpus the find pattern is a regex that matches the whole header, and the replace_pattern is a connected value: the per-sample id is computed upstream (via compose_text_param) and injected into every header:

tool_id: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace
tool_state:
  infile: { __class__: ConnectedValue }      # the fasta2tab table
  find_and_replace:
    - find_pattern: ^(.+)$
      replace_pattern: { __class__: ConnectedValue }   # computed per-sample id
      is_regex: true
      global: true
      searchwhere: { searchwhere_select: column, column: "1" }

The searchwhere_select: column + column: "1" pin is load-bearing — without it the find/replace would also rewrite the sequence in column 2.
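To make the column scoping concrete, here is a minimal Python sketch of what a column-restricted regex replace does to tabular rows. It illustrates the idea only and is not the tool's implementation: the function name replace_in_column, the sample rows, and the replacement string sampleA_\1 are all hypothetical, and the backreference syntax shown is Python's, which may differ from what the tool expects.

```python
import re

def replace_in_column(rows, find_pattern, replace_pattern, column=1):
    """Apply a regex replacement to one 1-based column of tab-separated rows."""
    compiled = re.compile(find_pattern)
    out = []
    for row in rows:
        fields = row.split("\t")
        # Only the selected column is rewritten; every other field passes through.
        fields[column - 1] = compiled.sub(replace_pattern, fields[column - 1])
        out.append("\t".join(fields))
    return out

# Hypothetical two-column table: header in column 1, sequence in column 2.
rows = ["contig_1 len=8\tACGTACGT", "contig_2 len=4\tGGCC"]

# Hypothetical per-sample replacement; in the corpus this value is connected.
relabeled = replace_in_column(rows, r"^(.+)$", r"sampleA_\1")
```

Because the pattern ^(.+)$ matches any non-empty field, the same call without the column restriction would rewrite the sequence field as well.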

3. Close the table back to FASTA

toolshed.g2.bx.psu.edu/repos/devteam/tabular_to_fasta/tab2fasta on the edited table (Gene-based-Pathogen-Identification, step “contigs”). Headers carry the new label; sequences are unchanged.
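The full round trip can be sketched in plain Python as follows. This is a simplified stand-in for the three Galaxy steps, assuming one header column and one sequence column; the helper names fasta_to_rows and rows_to_fasta, the sample records, and the sampleA_ prefix are hypothetical. The sketch writes each sequence on a single line, so line wrapping may differ from the input even though the residues are identical.

```python
def fasta_to_rows(fasta_text):
    """Return (header, sequence) pairs; the header is the full text after '>'."""
    rows = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A new record starts: flush the previous one, if any.
            if header is not None:
                rows.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        rows.append((header, "".join(chunks)))
    return rows

def rows_to_fasta(rows):
    """Rebuild FASTA text from (header, sequence) pairs, one line per sequence."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in rows)

fasta = ">contig_1\nACGT\nACGT\n>contig_2\nGGCC\n"
rows = fasta_to_rows(fasta)

# Edit only the header element of each pair (hypothetical sample prefix).
relabeled = [(f"sampleA_{header}", sequence) for header, sequence in rows]
result = rows_to_fasta(relabeled)
```

Comparing the sequence elements of rows and relabeled is a cheap check that the relabel step left the bases alone.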

Why this shape

FASTA headers are interleaved with sequence lines, so a naive find/replace over the whole file risks matching sequence content and cannot easily target “the header only.” Splitting to two columns makes the header an addressable column; searchwhere column: "1" then guarantees the sequence is never touched. The round-trip is the price of that safety, and it is cheap.
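The hazard of working on raw FASTA can be seen in a small Python sketch that applies the same whole-line pattern directly to the file text. The sample record and the sampleA_ replacement are hypothetical; the point is only that a line-anchored regex cannot tell header lines from sequence lines.

```python
import re

fasta = ">contig_1\nACGT\n"

# Naive approach: run the whole-line pattern over every line of the file.
naive = re.sub(r"^(.+)$", r"sampleA_\1", fasta, flags=re.MULTILINE)

# Every line was rewritten: the header no longer starts with '>' and the
# sequence line now contains non-base characters.
lines = naive.splitlines()
```

A more careful pattern anchored on the leading ">" would avoid this particular failure, but the tabular detour removes the need to get that anchoring right.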

Pitfalls

Scope the find/replace to column 1. Dropping searchwhere column: "1" lets the pattern hit the sequence column and silently corrupt bases. This is the one mistake that turns the recipe from safe to dangerous.
Keep the interconversion symmetric. Split with descr_columns: "1" on the way out, and rebuild from that single header column on the way back in. An asymmetric split (see sequence-fasta-tabular-interconvert) rebuilds malformed headers.
A connected replace_pattern means the id is computed upstream. The corpus builds the per-sample string with compose_text_param before this step; if you hard-code the replacement you lose the per-sample parameterization that makes the recipe reusable across a collection.
Don’t reach for this to wrap lines or filter. Header editing only — width rewrap is sequence-reformat-line-width, record dedup is sequence-merge-and-dedup.
