Content Tracking and Verification in Galaxy Workflows with ISCC-SUM

Author(s): Maarten Paul, Martin Etzrodt
Reviewers: Beatriz Serrano-Solano, Diana Chiang Jurado, Pavankumar Videm, Leonid Kostrykin
Overview
Questions:
  • What is an ISCC code and why is it useful for data management?

  • How can I generate content hashes for microscopy data in Galaxy?

  • How can I verify file integrity in my workflows?

  • How can I detect similar content across different files?

Objectives:
  • Understand the purpose and structure of ISCC codes

  • Generate ISCC codes for files at different workflow stages

  • Verify file integrity using ISCC codes

  • Detect content similarity between files

Requirements:
Time estimation: 1 hour
Level: Intermediate
Published: Feb 14, 2026
Last modification: Feb 14, 2026
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 1

File content and integrity validation with International Standard Content Code (ISCC)

In scientific workflows, ensuring data integrity and tracking modifications of the data content is crucial for reproducibility. Traditional checksums (like MD5 or SHA) can verify if files are identical, but they cannot detect similar content or survive format conversions.

The International Standard Content Code (ISCC) is a content-derived identifier that provides both:

  • Identity verification: Checksum functionality to verify exact file matches
  • Similarity detection: Ability to detect similar content even across different file formats

The Galaxy ISCC-suite allows you to integrate content tracking into any Galaxy workflow, providing quality control and provenance tracking for your data analysis pipelines.
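To illustrate why a conventional checksum cannot express similarity, the following minimal sketch hashes two payloads that differ in a single byte. The payloads are hypothetical, and SHA-256 is used only because it ships with the Python standard library. The digests differ in a large share of their bits, so their distance carries no information about how close the inputs are.

```python
import hashlib


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Two hypothetical payloads that differ only in their final byte.
original = b"pixel data " * 1000
modified = original[:-1] + b"!"

digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(modified).digest()

print(digest_a == digest_b)                  # False: the payloads are not identical
print(hamming_distance(digest_a, digest_b))  # number of differing bits, out of 256
```

A similarity-preserving code such as the Data-Code is designed so that small changes in the input lead to small changes in the code instead.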

ISCC Code structure

An ISCC-SUM code is a 55-character identifier with two main components, which are combined into one code:

  • Data-Code: Content-based hash that allows similarity comparison
  • Instance-Code: A fast checksum or cryptographic hash

The Instance-Code uses BLAKE3 hashing, truncated to 64 bits by default. For applications requiring cryptographic-strength verification, ISCC-SUM can output the full 256-bit hash.

For example, the ISCC-CODE for the file example_image.tiff is:

ISCC:K4AI45QGX6J3LYNEHONZMQT2GJ6YPJDS74EIC2YMSORF4S5H5SKHQQI

This ISCC-CODE is built from two individual components — a Data-Code and an Instance-Code:

    ISCC:GADY45QGX6J3LYNEHONZMQT2GJ6YPZ4BXJQ7ZHWZ7EHRLKANDCSACWI  (Data-Code)
    ISCC:IAD2I4X7BCAWWDETUJPEXJ7MSR4EDZCTTBQS6LQQDWDHS6T4KDDPZ5A  (Instance-Code)

You might notice that the combined ISCC-CODE (K4AI...) looks completely different from these two components. That is expected: the ISCC algorithm takes shortened versions of both hashes, packs them together, and encodes the result as a new string. Think of it like combining two barcodes into a single, shorter barcode — the information is still there, just represented differently.

Files with similar content have similar Data-Code components, while their Instance-Codes differ unless the files are byte-for-byte identical. The Instance-Code is therefore the component used to verify file integrity.
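The relationship between the three example codes can be inspected with a short sketch. It rests on two assumptions that this tutorial does not state: that ISCC strings are unpadded RFC 4648 base32, and that each code starts with a 2-byte header. Under those assumptions, the combined example code carries the first 16 body bytes of the Data-Code followed by the first 16 body bytes of the Instance-Code.

```python
import base64

HEADER_BYTES = 2  # assumed header length; not stated in this tutorial


def decode_iscc(code: str) -> bytes:
    """Decode an ISCC string (with or without the 'ISCC:' prefix) to raw bytes."""
    body = code.removeprefix("ISCC:")
    padding = "=" * (-len(body) % 8)  # b32decode needs a multiple of 8 characters
    return base64.b32decode(body + padding)


combined = decode_iscc("ISCC:K4AI45QGX6J3LYNEHONZMQT2GJ6YPJDS74EIC2YMSORF4S5H5SKHQQI")
data_code = decode_iscc("ISCC:GADY45QGX6J3LYNEHONZMQT2GJ6YPZ4BXJQ7ZHWZ7EHRLKANDCSACWI")
instance_code = decode_iscc("ISCC:IAD2I4X7BCAWWDETUJPEXJ7MSR4EDZCTTBQS6LQQDWDHS6T4KDDPZ5A")

h = HEADER_BYTES
print(len(combined), len(data_code), len(instance_code))       # 34 34 34
print(combined[h:h + 16] == data_code[h:h + 16])               # True
print(combined[h + 16:h + 32] == instance_code[h:h + 16])      # True
```

The differing first characters (K4AI, GADY, IAD2) come from the header bytes, which is why the combined string looks unrelated to its components at first glance.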

Agenda

In this tutorial, we will deal with:

  1. File content and integrity validation with International Standard Content Code (ISCC)
    1. ISCC Code structure
  2. Prepare your data
    1. Get the data
  3. Generate ISCC codes
  4. Verify file integrity
    1. Manual verification
    2. Workflow integration
    3. Image analysis workflow integration
  5. Detect similar content
    1. Compare two files
    2. Find similar files in collections
  6. Practical use cases
    1. Use case 1: Quality control in image analysis pipelines
    2. Use case 2: Data deduplication and organization
    3. Use case 3: Reproducibility and data sharing
  7. Conclusion
  8. References

Prepare your data

For this tutorial, we’ll use a small dataset of microscopy images that demonstrates different use cases. However, the ISCC-SUM tools can be used for any type of digital content.

Get the data

Hands On: Data Upload
  1. Create a new history for this tutorial in Galaxy.

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Download the following image and import it into your Galaxy history.

    If you are importing the image via URL:

    • Copy the link location
    • Click Upload at the top of the activity panel
    • Select Paste/Fetch Data
    • Paste the link(s) into the text field
    • Press Start
    • Close the window

Generate ISCC codes

The first step is generating ISCC codes for your input files. This creates a content fingerprint that can be used for later content-based identification of the file (e.g., within your workflow or within a publication).

Hands On: Generate ISCC codes for input files
  1. Run Generate ISCC-CODE (Galaxy version 0.1.0+galaxy1) with the following parameters:
    • param-file “Input File”: Select the first example image (example_image.tiff).

    Run the tool. This will generate a 55-character ISCC code for the file.

  2. Expand the history item for the output of the Generate ISCC-CODE (Galaxy version 0.1.0+galaxy1) tool.

  3. Click on the details icon.

  4. Scroll down to the Job Outputs section. Select the Dataset. You should see a single line containing the ISCC code in the output. For the first example image the code is expected to be:
    K4AI45QGX6J3LYNEHONZMQT2GJ6YPJDS74EIC2YMSORF4S5H5SKHQQI 
    
  5. Repeat for the other example images to generate their ISCC codes for comparison.
Question
  1. Will the same file always generate the same ISCC code?

Solution
  1. Yes! The same file will always generate the identical ISCC code, making it suitable for integrity verification.

Verify file integrity

During workflow execution, you may want to verify that intermediate files match expected content. The Verify ISCC-CODE tool allows you to check if a file matches a known ISCC code.

Manual verification

Hands On: Verify a file against its ISCC code
  1. Run Verify ISCC-CODE (Galaxy version 0.1.0+galaxy1) with the following parameters:
    • param-file “Dataset to verify”: Select the first example image
    • “Expected ISCC-CODE”:
      • “Expected ISCC code”: Paste the ISCC code you generated in the previous step
  2. Expand the history item for the output of the Verify ISCC-CODE (Galaxy version 0.1.0+galaxy1) tool.

  3. Click on the details icon.

  4. Scroll down to the Job Output section and select the output to expand it. This shows a verification report that looks like this:
    OK - ISCC-CODEs match
    Expected:  K4AI45QGX6J3LYNEHONZMQT2GJ6YPJDS74EIC2YMSORF4S5H5SKHQQI
    Generated: K4AI45QGX6J3LYNEHONZMQT2GJ6YPJDS74EIC2YMSORF4S5H5SKHQQI
    

    The report shows:

    • Status: OK (match) or FAILED (mismatch)
    • Expected ISCC code
    • Generated ISCC code
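The comparison behind such a report can be sketched in a few lines. This is an illustration of the idea, not the tool's implementation: the `normalize` helper, its handling of whitespace and the `ISCC:` prefix, and the wording of the FAILED line are all assumptions.

```python
def normalize(code: str) -> str:
    """Trim whitespace, upper-case, and drop an optional 'ISCC:' prefix."""
    return code.strip().upper().removeprefix("ISCC:")


def verification_report(expected: str, generated: str) -> tuple[bool, str]:
    """Return (matched, report text) for an expected and a generated code."""
    exp, gen = normalize(expected), normalize(generated)
    matched = exp == gen
    # The FAILED wording is hypothetical; only the OK line appears in the tutorial.
    status = "OK - ISCC-CODEs match" if matched else "FAILED - ISCC-CODEs differ"
    report = f"{status}\nExpected:  {exp}\nGenerated: {gen}"
    return matched, report


code = "K4AI45QGX6J3LYNEHONZMQT2GJ6YPJDS74EIC2YMSORF4S5H5SKHQQI"
matched, report = verification_report("ISCC:" + code, code + "\n")
print(report)
```

Because the full code is compared character by character, any change to the file content that alters the code is reported as a mismatch.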

Workflow integration

A powerful use case is integrating ISCC verification directly into your workflows. Here we’ll build a simple verification workflow step by step.

Step 1 - Define the workflow inputs

To make the workflow reusable, we need to define two inputs: the image to verify and a file containing the expected ISCC code.

Hands On: Create the workflow inputs
  1. Create a new workflow in the workflow editor.

    1. Click Workflow on the top bar
    2. Click the new workflow button
    3. Give it a clear and memorable name
    4. Clicking Save will take you directly into the workflow editor for that workflow
    5. Need more help? Please see the How to make a workflow subsection here

  2. Select tool Input dataset from the list of tools:
    • param-file 1: Input Dataset appears in your workflow. Change the “Label” of this input to Input image.
  3. Add another tool Input dataset:
    • param-file 2: Input Dataset appears in your workflow. Change the “Label” of this input to Expected ISCC code file.

Step 2 - Parse the expected ISCC code

The Generate ISCC-CODE tool outputs the ISCC code as a text file, but the Verify ISCC-CODE tool expects the code as a parameter input. We use Parse parameter value to bridge this gap.

Hands On: Add the parameter parsing step
  1. While in the workflow editor, add tool Parse parameter value from the list of tools:
    • Connect the output of param-file 2: Expected ISCC code file to the “Input file containing parameter to parse” input of tool 3: Parse parameter value.
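Conceptually, this step turns the content of a one-line text dataset into a plain string value. A minimal sketch of that idea, assuming the file holds the code on its first non-empty line (the file name `expected_iscc.txt` and the `read_parameter` helper are hypothetical, not part of Galaxy):

```python
import tempfile
from pathlib import Path


def read_parameter(path: Path) -> str:
    """Return the first non-empty line of a text file, without surrounding whitespace."""
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            return line.strip()
    raise ValueError(f"no parameter value found in {path}")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file standing in for the Generate ISCC-CODE output dataset.
    code_file = Path(tmp) / "expected_iscc.txt"
    code_file.write_text(
        "K4AI45QGX6J3LYNEHONZMQT2GJ6YPJDS74EIC2YMSORF4S5H5SKHQQI\n",
        encoding="utf-8",
    )
    expected_code = read_parameter(code_file)

print(expected_code)
```

Stripping the trailing newline matters here, since a stray whitespace character would otherwise make an exact string comparison fail.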

Step 3 - Add the verification step

Now we add the ISCC verification tool and connect all the inputs.

Hands On: Add the ISCC verification step
  1. Add Verify ISCC-CODE (Galaxy version 0.1.0+galaxy1) from the list of tools:
    • Connect the output of param-file 1: Input image to the “Dataset to verify” input of tool 4: Verify ISCC-CODE.
    • Connect the output of tool 3: Parse parameter value to the “File containing expected ISCC code” input of tool 4: Verify ISCC-CODE.

The completed workflow should look like this:

Workflow diagram showing ISCC verification integrated into a simple workflow.

Step 4 - Run the workflow

Hands On: Run the verification workflow
  1. Run the workflow with the following inputs:
    • Input image: Select the first example image (example_image.tiff)
    • Expected ISCC code file: Select the ISCC code output generated in a previous step
  2. Wait for the workflow to complete. Then expand the history item for the output of the Verify ISCC-CODE (Galaxy version 0.1.0+galaxy1) tool.

  3. Click on the details icon.

  4. Scroll down to the Job Output section and select the output dataset. You should see a verification report similar to the one described in the manual verification section above.

Placed in a full workflow, this verification step can help validate that your processing did not unexpectedly alter the content.

Image analysis workflow integration

The same approach can be applied in an image analysis workflow to verify that an image processing tool reproduces the expected output. The example files include a thresholded image, example_thresholded1.tiff. We will use it to check whether the Otsu threshold result for example_image.tiff can be reproduced.

Hands On: Image analysis verification workflow
  1. Import and run the ready-to-use workflow:

    Hands On: Importing and launching a GTN workflow
    Launch ISCC Image Analysis Verification (View on GitHub, Download workflow) workflow.
    1. Click on Workflows in the Galaxy activity bar (on the left side of the screen, or in the top menu bar of older Galaxy instances). You will see a list of all your workflows.
    2. Click on Import at the top-right of the screen
    3. Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/imaging/tutorials/iscc-suite/workflows/ISCC---image-analysis-workflow-example.ga
    4. Click the Import workflow button

    Below is a short video demonstrating how to import a workflow from GitHub using this procedure:

    Video: Importing a workflow from URL

  2. Provide the inputs:
    • Original image: Select example_image.tiff - the image to be processed
    • Segmented image: Select example_thresholded1.tiff - the reference segmentation to compare against
  3. Run the workflow.

  4. Expand the history item for the output of the Verify ISCC-CODE (Galaxy version 0.1.0+galaxy1) tool.

  5. Click on the details icon.

  6. Scroll down to the Job Information section to view the “Tool Standard Output” log. You should see a verification report similar to the one described in the manual verification section above.

The workflow performs Otsu thresholding on the original image and verifies whether the result matches the expected segmentation using ISCC codes. This allows you to verify whether the thresholding method is working as expected and the algorithm has not been altered (e.g., in a new version).

Workflow diagram showing ISCC verification in an image analysis pipeline.

Comment: When to use verification

Verification is particularly useful:

  • After file transfers or storage operations
  • To confirm correct input files in complex workflows
  • As quality control checkpoints in processing pipelines
  • To detect unintended data modifications

Detect similar content

One of ISCC’s unique features is detecting similar content, even across different formats. This is useful for finding duplicates, tracking content transformations, or identifying related files.

Compare two files

Hands On: Compare two files for similarity
  1. Run Find datasets with similar ISCC-CODEs (Galaxy version 0.1.0+galaxy1) with the following parameters:
    • “Input type”: Datasets to compare
      • param-file Select multiple datasets (or a collection, see below)
    • “Similarity threshold (Hamming distance)”: 12 (default)
  2. The tool creates a tabular output that lists every input file. For each file that has a similar file within the set threshold, the similar file is listed alongside it.

Find similar files in collections

When working with a collection of files, you can identify all similar items. This is particularly useful when you have large datasets and want to find duplicates or track how processing affects content similarity.

Hands On: Find similar files in a collection
  1. Create a dataset collection with your test images

    • Click on Select Items at the top of the history panel
    • Check all the datasets in your history you would like to include
    • Click n of N selected and choose Advanced Build List

      build list collection menu item

    • You are now in the collection building wizard. Choose Flat List and click the ‘Next’ button in the bottom-right corner.

      collection building wizard flat list

    • Double-click on the file names to edit them. For example, remove file extensions or common prefixes/suffixes to clean up the names.

      edit and build a list collection

    • Enter a name for your collection
    • Click Build to build your collection
    • Click on the checkmark icon at the top of your history again

    Include all images from the tutorial: example_image.tiff, example_image2.tiff, example_image3.tiff and example_thresholded1.tiff

  2. Run Find datasets with similar ISCC-CODEs (Galaxy version 0.1.0+galaxy1) with the following parameters:
    • “Input type”: Datasets to compare
      • param-file Select your collection
    • “Similarity threshold (Hamming distance)”: 12 (default)
  3. Examine the output table. Each row represents a file from your collection. The columns show:
    • The filename
    • Its ISCC code
    • Any similar files found (with their similarity score)

    Files that share content (like example_image.tiff and example_image3.tiff, the latter being a slightly modified version) will be grouped together, although their ISCC-SUM codes are different.

    ISCC similarity table.

The Hamming distance counts how many bits differ between two Data-Codes. For the default 64-bit Data-Code, this ranges from 0 (identical) to 64 (completely different). The tool uses a default threshold of 12, meaning files with a distance of 12 or less are considered similar.

This threshold is a practical starting point — adjust it based on your use case: lower values for stricter matching, higher values to catch more distant similarities. Keep in mind that Data-Code similarity reflects byte-level similarity, not semantic content. Whether a given distance is scientifically meaningful depends on your domain and data.
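The distance calculation described above can be sketched as an XOR followed by a bit count. The 64-bit values below are hypothetical and not derived from real files. They only show how a few flipped bits stay under the threshold while many flipped bits do not.

```python
THRESHOLD = 12  # default threshold used in this tutorial


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


# Hypothetical 64-bit values, not taken from real files.
reference = 0x0123456789ABCDEF
slightly_modified = reference ^ 0x1F               # 5 bits flipped
heavily_modified = reference ^ 0xFFFFFFFF00000000  # 32 bits flipped

for label, value in [("slightly modified", slightly_modified),
                     ("heavily modified", heavily_modified)]:
    distance = hamming_distance(reference, value)
    print(label, distance, distance <= THRESHOLD)
# slightly modified 5 True
# heavily modified 32 False
```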

Question
  1. Looking at the similarity results table, why do example_image.tiff and example_image3.tiff show a match while example_thresholded1.tiff does not?

  2. What does a distance value of -1 indicate?

Solution
  1. example_image.tiff and example_image3.tiff contain nearly the same data, resulting in a Hamming distance of 5, which is below the threshold of 12. The thresholded image has undergone significant processing (binarization), which changes its content so substantially that it no longer matches the original within the similarity threshold.

  2. A distance of -1 indicates that no similar file was found within the specified threshold. The file is unique compared to all other files in the collection.
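The logic behind such a table, including the -1 convention, can be sketched as follows. The 64-bit values assigned to the file names are hypothetical, and this sketch reports only the closest match per file, which may differ from the layout of the actual tool output.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    match: str      # closest similar file, or "" if none
    distance: int   # Hamming distance to the match, or -1 if none


def similarity_table(codes: dict[str, int], threshold: int = 12) -> list[Row]:
    """For every file, find the closest other file within the threshold."""
    rows = []
    for name, code in codes.items():
        best_name, best_dist = "", -1
        for other, other_code in codes.items():
            if other == name:
                continue
            dist = bin(code ^ other_code).count("1")
            if dist <= threshold and (best_dist == -1 or dist < best_dist):
                best_name, best_dist = other, dist
        rows.append(Row(name, best_name, best_dist))
    return rows


base = 0x0123456789ABCDEF  # hypothetical value, not a real Data-Code
codes = {
    "example_image.tiff": base,
    "example_image2.tiff": base ^ 0x00FF00FF00FF00FF,
    "example_image3.tiff": base ^ 0x1F,
    "example_thresholded1.tiff": base ^ 0xFFFFFFFF00000000,
}
for row in similarity_table(codes):
    print(f"{row.name}\t{row.match}\t{row.distance}")
```

With these made-up values, example_image.tiff and example_image3.tiff point at each other with a distance of 5, while the other two files are reported with -1.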

Practical use cases

Use case 1: Quality control in image analysis pipelines

When processing large microscopy datasets:

  • Generate ISCC codes for raw images upon acquisition
  • Detect if processing steps produce consistent outputs across batch runs
  • Identify accidentally duplicated samples before analysis

Use case 2: Data deduplication and organization

When managing growing image repositories:

  • Scan collections to find duplicate uploads that waste storage
  • Identify images that are near-duplicates (e.g., same sample, different export settings)
  • Group related experimental replicates automatically

Use case 3: Reproducibility and data sharing

When publishing or sharing datasets:

  • Include ISCC codes in data publications for recipient verification
  • Document the exact input files used in published analyses
  • Enable collaborators to confirm they have identical source data
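A recipient-side check against published codes could look like the sketch below. The manifest structure, the `check_manifest` helper, and the status labels are hypothetical. Only the code for example_image.tiff is taken from this tutorial; the other entries are placeholders.

```python
def check_manifest(published: dict[str, str], local: dict[str, str]) -> dict[str, str]:
    """Compare locally generated codes against published ones, file by file."""
    result = {}
    for name, expected in published.items():
        if name not in local:
            result[name] = "MISSING"
        elif local[name] == expected:
            result[name] = "OK"
        else:
            result[name] = "MISMATCH"
    return result


# Codes as they might be listed in a data publication (placeholders are hypothetical).
published = {
    "example_image.tiff": "K4AI45QGX6J3LYNEHONZMQT2GJ6YPJDS74EIC2YMSORF4S5H5SKHQQI",
    "example_image2.tiff": "PLACEHOLDER-CODE-A",
    "example_image3.tiff": "PLACEHOLDER-CODE-B",
}
# Codes the recipient generated from their own copies of the files.
local = {
    "example_image.tiff": "K4AI45QGX6J3LYNEHONZMQT2GJ6YPJDS74EIC2YMSORF4S5H5SKHQQI",
    "example_image2.tiff": "PLACEHOLDER-CODE-X",
}

for name, status in check_manifest(published, local).items():
    print(name, status)
# example_image.tiff OK
# example_image2.tiff MISMATCH
# example_image3.tiff MISSING
```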

Conclusion

In this tutorial, you learned to use the Galaxy ISCC-suite for content tracking and verification:

  • Generate ISCC-CODE: Creates content-based identifiers for any file
  • Verify ISCC-CODE: Confirms files match expected content at workflow checkpoints
  • Find datasets with similar ISCC-CODEs: Detects related or duplicate content in collections

These tools help you maintain data integrity throughout your analysis workflows, from initial data import through to final results.

References

  • ISCC - International Standard Content Code: https://iscc.codes/
  • ISCC-SUM Implementation: https://github.com/iscc/iscc-sum
  • ISCC-SUM Quick Start: https://sum.iscc.codes/quickstart/