Funded by EU Horizon Europe · OSCARS Initiative

Similarity-Preserving Content Codes
for Bioimaging Data

BIOCODES implements the ISO 24138 International Standard Content Code for bioimaging data — enabling integrity verification, deduplication, and cross-platform provenance to make your data AI-ready and FAIR-compliant.

Apache 2.0 · Open Source · ISO 24138:2024 · FAIR Principles

Enhancing AI-Readiness of Bioimaging Data
with Content-Based Identifiers

Challenge

  • Growing volume of data: Bio(imaging) data exist in different states (raw, repository, publication) with no shared identity across them.
  • No Audit Trail: Data integrity is hard to verify, and manipulation is hard to detect after the fact.
  • Lost Provenance: Published figures are disconnected from raw data and the processing steps that produced them.

Solution

  • International Standard Content Code (ISCC, ISO 24138): an open, open-source, interoperable system for content identification and fingerprinting. It is computed directly from the asset itself, so it can never be removed or decoupled from the data.
  • Anyone can compute ISCCs from available data, independently and without any central authority. Use the ISCC Generator at iscc.io/resources.
  • Generating, signing, and timestamping ISCCs creates persistent identifiers to securely reference and link data in repositories with images in papers.

Scientific Impact

  • Cryptographic figure data verification via the ISCC audit trail.
  • Improved data integrity, transparency, and reusability.
  • Enhanced AI-readiness of bioimaging datasets for trusted AI applications.

ISO 24138

International Standard Content Code

A standardised (ISO 24138) multi-component fingerprint for various media types and file formats. Computed from the asset itself — it can never be removed or decoupled from the data.

Semantic level

Detects conceptually related content

Syntactic level

Detects near-duplicate and structurally similar content

Data level

Detects exact copies via cryptographic hash
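
The difference between the data level and the similarity levels can be illustrated with a minimal sketch. It uses only the Python standard library and does not call any ISCC implementation: SHA-256 stands in for the cryptographic hash, the 64-bit similarity codes are made-up values, and `exact_copy` and `hamming_distance` are hypothetical helpers. Exact copies are found by comparing digests, while near-duplicates are found by counting the bits in which two similarity codes differ.

```python
import hashlib


def exact_copy(a: bytes, b: bytes) -> bool:
    """Data level: two byte streams are exact copies if their digests match."""
    # SHA-256 is a stand-in for the hash prescribed by the standard.
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def hamming_distance(code_a: int, code_b: int) -> int:
    """Similarity levels: number of differing bits between two bit codes."""
    return bin(code_a ^ code_b).count("1")


# Illustrative 64-bit similarity codes (made-up values, not real ISCC output).
original = 0xF0F0F0F0F0F0F0F0
recompressed = 0xF0F0F0F0F0F0F0F3  # differs from `original` in 2 bits
unrelated = 0x0F0F0F0F0F0F0F0F     # differs from `original` in all 64 bits

print(exact_copy(b"pixels", b"pixels"))
print(hamming_distance(original, recompressed))
print(hamming_distance(original, unrelated))
```

A small distance suggests related content; the cut-off that counts as "similar" is a policy choice and is not fixed by this sketch.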

ISCC-ID

A persistent identifier derived from ISCC content codes. ISCC-IDs link raw data, processed derivatives, and published figures into a single auditable chain of provenance.

Fingerprinting

Helps find metadata even when filenames or paths have changed.

Digital signing

Proves authenticity of the content and its originator.

Timestamping

Demonstrates when content was created or registered.

Secure linking

Enables provenance verification across repositories, publications, and analysis pipelines.

ISCC-ID: Persistent Content Identifier

  1. ISCC-CODE: Algorithmically generated, reproducible data descriptor.
  2. Actor: Entity owning the ISCC-ID.
  3. Creation Time: Timestamp of ISCC-ID creation.
  4. Metadata URL: URL for accessing ISCC-ID metadata.
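
The four components of an ISCC-ID (ISCC-CODE, Actor, Creation Time, Metadata URL) could be modelled as a simple record. The following is a sketch only: the class name `IsccIdRecord`, its field names, the example values, and the use of HMAC as a stand-in for a real public-key signature are all assumptions for illustration, not the format defined by ISO 24138.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IsccIdRecord:
    """Hypothetical container for the four ISCC-ID components."""
    iscc_code: str      # 1. reproducible data descriptor
    actor: str          # 2. entity owning the ISCC-ID
    created: str        # 3. creation time (ISO 8601, UTC)
    metadata_url: str   # 4. where metadata can be retrieved

    def canonical_bytes(self) -> bytes:
        # Sorted keys give a stable byte representation to sign.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


def sign(record: IsccIdRecord, secret: bytes) -> str:
    # HMAC-SHA256 stands in for a real digital signature here.
    return hmac.new(secret, record.canonical_bytes(), hashlib.sha256).hexdigest()


def verify(record: IsccIdRecord, secret: bytes, signature: str) -> bool:
    # Constant-time comparison of the recomputed and the supplied signature.
    return hmac.compare_digest(sign(record, secret), signature)


record = IsccIdRecord(
    iscc_code="ISCC:EXAMPLE",  # placeholder, not a valid code
    actor="example-lab",
    created=datetime(2025, 1, 1, tzinfo=timezone.utc).isoformat(),
    metadata_url="https://example.org/metadata/1",
)
signature = sign(record, b"demo-secret")
```

Changing any field after signing makes `verify` return `False`, which is the property an audit trail relies on.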

Capabilities

Built for scientific data & bioimaging

Addressing the specific challenges of large-scale imaging data in modern research environments.

ISO 24138:2024 Compliant

Implements the official international standard, ensuring global interoperability across institutions, repositories, and tools.

High Performance

Rust-based engine processes data at 1+ GB/s — up to 184× faster than the pure Python reference implementation, and faster than SHA-256.
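
A throughput figure like this is easiest to interpret next to a local baseline. The sketch below times SHA-256 from the Python standard library over an in-memory buffer. `throughput_mb_per_s` is a hypothetical helper, the result depends entirely on the machine, and the code does not measure iscc-sum itself.

```python
import hashlib
import time


def throughput_mb_per_s(data: bytes, chunk_size: int = 1 << 20) -> float:
    """Hash `data` in chunks with SHA-256 and return MB processed per second."""
    hasher = hashlib.sha256()
    start = time.perf_counter()
    for offset in range(0, len(data), chunk_size):
        hasher.update(data[offset:offset + chunk_size])
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small inputs.
    return (len(data) / 1_000_000) / max(elapsed, 1e-9)


buffer = bytes(16 * 1024 * 1024)  # 16 MiB of zeros as a synthetic payload
print(f"SHA-256 baseline: {throughput_mb_per_s(buffer):.0f} MB/s")
```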

Format Agnostic

Works with OME-TIFF, OME-Zarr, CZI, ND2, LIF, DICOM, HDF5 and virtually any binary scientific data format.

FAIR Principles

Supports Findable, Accessible, Interoperable, and Reusable data principles as mandated by EOSC and European funding bodies.

AI-Ready Data

Persistent content-based identifiers survive format conversions and enable reliable provenance tracking for AI training datasets.

Platform Integration

Native plugins for OMERO and Galaxy, with Napari, CellProfiler, and ImageJ integrations planned.

Open Source Tools

The BIOCODES toolkit

Three complementary tools covering the full bioimaging identification workflow, all Apache 2.0 licensed.

iscc-sum

Stable v0.1

High-performance ISCC Data-Code and Instance-Code generation. Single-pass processing with a Rust core and Python bindings — a drop-in replacement for md5sum and sha256sum in scientific pipelines. Faster than SHA-256 at any data size.

pip install iscc-sum
Platforms: Linux, macOS, Windows
Formats: Zarr, HDF5, OME-TIFF, NGFF
Rust + Python · CLI · Apache 2.0
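
Single-pass processing means that each chunk read from storage feeds every code being computed, so the file is read only once. The sketch below illustrates the idea with two standard-library hashers. It does not call iscc-sum: `digest_in_one_pass` is a hypothetical helper, and SHA-256 and BLAKE2b merely stand in for the Data-Code and Instance-Code computations.

```python
import hashlib
import io
from typing import BinaryIO, Tuple


def digest_in_one_pass(stream: BinaryIO, chunk_size: int = 1 << 20) -> Tuple[str, str]:
    """Read `stream` once and update two independent hashers per chunk."""
    first = hashlib.sha256()
    second = hashlib.blake2b(digest_size=32)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Both hashers consume the same chunk, so no second read is needed.
        first.update(chunk)
        second.update(chunk)
    return first.hexdigest(), second.hexdigest()


sha, blake = digest_in_one_pass(io.BytesIO(b"example image bytes"))
```

The chunk size affects only memory use and I/O granularity; the resulting digests are the same for any chunk size.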

iscc-bio

Beta

ISCC processing for multi-dimensional bioimage data. Implements the IMAGEWALK specification — deterministic Z→C→T plane traversal for format-agnostic, reproducible content hashing of microscopy volumes.

Platforms: Linux, macOS, Windows
Formats: OME-TIFF, OME-Zarr, CZI, ND2, LIF, DICOM, HDF5
Python · OME-TIFF · OME-Zarr · CZI / ND2 / LIF · Apache 2.0
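
The idea behind a deterministic plane traversal can be sketched as follows. This is not the IMAGEWALK specification itself: the function names are hypothetical, the volume is a plain dictionary keyed by (z, c, t), SHA-256 stands in for the ISCC computation, and the sketch assumes that Z→C→T means Z varies fastest, then C, then T. The point is only that a fixed visiting order yields the same hash regardless of how the planes happen to be stored.

```python
import hashlib
from typing import Dict, Iterator, Tuple

PlaneKey = Tuple[int, int, int]  # (z, c, t)


def walk_planes(size_z: int, size_c: int, size_t: int) -> Iterator[PlaneKey]:
    """Yield plane indices in a fixed order: Z fastest, then C, then T (assumed)."""
    for t in range(size_t):
        for c in range(size_c):
            for z in range(size_z):
                yield (z, c, t)


def hash_volume(planes: Dict[PlaneKey, bytes], size_z: int, size_c: int, size_t: int) -> str:
    """Hash raw plane bytes in traversal order, independent of storage order."""
    hasher = hashlib.sha256()  # stand-in for the actual ISCC computation
    for key in walk_planes(size_z, size_c, size_t):
        hasher.update(planes[key])
    return hasher.hexdigest()


# Two containers holding the same planes, inserted in different orders.
keys = list(walk_planes(2, 2, 1))
volume_a = {k: bytes(k) for k in keys}
volume_b = {k: bytes(k) for k in reversed(keys)}
```

Here `hash_volume(volume_a, 2, 2, 1)` and `hash_volume(volume_b, 2, 2, 1)` agree because the traversal order, not the container order, decides which bytes are hashed first.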

omero-iscc

Alpha

OMERO server integration plugin. Automatically generates and stores ISCC identifiers for images imported into OMERO, enabling facility-level deduplication and FAIR-compliant provenance tracking.

Platforms: Linux, macOS
OMERO Plugin · Python · Server · Apache 2.0

Community & Integrations

Built for the open bioimaging ecosystem

BIOCODES integrates with the tools researchers already use — no new infrastructure required.

OMERO

Available

Server plugin for automatic ISCC generation on image import. Enables facility-level deduplication and provenance tracking within OMERO's image management infrastructure.

  • Automatic ISCC generation on import
  • Facility-level deduplication
  • FAIR-compliant metadata annotation
omero-iscc
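
Deduplication on import can be pictured as a lookup keyed by content code. The sketch below is a toy in-memory index, not the omero-iscc plugin: `DedupIndex`, `register`, and the image IDs are hypothetical, and a SHA-256 digest stands in for the ISCC.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, List


class DedupIndex:
    """Toy in-memory index mapping a content code to the images that share it."""

    def __init__(self) -> None:
        self._by_code: DefaultDict[str, List[int]] = defaultdict(list)

    def register(self, image_id: int, content: bytes) -> List[int]:
        """Record an imported image and return IDs of earlier identical images."""
        code = hashlib.sha256(content).hexdigest()  # stand-in for an ISCC
        earlier = list(self._by_code[code])
        self._by_code[code].append(image_id)
        return earlier


index = DedupIndex()
first = index.register(101, b"plate-A well-B3")
second = index.register(102, b"plate-A well-B3")  # same bytes, new import
third = index.register(103, b"plate-A well-B4")
```

A non-empty return value signals that the same content already exists in the facility, so the import can be flagged or linked rather than stored twice.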

Galaxy

Available

Galaxy tools for ISCC generation, near-duplicate detection, and content verification within reproducible Galaxy workflows. Part of the BMCV galaxy-image-analysis tool suite.

  • iscc_sum — ISCC code generation for any file
  • iscc_similarity — near-duplicate detection across datasets
  • iscc_verify — content integrity verification
galaxy-image-analysis / iscc-sum
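
The verification and near-duplicate steps listed above can be sketched as follows. The code does not reproduce the Galaxy tools: `verify_content` and `near_duplicates` are hypothetical helpers, SHA-256 stands in for the ISCC, the similarity codes are made-up integers, and the threshold of 8 differing bits is an arbitrary choice for illustration.

```python
import hashlib
import hmac
from itertools import combinations
from typing import Dict, List, Tuple


def verify_content(content: bytes, recorded_digest: str) -> bool:
    """Recompute the digest and compare it with the recorded value."""
    actual = hashlib.sha256(content).hexdigest()
    return hmac.compare_digest(actual, recorded_digest)


def near_duplicates(codes: Dict[str, int], max_bits: int = 8) -> List[Tuple[str, str]]:
    """Return name pairs whose codes differ in at most `max_bits` bits."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(sorted(codes.items()), 2):
        if bin(a ^ b).count("1") <= max_bits:
            pairs.append((name_a, name_b))
    return pairs


recorded = hashlib.sha256(b"raw stack").hexdigest()
codes = {"raw.tif": 0b1111_0000, "crop.tif": 0b1111_0001, "other.tif": 0xFFFF_FF00}
```

The pairwise comparison is quadratic in the number of files, which is fine for a sketch but would need an index for large datasets.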

Napari

Planned

Plugin integration for interactive image analysis and ISCC annotation within the Napari viewer.

  • Interactive ISCC annotation
  • In-viewer provenance display

CellProfiler

Planned

Pipeline integration for high-content screening datasets, enabling ISCC-based deduplication across large plate acquisitions.

  • Pipeline-level ISCC generation
  • High-content screening deduplication

ImageJ / Fiji

Planned

Drop-in ISCC checksum support for existing ImageJ-based workflows as a replacement for standard checksum tools.

  • Drop-in md5/sha256 replacement
  • Macro-scriptable ISCC generation

Team

Project Team

BIOCODES is a collaboration between the ISCC Foundation, Leiden University, and German BioImaging, funded by the European Union.

ISCC Foundation

  • Titusz Pan (PI), tp@iscc.io
  • Kira Lemke
  • Martin Etzrodt

Leiden University

German BioImaging

  • Josh Moore