Funded by EU Horizon Europe · OSCARS Initiative

Similarity-Preserving Codes
for Bioimaging Data

BIOCODES brings the ISO 24138 International Standard Content Code to bioimaging data. Verify integrity, find duplicates, and trace provenance across platforms — from raw data to publication.

Apache 2.0 · Open Source · ISO 24138:2024 · FAIR Principles

Enhancing AI-Readiness of Bioimaging Data
with Content-Based Identifiers

Challenge

  • Growing volume of data: Bio(imaging) data exist in different states (raw, repository, publication) with no shared identity across them.
  • No Audit Trail: No reliable way to verify data integrity or detect manipulation after the fact.
  • Lost Provenance: Published figures are disconnected from raw data and the processing steps that produced them.

Solution

  • International Standard Content Code (ISCC, ISO 24138): an open, open-source, interoperable system for content identification and fingerprinting. Codes are computed directly from the asset itself, so they can never be removed or decoupled from the data.
  • Anyone can compute ISCCs from available data, independently and without any central authority, for example with the ISCC Generator at iscc.io/resources.
  • Sign and timestamp ISCCs to create persistent identifiers that securely link repository data with the figures in published papers.

Scientific Impact

  • Cryptographic verification of figure data via the ISCC audit trail.
  • Better data integrity and reusability through transparent, verifiable provenance chains.
  • AI-ready bioimaging datasets with verified origins, so AI models can train on data you can trace.

ISO 24138

International Standard Content Code

A standardised (ISO 24138) multi-component fingerprint for various media types and file formats. Computed from the asset itself — it can never be removed or decoupled from the data.

Semantic level

Detects conceptually related content

Syntactic level

Detects near-duplicate and structurally similar content

Data level

Detects exact copies via cryptographic hash
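The difference between the data level and the similarity-preserving levels can be illustrated with a small sketch. The 64-bit fingerprints below are hypothetical values written out by hand, not real ISCC codes, and the helper names are assumptions made for this example. The point is only that exact copies are confirmed by equal cryptographic hashes, while near-duplicates show up as fingerprints that differ in few bits.

```python
import hashlib


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def is_exact_copy(data_a: bytes, data_b: bytes) -> bool:
    """Data-level check: identical bytes give identical SHA-256 digests."""
    return hashlib.sha256(data_a).digest() == hashlib.sha256(data_b).digest()


# Hypothetical 64-bit fingerprints, for illustration only.
original = bytes.fromhex("f0f0f0f0aa55aa55")
near_dup = bytes.fromhex("f0f0f0f1aa55aa54")   # two bits flipped
unrelated = bytes.fromhex("0f0f0f0f55aa55aa")  # every bit flipped

print(hamming_distance(original, near_dup))   # small distance: likely related
print(hamming_distance(original, unrelated))  # large distance: unrelated
```

A real deployment would choose a distance threshold per code type; the sketch leaves that decision open.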

ISCC-ID

A persistent identifier derived from ISCC content codes. ISCC-IDs connect raw data, processed derivatives, and published figures into one auditable provenance chain.

Fingerprinting

Helps find metadata even when filenames or paths have changed.

Digital signing

Proves authenticity of the content and its originator.

Timestamping

Demonstrates when content was created or registered.

Secure linking

Verifies provenance across repositories, publications, and analysis pipelines.

ISCC-ID: Persistent Content Identifier

  1. ISCC-CODE: Algorithmically generated, reproducible data descriptor.
  2. Actor: Entity owning the ISCC-ID.
  3. Creation Time: Timestamp of ISCC-ID creation.
  4. Metadata URL: URL for accessing ISCC-ID metadata.
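As a rough mental model, the four ISCC-ID components named above could be held in a small immutable record. The class name, field names, and example values here are hypothetical and chosen only for illustration; they are not taken from the ISCC specification or any BIOCODES tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IsccIdRecord:
    """Hypothetical container for the four ISCC-ID components."""

    iscc_code: str           # reproducible descriptor computed from the content
    actor: str               # entity owning the ISCC-ID
    creation_time: datetime  # when the ISCC-ID was created
    metadata_url: str        # where the ISCC-ID metadata can be accessed

    def __post_init__(self) -> None:
        # Require timezone-aware timestamps so records stay comparable.
        if self.creation_time.tzinfo is None:
            raise ValueError("creation_time must be timezone-aware")


# Placeholder values, not a real ISCC-CODE or registry entry.
record = IsccIdRecord(
    iscc_code="ISCC:PLACEHOLDER",
    actor="example-imaging-facility",
    creation_time=datetime(2025, 1, 1, tzinfo=timezone.utc),
    metadata_url="https://example.org/iscc-id/metadata",
)
```

Freezing the record mirrors the idea that an identifier, once issued, should not be edited in place.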

Capabilities

Built for scientific data & bioimaging

Designed for the specific problems of large-scale imaging data in research.

ISO 24138:2024 Compliant

Follows the international standard, so codes generated anywhere are compatible everywhere — across institutions, repositories, and tools.

High Performance

Rust-based engine processes data at 1+ GB/s — up to 184× faster than the pure Python reference implementation, and faster than SHA-256.

Format Agnostic

Works with OME-TIFF, OME-Zarr, CZI, ND2, LIF, DICOM, HDF5 and virtually any binary scientific data format.

FAIR Principles

Meets the Findable, Accessible, Interoperable, and Reusable data requirements from EOSC and European funding bodies.

AI-Ready Data

Content-based identifiers survive format conversions, so provenance stays intact when datasets move into AI training pipelines.

Platform Integration

Native plugins for OMERO and Galaxy, with Napari, CellProfiler, and ImageJ integrations in progress.

Open Source Tools

The BIOCODES toolkit

Three complementary tools covering the full bioimaging identification workflow, all Apache 2.0 licensed.

iscc-sum

Stable v0.1

High-performance ISCC Data-Code and Instance-Code generation. Single-pass processing with a Rust core and Python bindings — a drop-in replacement for md5sum and sha256sum in scientific pipelines. Faster than SHA-256 at any data size.

pip install iscc-sum
Platforms: Linux, macOS, Windows
Formats: Zarr, HDF5, OME-TIFF, NGFF
Rust + Python · CLI · Apache 2.0
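The single-pass idea can be sketched with the Python standard library: the file is read once, and every chunk is fed to two hashers at the same time. This is not the iscc-sum API and not the ISCC algorithm. SHA-256 and BLAKE2b are stand-ins for the two codes, and the function name is an assumption made for this example.

```python
import hashlib
import io
from typing import BinaryIO


def single_pass_digests(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> dict[str, str]:
    """Read the stream once and update several hashers per chunk."""
    hashers = {"sha256": hashlib.sha256(), "blake2b": hashlib.blake2b()}
    while chunk := stream.read(chunk_size):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Usage with an in-memory stream; a file opened with open(path, "rb") works the same way.
payload = b"bioimage-bytes" * 100_000
digests = single_pass_digests(io.BytesIO(payload), chunk_size=4096)
print(digests["sha256"])
```

Reading once matters most for large acquisitions, where I/O rather than hashing tends to dominate the runtime.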

iscc-bio

Beta

ISCC processing for multi-dimensional bioimage data. Implements the IMAGEWALK specification — deterministic Z→C→T plane traversal for format-agnostic, reproducible content hashing of microscopy volumes.

Platforms: Linux, macOS, Windows
Formats: OME-TIFF, OME-Zarr, CZI, ND2, LIF, DICOM, HDF5
Python · OME-TIFF · OME-Zarr · CZI / ND2 / LIF · Apache 2.0
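A deterministic plane traversal can be sketched as follows. This is not the IMAGEWALK specification itself: the sketch assumes that "Z→C→T" means Z varies fastest, then C, then T, and it uses SHA-256 as a stand-in for ISCC processing. The function names and the dictionary-of-planes layout are hypothetical. What it shows is that hashing planes in a fixed logical order gives the same result regardless of how the planes were stored.

```python
import hashlib
from itertools import product
from typing import Iterator


def plane_order(size_z: int, size_c: int, size_t: int) -> Iterator[tuple[int, int, int]]:
    """Yield (z, c, t) indices with Z fastest, then C, then T (assumed order)."""
    for t, c, z in product(range(size_t), range(size_c), range(size_z)):
        yield z, c, t


def hash_volume(planes: dict[tuple[int, int, int], bytes],
                size_z: int, size_c: int, size_t: int) -> str:
    """Hash 2D planes in the fixed traversal order, independent of storage order."""
    hasher = hashlib.sha256()  # stand-in for the real content hashing
    for index in plane_order(size_z, size_c, size_t):
        hasher.update(planes[index])
    return hasher.hexdigest()


# Two containers holding the same planes, inserted in different orders.
indices = list(plane_order(size_z=2, size_c=2, size_t=1))
store_a = {idx: bytes(idx) * 16 for idx in indices}
store_b = {idx: bytes(idx) * 16 for idx in reversed(indices)}

print(hash_volume(store_a, 2, 2, 1) == hash_volume(store_b, 2, 2, 1))
```

Format readers would supply the planes in practice; the traversal only needs the dimension sizes and a way to fetch a plane by index.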

omero-iscc

Alpha

OMERO server plugin. Generates and stores ISCC identifiers automatically on image import, so facilities can deduplicate and track provenance without extra steps.

Platforms: Linux, macOS
OMERO Plugin · Python · Server · Apache 2.0

Community & Integrations

Built for the open bioimaging ecosystem

BIOCODES integrates with the tools researchers already use — no new infrastructure required.

OMERO

Available

Server plugin that generates ISCCs automatically on image import. Facilities get deduplication and provenance tracking without changing their existing OMERO workflows.

  • Automatic ISCC generation on import
  • Facility-level deduplication
  • FAIR-compliant metadata annotation
omero-iscc

Galaxy

Available

Galaxy tools for ISCC generation, near-duplicate detection, and content verification within reproducible Galaxy workflows. Part of the BMCV galaxy-image-analysis tool suite.

  • iscc_sum — ISCC code generation for any file
  • iscc_similarity — near-duplicate detection across datasets
  • iscc_verify — content integrity verification
galaxy-image-analysis / iscc-sum

Napari

Planned

Plugin integration for interactive image analysis and ISCC annotation within the Napari viewer.

  • Interactive ISCC annotation
  • In-viewer provenance display

CellProfiler

Planned

Pipeline integration for high-content screening datasets. Deduplicates across large plate acquisitions using ISCC codes.

  • Pipeline-level ISCC generation
  • High-content screening deduplication

ImageJ / Fiji

Planned

Drop-in ISCC checksum support for existing ImageJ-based workflows as a replacement for standard checksum tools.

  • Drop-in md5/sha256 replacement
  • Macro-scriptable ISCC generation

Team

Project Team

BIOCODES is a collaboration between the ISCC Foundation, Leiden University, and German BioImaging, funded by the European Union.

ISCC Foundation

  • Titusz Pan (PI), tp@iscc.io
  • Kira Lemke
  • Martin Etzrodt

Leiden University

German BioImaging

  • Josh Moore