Funded by EU Horizon Europe · OSCARS Initiative
Similarity-Preserving Content Codes
for Bioimaging Data
BIOCODES implements the ISO 24138 International Standard Content Code for bioimaging data — enabling integrity verification, deduplication, and cross-platform provenance to make your data AI-ready and FAIR-compliant.
Enhancing AI-Readiness of Bioimaging Data
with Content-Based Identifiers
Challenge
- Growing volume of data: Bio(imaging) data exist in different states (raw, repository, publication) with no shared identity across them.
- No Audit Trail: It is difficult to verify data integrity or detect manipulation after the fact.
- Lost Provenance: Published figures are disconnected from raw data and the processing steps that produced them.
Solution
- International Standard Content Code (ISCC, ISO 24138): an open, open-source, interoperable content identification and fingerprinting system. Computed directly from the asset itself, so it can never be removed or decoupled from the data.
- Anyone can compute ISCCs from available data, independently and without any central authority. Use the ISCC Generator at iscc.io/resources.
- Generating, signing, and timestamping ISCCs creates persistent identifiers to securely reference and link data in repositories with images in papers.
Scientific Impact
- Cryptographic verification of figure data via the ISCC audit trail.
- Improved data integrity, transparency, and reusability.
- Enhanced AI-readiness of bioimaging datasets for trusted AI applications.
ISO 24138
International Standard Content Code
A standardised (ISO 24138) multi-component fingerprint for various media types and file formats. Computed from the asset itself — it can never be removed or decoupled from the data.
Detects conceptually related content
Detects near-duplicate and structurally similar content
Detects exact copies via cryptographic hash
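The difference between exact-copy detection and near-duplicate detection can be illustrated with a minimal sketch. It is not the ISO 24138 algorithm: SHA-256 stands in for the cryptographic component, and a toy 64-bit SimHash over fixed-size chunks stands in for the similarity-preserving component. All function names and the chunk size are assumptions made for this illustration.

```python
import hashlib
import random

CHUNK = 64  # toy fixed chunk size in bytes (assumption for this sketch)


def exact_digest(data: bytes) -> str:
    """Cryptographic hash: any change to the data yields a different digest."""
    return hashlib.sha256(data).hexdigest()


def toy_similarity_code(data: bytes) -> int:
    """Toy 64-bit SimHash over fixed-size chunks (illustrative only)."""
    votes = [0] * 64
    for start in range(0, len(data), CHUNK):
        digest = hashlib.blake2b(data[start:start + CHUNK], digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        # Each chunk votes +1 or -1 on every bit position.
        for bit in range(64):
            votes[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def pseudo_random_bytes(seed: int, size: int) -> bytes:
    """Deterministic test data, so the example is reproducible."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


original = pseudo_random_bytes(0, 256 * CHUNK)
edited = bytearray(original)
edited[100] ^= 0xFF  # flip one byte, as a small edit would
edited = bytes(edited)
unrelated = pseudo_random_bytes(1, 256 * CHUNK)

near = hamming(toy_similarity_code(original), toy_similarity_code(edited))
far = hamming(toy_similarity_code(original), toy_similarity_code(unrelated))

print("exact match:", exact_digest(original) == exact_digest(edited))
print("bits differing, edited copy:", near)
print("bits differing, unrelated data:", far)
```

The cryptographic digest reports only that the edited copy is no longer identical, while the similarity code differs in few bits for the edited copy and in many bits for unrelated data.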
ISCC-ID
A persistent identifier derived from ISCC content codes. ISCC-IDs link raw data, processed derivatives, and published figures into a single auditable chain of provenance.
Helps find metadata even when filenames or paths have changed.
Proves authenticity of the content and its originator.
Demonstrates when content was created or registered.
Enables provenance verification across repositories, publications, and analysis pipelines.
Capabilities
Built for scientific data & bioimaging
Addressing the specific challenges of large-scale imaging data in modern research environments.
ISO 24138:2024 Compliant
Implements the official international standard, ensuring global interoperability across institutions, repositories, and tools.
High Performance
Rust-based engine processes data at 1+ GB/s — up to 184× faster than the pure Python reference implementation, and faster than SHA-256.
Format Agnostic
Works with OME-TIFF, OME-Zarr, CZI, ND2, LIF, DICOM, HDF5 and virtually any binary scientific data format.
FAIR Principles
Supports Findable, Accessible, Interoperable, and Reusable data principles as mandated by EOSC and European funding bodies.
AI-Ready Data
Persistent content-based identifiers survive format conversions and enable reliable provenance tracking for AI training datasets.
Platform Integration
Native plugins for OMERO and Galaxy, with Napari, CellProfiler, and ImageJ integrations planned.
Open Source Tools
The BIOCODES toolkit
Three complementary tools covering the full bioimaging identification workflow, all Apache 2.0 licensed.
iscc-sum
Stable v0.1
High-performance ISCC Data-Code and Instance-Code generation. Single-pass processing with a Rust core and Python bindings, usable as a drop-in replacement for md5sum and sha256sum in scientific pipelines. Faster than SHA-256 at any data size.
pip install iscc-sum
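What single-pass processing means can be sketched with the Python standard library alone. The example below does not use the iscc-sum API: two ordinary hash functions stand in for the Data-Code and Instance-Code, and the function name `single_pass_digests` is hypothetical. The point is only that each block is read once and fed to both computations.

```python
import hashlib
import io


def single_pass_digests(stream, block_size=1024 * 1024):
    """Read the stream once and update two hashers with every block."""
    first = hashlib.sha256()                  # stand-in for one code
    second = hashlib.blake2b(digest_size=16)  # stand-in for the other code
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        first.update(block)
        second.update(block)
        total += len(block)
    return first.hexdigest(), second.hexdigest(), total


payload = b"pixel data " * 1000
sha, blake, size = single_pass_digests(io.BytesIO(payload), block_size=4096)
print(size, sha[:16], blake[:16])
```

Because the data is streamed block by block, memory use stays constant regardless of file size, which matters for multi-gigabyte acquisitions.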
iscc-bio
Beta
ISCC processing for multi-dimensional bioimage data. Implements the IMAGEWALK specification: deterministic Z→C→T plane traversal for format-agnostic, reproducible content hashing of microscopy volumes.
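Why a fixed traversal order makes hashing format-agnostic can be shown with a small sketch. It is not the IMAGEWALK specification or the iscc-bio API: it assumes that Z→C→T means Z varies fastest, then C, then T, uses SHA-256 as a stand-in hash, and represents a volume as a hypothetical dictionary from (z, c, t) to raw plane bytes.

```python
import hashlib
from itertools import product


def walk_planes(size_z, size_c, size_t):
    """Yield (z, c, t) indices with Z varying fastest, then C, then T (assumed order)."""
    for t, c, z in product(range(size_t), range(size_c), range(size_z)):
        yield z, c, t


def hash_volume(planes, size_z, size_c, size_t):
    """Hash plane bytes in traversal order, independent of storage order."""
    hasher = hashlib.sha256()
    for key in walk_planes(size_z, size_c, size_t):
        hasher.update(planes[key])
    return hasher.hexdigest()


def storage_order_digest(planes):
    """Naive alternative: hash planes in whatever order they are stored."""
    return hashlib.sha256(b"".join(planes.values())).hexdigest()


size_z, size_c, size_t = 3, 2, 2
planes_a = {
    (z, c, t): bytes([z, c, t]) * 16
    for t in range(size_t) for c in range(size_c) for z in range(size_z)
}
# Same planes, stored in reverse order, as a different container format might.
planes_b = dict(reversed(list(planes_a.items())))

digest_a = hash_volume(planes_a, size_z, size_c, size_t)
digest_b = hash_volume(planes_b, size_z, size_c, size_t)
print("traversal-order digests equal:", digest_a == digest_b)
print("storage-order digests equal:",
      storage_order_digest(planes_a) == storage_order_digest(planes_b))
```

The traversal-order digest is identical for both layouts, whereas hashing in storage order ties the result to how a particular file format happens to arrange its planes.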
omero-iscc
Alpha
OMERO server integration plugin. Automatically generates and stores ISCC identifiers for images imported into OMERO, enabling facility-level deduplication and FAIR-compliant provenance tracking.
Community & Integrations
Built for the open bioimaging ecosystem
BIOCODES integrates with the tools researchers already use — no new infrastructure required.
OMERO
Available
Server plugin for automatic ISCC generation on image import. Enables facility-level deduplication and provenance tracking within OMERO's image management infrastructure.
- Automatic ISCC generation on import
- Facility-level deduplication
- FAIR-compliant metadata annotation
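Deduplication by content identifier amounts to grouping files that share the same code. The sketch below is not the omero-iscc plugin: it uses SHA-256 as a stand-in for an exact-match content code, and the function name and file names are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose content hashes to the same digest."""
    groups = defaultdict(list)
    for name, content in files.items():
        groups[hashlib.sha256(content).hexdigest()].append(name)
    # Only groups with more than one member are duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


files = {
    "plate1/a.tif": b"\x00\x01\x02" * 100,
    "backup/copy_of_a.tif": b"\x00\x01\x02" * 100,  # same bytes, different path
    "plate1/b.tif": b"\x03\x04\x05" * 100,
}
duplicates = find_exact_duplicates(files)
print(duplicates)
```

Because grouping depends only on content, renamed or relocated copies are still recognised as duplicates.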
Galaxy
Available
Galaxy tools for ISCC generation, near-duplicate detection, and content verification within reproducible Galaxy workflows. Part of the BMCV galaxy-image-analysis tool suite.
- iscc_sum — ISCC code generation for any file
- iscc_similarity — near-duplicate detection across datasets
- iscc_verify — content integrity verification
Napari
Planned
Plugin integration for interactive image analysis and ISCC annotation within the Napari viewer.
- Interactive ISCC annotation
- In-viewer provenance display
CellProfiler
Planned
Pipeline integration for high-content screening datasets, enabling ISCC-based deduplication across large plate acquisitions.
- Pipeline-level ISCC generation
- High-content screening deduplication
ImageJ / Fiji
Planned
Drop-in ISCC checksum support for existing ImageJ-based workflows as a replacement for standard checksum tools.
- Drop-in md5/sha256 replacement
- Macro-scriptable ISCC generation
Team
Project Team
BIOCODES is a collaboration between the ISCC Foundation, Leiden University, and German BioImaging, funded by the European Union.
ISCC Foundation
- Titusz Pan (PI), tp@iscc.io
- Kira Lemke
- Martin Etzrodt
Leiden University
- Sylvia Le Dévédec (PI), s.e.ledevedec@lacdr.leidenuniv.nl
- Maarten Paul
- Sebastian Posth
- Joost Willemse
German BioImaging
- Josh Moore