Funded by EU Horizon Europe · OSCARS Initiative
Similarity-Preserving Codes for Bioimaging Data
BIOCODES brings the ISO 24138 International Standard Content Code to bioimaging data. Verify integrity, find duplicates, and trace provenance across platforms — from raw data to publication.
Enhancing AI-Readiness of Bioimaging Data with Content-Based Identifiers
Challenge
- Growing volume of data: Bioimaging data exist in different states (raw, repository, publication) with no shared identity across them.
- No Audit Trail: No reliable way to verify data integrity or detect manipulation after the fact.
- Lost Provenance: Published figures are disconnected from raw data and the processing steps that produced them.
Solution
- International Standard Content Code (ISCC, ISO 24138): an open, open-source, interoperable system for content identification and fingerprinting. The code is computed directly from the asset itself, so it can never be removed or decoupled from the data.
- Anyone can compute ISCCs from available data, independently and without any central authority. Use the ISCC Generator at iscc.io/resources.
- Sign and timestamp ISCCs to create persistent identifiers that securely link repository data with the figures in published papers.
Scientific Impact
- Cryptographic verification of figure data via the ISCC audit trail.
- Better data integrity and reusability through transparent, verifiable provenance chains.
- AI-ready bioimaging datasets with verified origins, so AI models can train on data you can trace.
ISO 24138
International Standard Content Code
A standardised (ISO 24138) multi-component fingerprint for various media types and file formats. Computed from the asset itself — it can never be removed or decoupled from the data.
Detects conceptually related content
Detects near-duplicate and structurally similar content
Detects exact copies via cryptographic hash
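The difference between exact-copy and near-duplicate detection can be illustrated with a small sketch. It assumes that similarity-preserving codes are compared by counting differing bits (Hamming distance), so that similar content yields codes with few differing bits. The hex strings below are made-up 64-bit values for illustration, not real ISCC codes, and the helper names are hypothetical.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count differing bits between two equal-length hex-encoded codes."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return bin(int(code_a, 16) ^ int(code_b, 16)).count("1")


def similarity(code_a: str, code_b: str) -> float:
    """Fraction of matching bits, from 0.0 (all differ) to 1.0 (identical)."""
    total_bits = len(code_a) * 4  # each hex digit encodes 4 bits
    return 1.0 - hamming_distance(code_a, code_b) / total_bits


# Hypothetical 64-bit codes, NOT real ISCC values.
original = "a1b2c3d4e5f60718"
near_copy = "a1b2c3d4e5f60719"  # differs from the original in a single bit
unrelated = "5e4d3c2b1a0f9e8d"

print(hamming_distance(original, near_copy))  # 1
print(similarity(original, near_copy))
print(similarity(original, unrelated))
```

An exact copy gives a distance of 0. A near-duplicate gives a small distance, and where to set the threshold is a choice left to the application.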
ISCC-ID
A persistent identifier derived from ISCC content codes. ISCC-IDs connect raw data, processed derivatives, and published figures into one auditable provenance chain.
Helps find metadata even when filenames or paths have changed.
Proves authenticity of the content and its originator.
Demonstrates when content was created or registered.
Verifies provenance across repositories, publications, and analysis pipelines.
Capabilities
Built for scientific data & bioimaging
Designed for the specific problems of large-scale imaging data in research.
ISO 24138:2024 Compliant
Follows the international standard, so codes generated anywhere are compatible everywhere — across institutions, repositories, and tools.
High Performance
Rust-based engine processes data at 1+ GB/s — up to 184× faster than the pure Python reference implementation, and faster than SHA-256.
Format Agnostic
Works with OME-TIFF, OME-Zarr, CZI, ND2, LIF, DICOM, HDF5 and virtually any binary scientific data format.
FAIR Principles
Meets the Findable, Accessible, Interoperable, and Reusable data requirements from EOSC and European funding bodies.
AI-Ready Data
Content-based identifiers survive format conversions, so provenance stays intact when datasets move into AI training pipelines.
Platform Integration
Plugins for OMERO and Galaxy are available; Napari, CellProfiler, and ImageJ/Fiji integrations are planned.
Open Source Tools
The BIOCODES toolkit
Three complementary tools covering the full bioimaging identification workflow, all Apache 2.0 licensed.
iscc-sum
Stable v0.1
High-performance ISCC Data-Code and Instance-Code generation. Single-pass processing with a Rust core and Python bindings. A drop-in replacement for md5sum and sha256sum in scientific pipelines. Faster than SHA-256 at any data size.
pip install iscc-sum
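To illustrate what single-pass processing means, here is a minimal sketch that reads a stream once and feeds every chunk to two hashers at the same time. It uses SHA-256 and BLAKE2b from the Python standard library purely as stand-ins. It does not compute ISCC Data-Codes or Instance-Codes and does not use the iscc-sum API; the function name is hypothetical.

```python
import hashlib
import io
from typing import BinaryIO


def two_digests_single_pass(
    stream: BinaryIO, chunk_size: int = 1024 * 1024
) -> tuple[str, str]:
    """Read the stream once and update two independent digests per chunk."""
    sha = hashlib.sha256()                   # stand-in for one code component
    blake = hashlib.blake2b(digest_size=32)  # stand-in for a second component
    while chunk := stream.read(chunk_size):
        sha.update(chunk)
        blake.update(chunk)
    return sha.hexdigest(), blake.hexdigest()


# In-memory example data; a real pipeline would pass open(path, "rb") instead.
data = b"example pixel data" * 1000
sha_hex, blake_hex = two_digests_single_pass(io.BytesIO(data), chunk_size=4096)
print(sha_hex)
print(blake_hex)
```

Because each chunk is read only once, the I/O cost is paid a single time no matter how many digests are derived from the data.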
iscc-bio
Beta
ISCC processing for multi-dimensional bioimage data. Implements the IMAGEWALK specification: deterministic Z→C→T plane traversal for format-agnostic, reproducible content hashing of microscopy volumes.
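The idea behind a deterministic plane traversal can be sketched as follows. This is not the IMAGEWALK specification itself. It assumes that Z→C→T means Z varies fastest, then C, then T, and it uses SHA-256 as a stand-in for ISCC hashing. All names (`walk_planes`, `volume_digest`, `make_plane`) are hypothetical.

```python
import hashlib
from itertools import product
from typing import Iterator

# A volume maps (z, c, t) plane coordinates to raw plane bytes.
Volume = dict[tuple[int, int, int], bytes]


def walk_planes(volume: Volume, size_z: int, size_c: int, size_t: int) -> Iterator[bytes]:
    """Yield planes in a fixed order: Z fastest, then C, then T (assumed)."""
    # product() varies its last argument fastest, so z is the innermost index.
    for t, c, z in product(range(size_t), range(size_c), range(size_z)):
        yield volume[(z, c, t)]


def volume_digest(volume: Volume, size_z: int, size_c: int, size_t: int) -> str:
    """Hash all planes in traversal order, independent of storage order."""
    hasher = hashlib.sha256()  # stand-in, not an ISCC code
    for plane in walk_planes(volume, size_z, size_c, size_t):
        hasher.update(plane)
    return hasher.hexdigest()


def make_plane(z: int, c: int, t: int) -> bytes:
    """Create tiny fake pixel data that encodes the plane coordinates."""
    return bytes([z, c, t]) * 4


# Two containers holding the same planes in different storage order.
keys = [(z, c, t) for z in range(2) for c in range(2) for t in range(3)]
volume_a = {k: make_plane(*k) for k in keys}
volume_b = {k: make_plane(*k) for k in reversed(keys)}

digest_a = volume_digest(volume_a, 2, 2, 3)
digest_b = volume_digest(volume_b, 2, 2, 3)
print(digest_a == digest_b)  # True
```

Because the traversal order is fixed by the plane coordinates rather than by how a file format happens to store the planes, the same pixel data produces the same digest regardless of the container.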
omero-iscc
Alpha
OMERO server plugin. Generates and stores ISCC identifiers automatically on image import, so facilities can deduplicate and track provenance without extra steps.
Community & Integrations
Built for the open bioimaging ecosystem
BIOCODES integrates with the tools researchers already use — no new infrastructure required.
OMERO
Available
Server plugin that generates ISCCs automatically on image import. Facilities get deduplication and provenance tracking without changing their existing OMERO workflows.
- Automatic ISCC generation on import
- Facility-level deduplication
- FAIR-compliant metadata annotation
Galaxy
Available
Galaxy tools for ISCC generation, near-duplicate detection, and content verification within reproducible Galaxy workflows. Part of the BMCV galaxy-image-analysis tool suite.
- iscc_sum — ISCC code generation for any file
- iscc_similarity — near-duplicate detection across datasets
- iscc_verify — content integrity verification
Napari
Planned
Plugin integration for interactive image analysis and ISCC annotation within the Napari viewer.
- Interactive ISCC annotation
- In-viewer provenance display
CellProfiler
Planned
Pipeline integration for high-content screening datasets. Deduplicates across large plate acquisitions using ISCC codes.
- Pipeline-level ISCC generation
- High-content screening deduplication
ImageJ / Fiji
Planned
Drop-in ISCC checksum support for existing ImageJ-based workflows as a replacement for standard checksum tools.
- Drop-in md5/sha256 replacement
- Macro-scriptable ISCC generation
Team
Project Team
BIOCODES is a collaboration between the ISCC Foundation, Leiden University, and German BioImaging, funded by the European Union.
ISCC Foundation
- Titusz Pan (PI), tp@iscc.io
- Kira Lemke
- Martin Etzrodt
Leiden University
- Sylvia Le Dévédec (PI), s.e.ledevedec@lacdr.leidenuniv.nl
- Maarten Paul
- Sebastian Posth
- Joost Willemse
German BioImaging
- Josh Moore