- Source: Zarr (data format)
Zarr is an open standard for storing large multidimensional array data. It specifies a protocol and data format, and is designed to be "cloud ready" including random access, by dividing data into subsets referred to as chunks.
Zarr can be used within many programming languages, including Python, Java, JavaScript, C++ and Julia.
It has been used by organisations such as Google, Microsoft to publish large datasets.
Zarr is designed to support high-throughput distributed I/O on different storage systems, which is a common requirement in cloud computing. Multiple read operations can efficiently occur to a Zarr array in parallel, or multiple write operations in parallel.
Format description
The main data format in Zarr is multidimensional arrays. For parallelisable access, these arrays are stored and accessed as a grid of so-called "chunks". The actual data format on disk depends on the compressor and storage plugins selected by the user.
Zarr's design was influenced by that of HDF5, and so it includes similar features for metadata and grouping: arrays can be grouped into named hierarchies, and they can also be annotated with key-value metadata stored alongside the array.
Applications
For bioimaging such as microscopy, a consortium called the Open Microscopy Environment (OME) created a format called "OME-Zarr", based on Zarr with some discipline-specific extensions.
Similarly, Zarr is being used to publish weather and satellite data and energy data, among others.
See also
NetCDF
Dask
References
External links
Official website