Difference between revisions of "ZIM file format"

40 bytes removed, 10:07, 15 March 2023
m
Real zimlib -> libzim fix
(Change spec description to new namespace usage.)
m (Real zimlib -> libzim fix)
(4 intermediate revisions by 2 users not shown)
Line 29: Line 29:
|-
|-
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title
This is considered as obsolete, readers should use <code>X/listing/titleordered/v0</code> instead and fallback to <code>titlePtrPos</code> if entry is not present.
This is considered obsolete; readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.
 
Always valid for now, but it may be set to 0 in the future if <code>titlePtrPos</code> is not present.                 
|-
|-
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list                 
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list                 
Line 75: Line 73:
The URL pointer list is a list of 8 byte offsets to the directory entries.
The URL pointer list is a list of 8 byte offsets to the directory entries.


The directory entries are always ordered by URL. Ordering is simply done by comparing the URL strings.
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.


Since directory entries have variable sizes this is needed for random access.
Since directory entries have variable sizes this is needed for random access.
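The random access described above can be sketched in Python as follows. This is only an illustration, not libzim's implementation: the parameter names <code>url_ptr_pos</code> and <code>entry_count</code> and the callback <code>full_url_at</code> are hypothetical, little-endian integers are assumed, and "comparing the URL strings" is assumed to be a plain string comparison.

```python
import struct
from typing import Callable, List, Optional


def read_url_pointers(data: bytes, url_ptr_pos: int, entry_count: int) -> List[int]:
    """Return the file offset of every directory entry, in URL order.

    Assumes each pointer is an 8-byte little-endian unsigned integer.
    """
    return list(struct.unpack_from(f"<{entry_count}Q", data, url_ptr_pos))


def find_entry_index(full_url: str, entry_count: int,
                     full_url_at: Callable[[int], str]) -> Optional[int]:
    """Binary search for a "full" URL (<namespace><path>).

    full_url_at(i) is a hypothetical callback that follows the i-th URL
    pointer, parses the directory entry there and returns its full URL.
    """
    lo, hi = 0, entry_count
    while lo < hi:
        mid = (lo + hi) // 2
        if full_url_at(mid) < full_url:
            lo = mid + 1
        else:
            hi = mid
    if lo < entry_count and full_url_at(lo) == full_url:
        return lo
    return None
```

Because the entries themselves have variable sizes, the binary search only works through the fixed-size pointers: each probe costs one pointer lookup plus one entry parse.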
Line 91: Line 89:
|}
|}


Zimlib caches directory entries and references the cached entries via the URL pointers.
Libzim caches directory entries and references the cached entries via the URL pointers.


== Title Pointer List (titlePtrPos) ==
== Title Pointer List (titlePtrPos) ==
The title pointer list is a list of entry indices ordered by title. The title pointer list actually points to entries in the URL pointer list.
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.


Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.
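A minimal sketch of the two-step lookup, assuming little-endian integers and a ZIM file already loaded into memory. The function and parameter names are hypothetical and chosen only for illustration.

```python
import struct


def entry_offset_by_title_index(data: bytes, title_ptr_pos: int,
                                url_ptr_pos: int, title_index: int) -> int:
    """Resolve the title_index-th title pointer to a directory entry offset."""
    # Step 1: a title pointer is a 4-byte entry number, not a file offset.
    (entry_number,) = struct.unpack_from("<I", data, title_ptr_pos + 4 * title_index)
    # Step 2: the entry number indexes the URL pointer list (8-byte offsets).
    (entry_offset,) = struct.unpack_from("<Q", data, url_ptr_pos + 8 * entry_number)
    return entry_offset
```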
Line 114: Line 112:
The indirection from titles via URLs to directory entries has two reasons:
The indirection from titles via URLs to directory entries has two reasons:
* the pointer list is only half in size as 4 bytes are enough for each entry
* the pointer list is only half in size as 4 bytes are enough for each entry
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in zimlib.
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.


== Directory Entries ==
== Directory Entries ==
Line 190: Line 188:
The first byte of the cluster identifies some information about the cluster.
The first byte of the cluster identifies some information about the cluster.


The first fourth low bits identifies if the cluster is compressed (4) or not (0):
The four low bits identify the cluster compression type:
* The default is uncompressed indicated by a value of 0 or 1 (obsoleted, inherited by Zeno).
* No compression is indicated by a value of 1
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is a XZ header)) and 5 (Zstandard compression).
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]], or more precisely XZ, since there is an XZ header) or 5 (Zstandard compression).
* There have been other compression algorithms used before (2: zlib, 3: bzip2) which have been removed.
* There have been other compression algorithms used before which have been removed: 2 for zlib and 3 for bzip2.
The firth bit identifies if the cluster is extended or not :
* 0 is an obsolete code for no compression (inherited from Zeno)
 
The fifth bit identifies whether the cluster is extended or not:
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GiB.
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GiB.
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GiB.
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GiB.
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.
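As an illustration, the first byte of a cluster could be decoded as in the following sketch. The function name and the label strings are hypothetical; the bit layout follows the description above (four low bits for the compression type, fifth bit for the extended flag).

```python
from typing import Tuple

# Labels for the compression codes listed above (wording is illustrative).
COMPRESSION_NAMES = {
    0: "none (obsolete code inherited from Zeno)",
    1: "none",
    2: "zlib (removed)",
    3: "bzip2 (removed)",
    4: "LZMA2 / XZ",
    5: "Zstandard",
}


def parse_cluster_info(info_byte: int) -> Tuple[int, bool, int]:
    """Split the cluster information byte into its parts.

    Returns (compression code, extended flag, offset size in bytes).
    """
    compression = info_byte & 0x0F        # four low bits
    extended = bool(info_byte & 0x10)     # fifth bit
    offset_size = 8 if extended else 4    # OFFSET_SIZE
    return compression, extended, offset_size
```

For example, <code>parse_cluster_info(0x15)</code> describes an extended, Zstandard-compressed cluster with 8-byte offsets.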


The zimlib uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://tukaani.org/xz/java.html XZ-Java].
Libzim uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2; for Java, see [http://tukaani.org/xz/java.html XZ-Java].


To locate the data of a specific directory entry within a cluster, the uncompressed cluster contains, right after the first byte, a list of pointers to the blobs within that cluster.
To locate the data of a specific directory entry within a cluster, the uncompressed cluster contains, right after the first byte, a list of pointers to the blobs within that cluster.
Line 206: Line 206:
! Field Name !! Type !!Offset!!Length!! Description                 
! Field Name !! Type !!Offset!!Length!! Description                 
|-
|-
| cluster information || integer || 0 || 1 || Fourth low bits : 0: default (no compression), 1: none (inherited from Zeno), 4: LZMA2 compressed, 5: zstd compressed
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8)
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8)
|-
|-
Line 269: Line 269:
If you use a common rendering engine or HTML widget you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.
If you use a common rendering engine or HTML widget you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.


Should you render the article contents by yourself you have to consider this and take care of it before you hand requests to zimlib.
Should you render the article contents by yourself you have to consider this and take care of it before you hand requests to libzim.


== Encodings ==
== Encodings ==
