Difference between revisions of "ZIM file format"

Jump to navigation Jump to search
148 bytes removed ,  10:07, 15 March 2023
m
Real zimlib -> libzim fix
m
m (Real zimlib -> libzim fix)
(3 intermediate revisions by 2 users not shown)
Line 30: Line 30:
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title
This is considered as obsolete, readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fallback to <code>titlePtrPos</code> if entry is not present.
This is considered as obsolete, readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fallback to <code>titlePtrPos</code> if entry is not present.
Always valid for now, but it may be set to 0 in the future if <code>titlePtrPos</code> is not present.                 
|-
|-
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list                 
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list                 
Line 91: Line 89:
|}
|}


Zimlib caches directory entries and references the cached entries via the URL pointers.
Libzim caches directory entries and references the cached entries via the URL pointers.


== Title Pointer List (titlePtrPos) ==
== Title Pointer List (titlePtrPos) ==
Line 114: Line 112:
The indirection from titles via URLs to directory entries has two reasons:
The indirection from titles via URLs to directory entries has two reasons:
* the pointer list is only half in size as 4 bytes are enough for each entry
* the pointer list is only half in size as 4 bytes are enough for each entry
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in zimlib.
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.


== Directory Entries ==
== Directory Entries ==
Line 190: Line 188:
The first byte of the cluster identifies some information about the cluster.
The first byte of the cluster identifies some information about the cluster.


The first fourth low bits identifies if the cluster is compressed (4) or not (0):
The first fourth low bits identifies if the cluster compression type:
* The default is uncompressed indicated by a value of 0 or 1 (obsoleted, inherited by Zeno).
* No compression is indicated by a value of 1
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is a XZ header)) and 5 (Zstandard compression).
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is a XZ header)) or 5 (Zstandard compression).
* There have been other compression algorithms used before (2: zlib, 3: bzip2) which have been removed.
* There have been other compression algorithms used before which have been removed: 2 for zlib and 3 for bzip2.
The firth bit identifies if the cluster is extended or not :
* 0 is an obselete code for no compression (inhereted from the Zeno)
 
The fifth bit identifies the cluster is extended or not :
* By default (5th bit == 0) the cluster is not extended. It means that the offsets are stored in a 4 bytes length integer. Thus contents stored in the cluster cannot exceed 4Go.
* By default (5th bit == 0) the cluster is not extended. It means that the offsets are stored in a 4 bytes length integer. Thus contents stored in the cluster cannot exceed 4Go.
* If the cluster is extended (5th bit == 1), the offsets are stored in 8 bytes length integer. Thus contents stored in the cluster can exceed 4Go.
* If the cluster is extended (5th bit == 1), the offsets are stored in 8 bytes length integer. Thus contents stored in the cluster can exceed 4Go.
A cluster can be extended only if the zim major version is 6. Else (major version == 5) cluster will always be not extended.
A cluster can be extended only if the zim major version is 6. Else (major version == 5) cluster will always be not extended.


The zimlib uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://tukaani.org/xz/java.html XZ-Java].
The libzim uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://tukaani.org/xz/java.html XZ-Java].


To find the data of a specific directory entry within a cluster the uncompressed cluster has a list of pointers to blobs within the uncompressed cluster after the first byte.
To find the data of a specific directory entry within a cluster the uncompressed cluster has a list of pointers to blobs within the uncompressed cluster after the first byte.
Line 206: Line 206:
! Field Name !! Type !!Offset!!Length!! Description                 
! Field Name !! Type !!Offset!!Length!! Description                 
|-
|-
| cluster information || integer || 0 || 1 || Fourth low bits : 0: default (no compression), 1: none (inherited from Zeno), 4: LZMA2 compressed, 5: zstd compressed
| cluster information || integer || 0 || 1 || Fourth low bits : 1: no compression, 4: LZMA2 compressed, 5: zstd compressed
Firth bits : 0: normal (OFFSET_SIZE=4) 1: extended (OFFSET_SIZE=8)               
Firth bits : 0: normal (OFFSET_SIZE=4) 1: extended (OFFSET_SIZE=8)               
|-
|-
Line 269: Line 269:
If you use a common rendering engine or HTML widget you don't have to care for this cases, you can just use the requests as they are submitted by the engine / widget.
If you use a common rendering engine or HTML widget you don't have to care for this cases, you can just use the requests as they are submitted by the engine / widget.


Should you render the article contents by yourself you have to consider this and take care of it before you hand requests to zimlib.
Should you render the article contents by yourself you have to consider this and take care of it before you hand requests to libzim.


== Encodings ==
== Encodings ==

Navigation menu