Difference between revisions of "ZIM file format"

40 bytes removed ,  07:44, 23 April 2021
→‎Clusters: Clarification around the cluster compression types
(Change spec description to new namespace usage.)
(→‎Clusters: Clarification around the cluster compression types)
(2 intermediate revisions by one other user not shown)
Line 29: Line 29:
|-
|-
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title
This is considered as obsolete, readers should use <code>X/listing/titleordered/v0</code> instead and fallback to <code>titlePtrPos</code> if entry is not present.
This field is considered obsolete; readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.
 
Always valid for now, but it may be set to 0 in the future if <code>titlePtrPos</code> is not present.                 
|-
|-
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list                 
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list                 
Line 75: Line 73:
The URL pointer list is a list of 8 byte offsets to the directory entries.
The URL pointer list is a list of 8 byte offsets to the directory entries.


The directory entries are always ordered by URL. Ordering is simply done by comparing the URL strings.
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.


Since directory entries have variable sizes this is needed for random access.
Since directory entries have variable sizes this is needed for random access.
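As an illustration of the random access described above, here is a minimal Python sketch that reads the URL pointer list from a buffer. The function name <code>read_url_pointers</code>, the parameter names and the sample buffer are hypothetical. The sketch also assumes that offsets are stored as little-endian unsigned integers, which this section does not state.

```python
import struct


def read_url_pointers(data: bytes, url_ptr_pos: int, entry_count: int) -> list:
    """Return the 8-byte directory entry offsets stored at url_ptr_pos.

    Assumption: integers are little-endian and unsigned ("<Q").
    """
    return list(struct.unpack_from(f"<{entry_count}Q", data, url_ptr_pos))


# Synthetic buffer: 16 bytes of padding followed by three 8-byte offsets.
blob = bytes(16) + struct.pack("<3Q", 100, 164, 260)
offsets = read_url_pointers(blob, url_ptr_pos=16, entry_count=3)

# The n-th directory entry (in URL order) starts at offsets[n], whatever
# the sizes of the entries stored before it.
second_entry_offset = offsets[1]
```

Because every pointer has the same size, the position of the n-th pointer is simply <code>url_ptr_pos + 8 * n</code>, so a reader can also fetch a single pointer without loading the whole list.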
Line 94: Line 92:


== Title Pointer List (titlePtrPos) ==
== Title Pointer List (titlePtrPos) ==
The title pointer list is a list of entry indices ordered by title. The title pointer list actually points to entries in the URL pointer list.
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.


Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.
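The two-step lookup described above (title pointer, then URL pointer, then directory entry) could be sketched as follows. The helper name <code>resolve_title_index</code> and the sample data are hypothetical, and little-endian byte order is an assumption that this section does not state.

```python
import struct


def resolve_title_index(data: bytes, title_ptr_pos: int, url_ptr_pos: int,
                        title_index: int) -> tuple:
    """Map a position in title order to (entry number, directory entry offset).

    Assumption: all integers are little-endian and unsigned.
    """
    # Title pointers are 4-byte entry numbers, not file offsets.
    (entry_number,) = struct.unpack_from(
        "<I", data, title_ptr_pos + 4 * title_index)
    # The entry number indexes the URL pointer list, whose items are
    # 8-byte offsets to the directory entries.
    (entry_offset,) = struct.unpack_from(
        "<Q", data, url_ptr_pos + 8 * entry_number)
    return entry_number, entry_offset


# Synthetic data: three URL pointers followed by three title pointers.
url_list = struct.pack("<3Q", 100, 164, 260)
title_list = struct.pack("<3I", 2, 0, 1)
blob = url_list + title_list

first_by_title = resolve_title_index(
    blob, title_ptr_pos=len(url_list), url_ptr_pos=0, title_index=0)
```

In this sample the first entry in title order is entry number 2 in URL order, so the lookup goes through the third URL pointer.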
Line 190: Line 188:
The first byte of the cluster identifies some information about the cluster.
The first byte of the cluster identifies some information about the cluster.


The first fourth low bits identifies if the cluster is compressed (4) or not (0):
The four low bits identify the cluster compression type:
* The default is uncompressed indicated by a value of 0 or 1 (obsoleted, inherited by Zeno).
* No compression is indicated by a value of 1
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is a XZ header)) and 5 (Zstandard compression).
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) or 5 (Zstandard compression).
* There have been other compression algorithms used before (2: zlib, 3: bzip2) which have been removed.
* There have been other compression algorithms used before which have been removed: 2 for zlib and 3 for bzip2.
The firth bit identifies if the cluster is extended or not :
* 0 is an obsolete code for no compression (inherited from Zeno).
 
The fifth bit identifies whether the cluster is extended or not:
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the content stored in the cluster cannot exceed 4 GB.
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the content stored in the cluster cannot exceed 4 GB.
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the content stored in the cluster can exceed 4 GB.
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the content stored in the cluster can exceed 4 GB.
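The decoding of the cluster information byte described above can be sketched in Python as follows. The function name <code>parse_cluster_info</code> and the table <code>COMPRESSION_NAMES</code> are hypothetical. The sketch assumes that "the four low bits" correspond to the mask <code>0x0F</code> and "the fifth bit" to the mask <code>0x10</code>.

```python
# Compression codes as listed in this section.
COMPRESSION_NAMES = {
    0: "none (obsolete code, inherited from Zeno)",
    1: "none",
    2: "zlib (removed)",
    3: "bzip2 (removed)",
    4: "LZMA2 (XZ)",
    5: "Zstandard",
}


def parse_cluster_info(info_byte: int) -> tuple:
    """Split the first byte of a cluster into its two fields.

    Returns (compression code, extended flag, offset size in bytes).
    """
    compression = info_byte & 0x0F       # four low bits
    extended = bool(info_byte & 0x10)    # fifth bit
    offset_size = 8 if extended else 4   # OFFSET_SIZE
    return compression, extended, offset_size


compression, extended, offset_size = parse_cluster_info(0x15)
name = COMPRESSION_NAMES.get(compression, "unknown")
```

With the sample byte <code>0x15</code>, the low bits give compression type 5 (Zstandard) and the fifth bit marks the cluster as extended, so its offsets are 8 bytes long.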
Line 206: Line 206:
! Field Name !! Type !!Offset!!Length!! Description                 
! Field Name !! Type !!Offset!!Length!! Description                 
|-
|-
| cluster information || integer || 0 || 1 || Fourth low bits : 0: default (no compression), 1: none (inherited from Zeno), 4: LZMA2 compressed, 5: zstd compressed
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8)
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8)
|-
|-
