Mgautierfr (talk | contribs) (Change spec description to new namespace usage.) |
m (Real zimlib -> libzim fix) |
(4 intermediate revisions by 2 users not shown) | |||
Line 29: | Line 29: | ||
|- | |- | ||
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title | | titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title | ||
This is considered obsolete; readers should use <code>X/listing/titleordered/v0</code> instead and fall back to <code>titlePtrPos</code> if the entry is not present. | This is considered obsolete; readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if the entry is not present. | ||
|- | |- | ||
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list | | clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list | ||
Line 75: | Line 73: | ||
The URL pointer list is a list of 8 byte offsets to the directory entries. | The URL pointer list is a list of 8 byte offsets to the directory entries. | ||
The directory entries are always ordered by URL. Ordering is simply done by comparing the URL strings. | The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings. | ||
Since directory entries have variable sizes this is needed for random access. | Since directory entries have variable sizes this is needed for random access. | ||
Line 91: | Line 89: | ||
|} | |} | ||
Libzim caches directory entries and references the cached entries via the URL pointers. | |||
== Title Pointer List (titlePtrPos) == | == Title Pointer List (titlePtrPos) == | ||
The title pointer list is a list of entry indices ordered by title. The title pointer list actually points to entries in the URL pointer list. | The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list. | ||
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers. | Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers. | ||
Line 114: | Line 112: | ||
The indirection from titles via URLs to directory entries has two reasons: | The indirection from titles via URLs to directory entries has two reasons: | ||
* the pointer list is only half the size, as 4 bytes are enough for each entry | * the pointer list is only half the size, as 4 bytes are enough for each entry | ||
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in | * accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim. | ||
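The indirection described above (title pointer, then URL pointer, then directory entry) can be sketched in Python. This is only an illustration under stated assumptions, not libzim's implementation: the function name, the in-memory <code>data</code> buffer and the <code>url_ptr_pos</code> parameter are hypothetical, and integers are assumed to be stored little-endian.

```python
import struct


def entry_offset_by_title_index(data, url_ptr_pos, title_ptr_pos, title_index):
    """Return the file offset of the directory entry at a given title rank.

    Hypothetical helper for illustration only.
    Assumes little-endian integers and a file fully loaded into `data`.
    """
    # Title pointers are 4-byte entry numbers, not file offsets.
    (entry_number,) = struct.unpack_from(
        "<I", data, title_ptr_pos + 4 * title_index
    )
    # URL pointers are 8-byte file offsets to the directory entries.
    (entry_offset,) = struct.unpack_from(
        "<Q", data, url_ptr_pos + 8 * entry_number
    )
    return entry_offset


# Toy buffer: two URL pointers (offsets 100 and 200) followed by
# two title pointers (entry numbers 1 and 0).
toy = struct.pack("<QQII", 100, 200, 1, 0)
first_by_title = entry_offset_by_title_index(toy, 0, 16, 0)
second_by_title = entry_offset_by_title_index(toy, 0, 16, 1)
```

In the toy buffer, the first entry in title order is entry number 1 of the URL pointer list, so the lookup goes through the URL pointer list before reaching the directory entry offset.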
== Directory Entries == | == Directory Entries == | ||
Line 190: | Line 188: | ||
The first byte of the cluster identifies some information about the cluster. | The first byte of the cluster identifies some information about the cluster. | ||
The four low bits identify if the cluster | The four low bits identify the cluster compression type: | ||
* | * No compression is indicated by a value of 1 | ||
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) | * Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) or 5 (Zstandard compression). | ||
* There have been other compression algorithms used before | * There have been other compression algorithms used before which have been removed: 2 for zlib and 3 for bzip2. | ||
The | * 0 is an obsolete code for no compression (inherited from Zeno) | ||
The fifth bit identifies whether the cluster is extended or not: | |||
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GiB. | * By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GiB. | ||
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GiB. | * If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GiB. | ||
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5), clusters are never extended. | A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5), clusters are never extended. | ||
The | libzim uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of LZMA2; for Java, see [http://tukaani.org/xz/java.html XZ-Java]. | ||
To locate the data of a specific directory entry within a cluster, the uncompressed cluster holds, right after the first byte, a list of pointers to the blobs it contains. | To locate the data of a specific directory entry within a cluster, the uncompressed cluster holds, right after the first byte, a list of pointers to the blobs it contains. | ||
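The layout of the first cluster byte described above can be decoded as in the following Python sketch. The function name and the <code>COMPRESSION_NAMES</code> table are hypothetical names chosen for illustration; the numeric codes are the ones listed in this section.

```python
# Compression codes as listed in this section (names are illustrative).
COMPRESSION_NAMES = {
    0: "none (obsolete code)",
    1: "none",
    2: "zlib (removed)",
    3: "bzip2 (removed)",
    4: "lzma2 (xz)",
    5: "zstd",
}


def decode_cluster_info(info_byte):
    """Split the first cluster byte into its documented fields.

    Returns (compression_code, extended, offset_size).
    Hypothetical helper for illustration only.
    """
    compression_code = info_byte & 0x0F   # four low bits
    extended = bool(info_byte & 0x10)     # fifth bit
    offset_size = 8 if extended else 4    # OFFSET_SIZE in bytes
    return compression_code, extended, offset_size


code, extended, offset_size = decode_cluster_info(0x14)
compression_name = COMPRESSION_NAMES.get(code, "unknown")
```

A reader would typically check <code>extended</code> against the ZIM major version before trusting 8-byte offsets, since extended clusters are only allowed with major version 6.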
Line 206: | Line 206: | ||
! Field Name !! Type !!Offset!!Length!! Description | ! Field Name !! Type !!Offset!!Length!! Description | ||
|- | |- | ||
| cluster information || integer || 0 || 1 || Four low bits: | | cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed | ||
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) | Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) | ||
|- | |- | ||
Line 269: | Line 269: | ||
If you use a common rendering engine or HTML widget, you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget. | If you use a common rendering engine or HTML widget, you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget. | ||
Should you render the article contents yourself, you have to consider this and take care of it before you hand requests to | Should you render the article contents yourself, you have to consider this and take care of it before you hand requests to libzim. | ||
== Encodings == | == Encodings == |