Difference between revisions of "ZIM file format"

Enhance explanations around URLs encoding in ZIM / HTML document
(Remove the idea that titlePtrPos may be set to zero.W)
(Enhance explanations around URLs encoding in ZIM / HTML document)
(18 intermediate revisions by 2 users not shown)
Line 12: Line 12:
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)
|-
|-
|majorVersion
| [[#Major_.26_Minor_versions|majorVersion]]
|integer
| integer
|4
| 4
|2
| 2
|Major version of the ZIM archive format (6)
| Major version of the ZIM archive format. Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)
|-
|-
| minorVersion || integer || 6 || 2 || Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage)                      
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format. Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1)
|-
|-
| uuid || integer || 8 || 16 || unique id of this zim archive                           
| uuid || integer || 8 || 16 || unique id of this zim archive                           
Line 42: Line 42:
|}
|}
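
The header fields listed above can be decoded with a few lines of code. The following is a minimal sketch, not the libzim implementation: it assumes integers are stored little-endian, and the function name <code>parse_header_start</code> and the returned dictionary keys are hypothetical.

```python
import struct
import uuid

ZIM_MAGIC = 72173914  # 0x44D495A, as given in the header table


def parse_header_start(data: bytes) -> dict:
    """Decode magicNumber, majorVersion, minorVersion and uuid.

    Assumption: all integers are little-endian.
    """
    if len(data) < 24:
        raise ValueError("need at least 24 bytes of header")
    # offset 0: 4-byte magic, offset 4: 2-byte major, offset 6: 2-byte minor
    magic, major, minor = struct.unpack_from("<IHH", data, 0)
    if magic != ZIM_MAGIC:
        raise ValueError("magic number mismatch, not a ZIM archive")
    # offset 8: 16-byte unique id of the archive
    return {"major": major, "minor": minor, "uuid": uuid.UUID(bytes=data[8:24])}


# Usage with a synthetic header (not taken from a real archive)
sample = struct.pack("<IHH", ZIM_MAGIC, 6, 1) + bytes(range(16))
header = parse_header_start(sample)
print(header["major"], header["minor"])
```

For an archive embedded in another file, the same function applies once the bytes are read starting at the embedding offset.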


Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)
A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is the offset 0. Readers allowing to read an embedded archive must adapt offset accordingly.


Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1)
=== Major & Minor versions ===


The current major version is 6. You may find old ZIM archives with major version 5. They are the same as version 6 minus extended clusters, so you can read a major version 5 archive as if it were version 6.
Versioning of the file format specification has not been done [https://semver.org/ rigorously] until version 5.


The minor version can be :
Before version 5, there was only one version number and no Major vs Minor distinction.
* 0 : We use the old namespace usage (see [[ZIM file format old namespace]])
 
* 1 : We use the new namespace usage (described here).
{| class="wikitable"
A zim archive may be embedded in another file at a specific offset. In the context of zim format, the start of the zim header is the offset 0. Readers allowing to read an embedded archive must adapt offset accordingly.
|+ ZIM format versions
|-
! Major !! Minor !! Backward compatible !! Description
!libzim version
|-
| colspan="2" | 0 || no || ''The features of this version have not been tracked properly''
|Unknown
|-
| colspan="2" | 1 || no || ''The features of this version have not been tracked properly''
|Unknown
|-
| colspan="2" | 2 || no || ''The features of this version have not been tracked properly''
|Unknown
|-
| colspan="2" | 3 || no || ''The features of this version have not been tracked properly''
|Unknown
|-
| colspan="2" | 4 || no ||''The features of this version have not been tracked properly''
|Unknown
|-
| colspan="2" | 5 || yes || Introduces:
- Url index (was only title indexed before)
 
- MimeList Pos
|From date 2009-11-29
Until 6.3.1 (included)
|-
|5
|0
|yes
|Introduces:
- Notion of Major and Minor version
|From 3.2.0
|-
| rowspan="3" |6|| 0 || no  || Introduces extended clusters
Still uses [[ZIM file format old namespace|"old" namespaces]]
|From 3.2.0
|-
|                    1 || yes  || Introduces [[#Namespaces|"new" namespaces]] scheme
|From 7.0.0
|-
|                    2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)
|From 9.1.0
|}
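
The major/minor rule can be expressed as a small check. This is only a sketch of the rule as stated in this section: the function name <code>check_version</code> and its return labels are hypothetical, and treating every major version mismatch as unreadable is a simplifying assumption (a real reader may still support older major versions).

```python
def check_version(archive, reader):
    """Compare (major, minor) of an archive with the version a reader was built for."""
    archive_major, archive_minor = archive
    reader_major, reader_minor = reader
    if archive_major != reader_major:
        # Major versions mark incompatible changes (assumption: no cross-major support).
        return "incompatible"
    if archive_minor > reader_minor:
        # Minor versions mark compatible changes, so the archive stays readable.
        return "readable-newer-minor"
    return "readable"


print(check_version((6, 2), (6, 1)))
print(check_version((7, 0), (6, 2)))
```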


== MIME Type List (mimeListPos) ==
== MIME Type List (mimeListPos) ==
Line 89: Line 132:
|}
|}


Zimlib caches directory entries and references the cached entries via the URL pointers.
Libzim caches directory entries and references the cached entries via the URL pointers.


== Title Pointer List (titlePtrPos) ==
== Title Pointer List (titlePtrPos) ==
Line 112: Line 155:
The indirection from titles via URLs to directory entries has two reasons:
The indirection from titles via URLs to directory entries has two reasons:
* the pointer list is only half the size, as 4 bytes are enough for each entry
* the pointer list is only half the size, as 4 bytes are enough for each entry
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in zimlib.
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.


== Directory Entries ==
== Directory Entries ==
Line 162: Line 205:
| parameter || data || ||see parameter len|| (not used) extra parameters                         
| parameter || data || ||see parameter len|| (not used) extra parameters                         
|}
|}
None of the strings should have control characters from U+0000 through U+001F.


=== Linktarget or deleted Entry (DEPRECATED) ===
=== Linktarget or deleted Entry (DEPRECATED) ===
Line 188: Line 233:
The first byte of the cluster identifies some information about the cluster.
The first byte of the cluster identifies some information about the cluster.


The four low bits identify whether the cluster is compressed (4) or not (0):
The four low bits identify the cluster compression type:
* The default is uncompressed, indicated by a value of 0 or 1 (obsolete, inherited from Zeno).
* No compression is indicated by a value of 1
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) and 5 (Zstandard compression).
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) or 5 (Zstandard compression).
* There have been other compression algorithms used before (2: zlib, 3: bzip2) which have been removed.
* There have been other compression algorithms used before which have been removed: 2 for zlib and 3 for bzip2.
The fifth bit identifies whether the cluster is extended or not:
* 0 is an obsolete code for no compression (inherited from Zeno)
 
The fifth bit identifies whether the cluster is extended or not:
* By default (5th bit == 0) the cluster is not extended. The offsets are then stored as 4-byte integers, so the content stored in the cluster cannot exceed 4 GiB.
* By default (5th bit == 0) the cluster is not extended. The offsets are then stored as 4-byte integers, so the content stored in the cluster cannot exceed 4 GiB.
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the content stored in the cluster can exceed 4 GiB.
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the content stored in the cluster can exceed 4 GiB.
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5), clusters are never extended.
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5), clusters are never extended.
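
The layout of the first cluster byte can be illustrated as follows. This is a minimal sketch: the function name <code>decode_cluster_info</code> and the label strings are hypothetical, and the "5th bit" is taken to be the bit with value 0x10.

```python
# Compression codes as listed above; 2 and 3 are kept only to label old archives.
COMPRESSION = {
    0: "none (obsolete code)",
    1: "none",
    2: "zlib (removed)",
    3: "bzip2 (removed)",
    4: "lzma2/xz",
    5: "zstd",
}


def decode_cluster_info(info_byte: int):
    """Split the cluster information byte into compression, extended flag and offset size."""
    compression = info_byte & 0x0F      # four low bits
    extended = bool(info_byte & 0x10)   # fifth bit
    offset_size = 8 if extended else 4  # OFFSET_SIZE in bytes
    return COMPRESSION.get(compression, "unknown"), extended, offset_size


print(decode_cluster_info(0x15))
print(decode_cluster_info(0x04))
```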


Zimlib uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of LZMA2; for Java, see [http://tukaani.org/xz/java.html XZ-Java].
Libzim uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of LZMA2; for Java, see [http://tukaani.org/xz/java.html XZ-Java].


To find the data of a specific directory entry within a cluster the uncompressed cluster has a list of pointers to blobs within the uncompressed cluster after the first byte.
To find the data of a specific directory entry within a cluster the uncompressed cluster has a list of pointers to blobs within the uncompressed cluster after the first byte.
Line 204: Line 251:
! Field Name !! Type !!Offset!!Length!! Description                 
! Field Name !! Type !!Offset!!Length!! Description                 
|-
|-
| cluster information || integer || 0 || 1 || Four low bits: 0: default (no compression), 1: none (inherited from Zeno), 4: LZMA2 compressed, 5: zstd compressed
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8)
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8)
|-
|-
Line 251: Line 298:
== URLs ==
== URLs ==


=== URL Encoding ===
=== URL Encoding in the ZIM ===
The URLs in the UrlPointerlist are utf-8 and are not url encoded (https://www.ietf.org/rfc/rfc1738.txt)
The URLs in the UrlPointerlist are encoded in utf-8 and are '''not''' url encoded.
 
For instance, if you store in the ZIM an HTML document with a href pointing to `characters%20%C3%A9ncoding.html`, you have to store the corresponding ZIM entry at `characters éncoding.html` URL.
 
Or if you want to store a ZIM entry at `index.html?param=value`, the HTML document pointing to it will have to use the `index.html%3Fparam%3Dvalue` href.
 
The reason behind it is that libzim is agnostic of which kind of content and which kind of readers will be used. Everything around URL encoding is purely linked to HTTP / HTML / Web standards.
 
When serving web content (which is usually the case), some readers process the requests and already do the url decoding internally, whereas most readers will handle the URLs directly.
 
The same applies to the query string, which might be absorbed by some web servers and not passed to libzim.


Some readers process the requests that already do the url decoding internally whereas most readers will handle the URLs directly. In this case you have to do the decoding before you pass the parameter to libzim.
In any case, the reader will have to do the HTTP URL decoding before passing the parameter to libzim.
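
The two examples above can be reproduced with Python's standard <code>urllib.parse</code> module. This is a sketch of what a reader or a ZIM creator might do, not libzim code; the helper names <code>href_to_zim_path</code> and <code>zim_path_to_href</code> are hypothetical.

```python
from urllib.parse import quote, unquote


def href_to_zim_path(href: str) -> str:
    """URL-decode an href to get the path under which the ZIM entry is stored."""
    return unquote(href)


def zim_path_to_href(path: str) -> str:
    """URL-encode a ZIM entry path so it can be used as an href in an HTML document."""
    # "/" is kept as is; "?" and "=" are encoded so they are not read as a query string.
    return quote(path, safe="/")


print(href_to_zim_path("characters%20%C3%A9ncoding.html"))
print(zim_path_to_href("index.html?param=value"))
```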


=== Local Anchors ===
=== Local Anchors ===
Many articles - especially when a table of contents is used - use local anchors to jump within an article.   
Many HTML hrefs - especially when a table of contents is used - use local anchors to jump within a document.


<pre>
<pre>
Line 263: Line 320:
</pre>
</pre>


The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.
When a web browser is used as a reader, it handles these local anchors client-side. The anchor is never sent to the web server, let alone to libzim. The browser will determine by itself whether another ZIM entry has to be loaded (local anchor inside a document other than the one currently shown) and will send a request with only the document URL, without the local anchor (in our example "foo"). After the document has been loaded, the browser will then search for the local anchor tag and jump to the right location.


If you use a common rendering engine or HTML widget, you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.
If you use a common rendering engine or HTML widget, you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.


Should you render the article contents yourself, you have to take this into account and handle local anchors before you hand requests to zimlib.
Should you render the article contents yourself, you have to take this into account and handle local anchors before you hand requests to libzim.
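
For a reader that renders content itself, separating the local anchor from the entry URL could look like the sketch below. It uses <code>urldefrag</code> from Python's standard library; the helper name <code>split_request</code> and the anchor name "section" are hypothetical.

```python
from urllib.parse import unquote, urldefrag


def split_request(href: str):
    """Return (entry path to look up in the ZIM, local anchor to scroll to)."""
    # Split first, decode second, so an encoded "%23" in the path is not taken for an anchor.
    url, fragment = urldefrag(href)
    return unquote(url), fragment


print(split_request("foo#section"))
```

Only the first element of the result would be handed to the library; the second stays in the reader for scrolling.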


== Encodings ==
== Encodings ==