Difference between revisions of "ZIM file format"

Enhance explanations around URLs encoding in ZIM / HTML document
m (Real zimlib -> libzim fix)
(Enhance explanations around URLs encoding in ZIM / HTML document)
(15 intermediate revisions by 2 users not shown)
Line 12: Line 12:
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)
|-
|-
|majorVersion
| [[#Major_.26_Minor_versions|majorVersion]]
|integer
| integer
|4
| 4
|2
| 2
|Major version of the ZIM archive format (6)
| Major version of the ZIM archive format. Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)
|-
|-
| minorVersion || integer || 6 || 2 || Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage)                      
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format. Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1)
|-
|-
| uuid || integer || 8 || 16 || unique id of this zim archive                           
| uuid || integer || 8 || 16 || unique id of this zim archive                           
Line 42: Line 42:
|}
|}


Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)
A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adjust every offset accordingly.
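
As a minimal sketch of the offset rule above, the following Python snippet parses the first four header fields (magicNumber, majorVersion, minorVersion, uuid) of an archive that starts at an arbitrary position inside a container. The function name <code>read_zim_header_start</code> and the <code>archive_offset</code> parameter are hypothetical, and little-endian byte order is an assumption not stated in this section.

```python
import struct
import uuid

ZIM_MAGIC = 72173914  # 0x44D495A, as given in the header table


def read_zim_header_start(data: bytes, archive_offset: int = 0) -> dict:
    """Parse the first 24 header bytes of a ZIM archive embedded at archive_offset."""
    # Assumption: integers are little-endian ("<").
    # I = 4-byte magicNumber, H = 2-byte majorVersion, H = 2-byte minorVersion,
    # 16s = 16 raw bytes of uuid. Field offsets 0, 4, 6 and 8 are relative to
    # the start of the ZIM header, hence the shift by archive_offset.
    magic, major, minor, raw_uuid = struct.unpack_from("<IHH16s", data, archive_offset)
    if magic != ZIM_MAGIC:
        raise ValueError("no ZIM header at offset %d" % archive_offset)
    return {"major": major, "minor": minor, "uuid": uuid.UUID(bytes=raw_uuid)}


# Fabricated demo data: 10 bytes of unrelated content, then a fake header start.
prefix = b"\x00" * 10
header = struct.pack("<IHH16s", ZIM_MAGIC, 6, 1, bytes(range(16)))
info = read_zim_header_start(prefix + header, archive_offset=10)
print(info["major"], info["minor"])  # prints "6 1"
```

Any other position stored in the header would need the same shift before seeking in the container file.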


Minor version is updated when an compatible change is integrated (a lib made for a minor version n will be able to read a version n+1)
=== Major & Minor versions ===


The current major version is 6. You may found old zim archives with major version 5. They are the same than 6 less extended cluster, so you can read a 5 major version as if it was a 6.
Versioning of the file format specification was not done [https://semver.org/ rigorously] until version 5.


The minor version can be :
Before version 5, there was only one version number and no Major vs Minor distinction.
* 0 : We use the old namespace usage (see [[ZIM file format old namespace]])
 
* 1 : We use the new namespace usage (describe here).
{| class="wikitable"
A zim archive may be embedded in another file at a specific offset. In the context of zim format, the start of the zim header is the offset 0. Readers allowing to read an embedded archive must adapt offset accordingly.
|+ ZIM format versions
|-
! Major !! Minor !! Backward compatible !! Description
!libzim version
|-
| colspan="2" | 0 || no || ''The features of this version were not tracked properly''
|Unknown
|-
| colspan="2" | 1 || no || ''The features of this version were not tracked properly''
|Unknown
|-
| colspan="2" | 2 || no || ''The features of this version were not tracked properly''
|Unknown
|-
| colspan="2" | 3 || no || ''The features of this version were not tracked properly''
|Unknown
|-
| colspan="2" | 4 || no || ''The features of this version were not tracked properly''
|Unknown
|-
| colspan="2" | 5 || yes || Introduces:
- URL index (previously only titles were indexed)
 
- MimeList Pos
|From date 2009-11-29
Until 6.3.1 (included)
|-
|5
|0
|yes
|Introduces:
- Notion of Major and Minor version
|From 3.2.0
|-
| rowspan="3" |6|| 0 || no  || Introduces extended clusters
Still uses [[ZIM file format old namespace|"old" namespaces]]
|From 3.2.0
|-
|                    1 || yes  || Introduces [[#Namespaces|"new" namespaces]] scheme
|From 7.0.0
|-
|                    2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)
|From 9.1.0
|}
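
The compatibility rule described above can be sketched as a small check. This is only an illustration of the stated rule (a different major version is treated as incompatible, a newer minor version as readable); the function name <code>check_compatibility</code> and its return values are hypothetical and do not describe how libzim actually decides.

```python
def check_compatibility(reader: tuple, archive: tuple) -> str:
    """Compare (major, minor) of a reader with (major, minor) of an archive."""
    reader_major, reader_minor = reader
    archive_major, archive_minor = archive
    if reader_major != archive_major:
        # An incompatible change bumps the major version, so a reader made for
        # another major version will probably not be able to read the archive.
        return "incompatible"
    if archive_minor > reader_minor:
        # Minor changes are compatible: the archive is readable, but it may use
        # features this reader does not know about.
        return "readable, newer minor"
    return "readable"


print(check_compatibility((6, 1), (6, 2)))  # prints "readable, newer minor"
```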


== MIME Type List (mimeListPos) ==
== MIME Type List (mimeListPos) ==
Line 162: Line 205:
| parameter || data || ||see parameter len|| (not used) extra parameters                         
| parameter || data || ||see parameter len|| (not used) extra parameters                         
|}
|}
None of the strings should contain control characters in the range U+0000 through U+001F.
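
A writer could verify this constraint before storing a string. The helper below is a minimal sketch; the name <code>has_control_characters</code> is hypothetical and the sample strings are made up.

```python
def has_control_characters(text: str) -> bool:
    """Return True if text contains any code point from U+0000 through U+001F."""
    return any(ord(ch) <= 0x1F for ch in text)


print(has_control_characters("Main Page"))   # prints "False"
print(has_control_characters("bad\ttitle"))  # prints "True" (tab is U+0009)
```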


=== Linktarget or deleted Entry (DEPRECATED) ===
=== Linktarget or deleted Entry (DEPRECATED) ===
Line 253: Line 298:
== URLs ==
== URLs ==


=== URL Encoding ===
=== URL Encoding in the ZIM ===
The URLs in the UrlPointerlist are utf-8 and are not url encoded (https://www.ietf.org/rfc/rfc1738.txt)
The URLs in the UrlPointerlist are encoded in UTF-8 and are '''not''' URL-encoded.
 
For instance, if you store in the ZIM an HTML document with an href pointing to `characters%20%C3%A9ncoding.html`, you have to store the corresponding ZIM entry at the URL `characters éncoding.html`.
 
Conversely, if you want to store a ZIM entry at `index.html?param=value`, the HTML document pointing to it has to use the href `index.html%3Fparam%3Dvalue`.
 
The reason is that libzim is agnostic about the kind of content stored and the kind of reader used. Everything around URL encoding belongs purely to HTTP / HTML / Web standards.
 
When serving web content (which is usually the case), some readers process the requests and already do the URL decoding internally, whereas most readers handle the URLs directly.
 
The same applies to the query string, which some web servers may absorb rather than pass on to libzim.


Some readers process the requests that already do the url decoding internally whereas most readers will handle the URLs directly. In this case you have to do the decoding before you pass the parameter to libzim.
In any case, the reader has to URL-decode the requested URL before passing it to libzim.
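
The two directions can be sketched with Python's standard <code>urllib.parse</code> module. The function names <code>href_to_entry_path</code> and <code>entry_path_to_href</code> are hypothetical; this only illustrates the decoding and encoding steps and is not code taken from libzim or from any particular reader.

```python
from urllib.parse import quote, unquote


def href_to_entry_path(href: str) -> str:
    """Decode a URL-encoded href into the raw UTF-8 path stored in the ZIM."""
    return unquote(href)  # percent-escapes are decoded as UTF-8 by default


def entry_path_to_href(path: str) -> str:
    """Encode a raw ZIM entry path for use as an href in an HTML document."""
    # "/" is kept as a path separator; "?" and "=" are escaped so that a
    # browser or web server does not treat them as a query string.
    return quote(path, safe="/")


print(href_to_entry_path("characters%20%C3%A9ncoding.html"))  # characters éncoding.html
print(entry_path_to_href("index.html?param=value"))           # index.html%3Fparam%3Dvalue
```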


=== Local Anchors ===
=== Local Anchors ===
Many articles - especially when a table of contents is used - use local anchors to jump within an article.   
Many HTML hrefs, especially when a table of contents is used, contain local anchors to jump within a document.


<pre>
<pre>
Line 265: Line 320:
</pre>
</pre>


The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.
When a web browser is used as a reader, it handles these local anchors itself, on the client side. The anchor is never sent to the web server, let alone to libzim. The browser determines by itself whether another ZIM entry has to be loaded (the local anchor is inside a document other than the one currently shown) and sends a request containing only the document URL without the local anchor - in our example "foo". After the document has been loaded, the browser searches for the local anchor tag and jumps to the right location.


If you use a common rendering engine or HTML widget you don't have to care about these cases, you can just use the requests as they are submitted by the engine / widget.
If you use a common rendering engine or HTML widget you don't have to care about these cases, you can just use the requests as they are submitted by the engine / widget.


Should you render the article contents by yourself you have to consider this and take care of it before you hand requests to libzim.
Should you render the article contents yourself, you have to consider this and take care of it before you hand requests over to libzim.
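
For a reader that renders content itself, the handling could look like the sketch below: the local anchor is split off first, then the remaining URL is decoded. The function name <code>split_request</code> and the anchor name "section" are hypothetical; only "foo" comes from the example above.

```python
from urllib.parse import unquote, urldefrag


def split_request(href: str) -> tuple:
    """Return (entry_path, anchor) for an href found in a rendered document."""
    # Split on the literal "#" before decoding, so that an escaped "%23"
    # stays part of the entry path instead of being mistaken for an anchor.
    url, fragment = urldefrag(href)
    return unquote(url), fragment


print(split_request("foo#section"))  # prints "('foo', 'section')"
```

Only the first element would be used to look up the entry; the second is kept for scrolling once the document is displayed.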


== Encodings ==
== Encodings ==