Difference between revisions of "ZIM file format"

1,009 bytes added ,  10:39, 17 October 2010
no edit summary
Line 47: Line 47:
| ...                    ||  string ||  ... ||  ... || ...
| ...                    ||  string ||  ... ||  ... || ...
|-
|-
| <last entry / end>    ||  string ||  n/a ||    0 || empty string - zero terminated
| <last entry / end>    ||  string ||  n/a ||    0 || empty string - end of MIME type list - zero terminated
|}
|}


Line 147: Line 147:
! Field Name            !! Type    !!Offset!!Length!! Description
! Field Name            !! Type    !!Offset!!Length!! Description
|-
|-
| <1st Cluster>          || integer ||    0 ||    8 || Pointer to the <1st Cluster>
| <1st Cluster>          || integer ||    0 ||    8 || pointer to the <1st Cluster>
|-
|-
| <1st Cluster>          || integer ||    8 ||    8 || Pointer to the <2nd Cluster>
| <1st Cluster>          || integer ||    8 ||    8 || pointer to the <2nd Cluster>
|-
|-
| <nth Cluster>          || integer ||(n-1)*8||  8 || Pointer to the <nth Cluster>
| <nth Cluster>          || integer ||(n-1)*8||  8 || pointer to the <nth Cluster>
|-
|-
| ...                    || integer || ...  ||    8 || ...
| ...                    || integer || ...  ||    8 || ...
Line 157: Line 157:


== Clusters ==
== Clusters ==
The clusters contain the actual article data. This file section contains a list of clusters, each of which contains a list of blobs. A blob is the data of one specific article, so a blob is addressed by the cluster number and the blob number within that cluster. The cluster number is used to look up the file offset in the cluster pointer list.
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed together inside one cluster, which makes the compression much more efficient. Clusters typically have a size of about 1 MB.


The cluster has a starting byte, which indicates which compression is used. After this byte, all other data is compressed. Possible values are:
The first byte of the cluster identifies whether it is compressed. The default is uncompressed, indicated by a value of 0 or 1 (obsolete, inherited from Zeno); compressed clusters are indicated by a value of 4, which denotes LZMA2 compression. Other compression algorithms were used previously (2: zlib, 3: bzip2), but they have been removed. The zimlib uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://github.com/abartov/LZMA2-java LZMA2-java].
* 0 default (no compression)
* 1 none also no compression (inherited from zeno)
* 2 zlib
* 3 bzip2
* 4 lzma2 (default compression format)


Support for zlib and bzip2 is deprecated. By default it is not compiled into the library any more. Only lzma2 is used. The zimlib uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://github.com/abartov/LZMA2-java LZMA2-java].
To locate the data of a specific directory entry, the uncompressed cluster contains, directly after the first byte, a list of pointers to the blobs within that cluster.


The data area has a list of 4 byte offsets to the blobs counting from the first offset. The offset addresses uncompressed data. The last pointer points to the end of the data area. So there is always one more offset than blobs. Since the first offset points to the start of the first data, the number of offsets can be determined by dividing this offset by 4. The size of one blob is calculated by the difference of two consecutive offsets.
{|{{Prettytable}}
! Field Name            !! Type    !!Offset!!Length!! Description
|-
| compression type      || integer ||    0 ||    1 || 0: default (no compression), 1: none (inherited from Zeno), 4: LZMA2 compressed
|-
|colspan=5| If the cluster is compressed, the following data bytes have to be decompressed before they can be read!
|-
| <1st Blob>            || integer ||    1 ||    4 || pointer to the <1st Blob>
|-
| <2nd Blob>            || integer ||    5 ||    4 || pointer to the <2nd Blob>
|-
| <nth Blob>            || integer ||(n-1)*4+1|| 4 || pointer to the <nth Blob>
|-
| ...                    || integer || ...  ||    4 || ...
|-
| <last blob / end>      || integer ||  n/a ||    4 || pointer to the end of the cluster
|-
| <1st Blob>            || data    ||  n/a  || n/a || data of the <1st Blob>
|-
| <2nd Blob>            || data    ||  n/a  || n/a || data of the <2nd Blob>
|-
| ...                    || data    || ...  ||  n/a || ...
|}
 
The offsets address uncompressed data. The last pointer points to the end of the data area, so there is always one more offset than there are blobs. Since the first offset points to the start of the first blob's data, the number of offsets can be determined by dividing the first offset by 4. The size of a blob is the difference between two consecutive offsets.
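
The offset arithmetic described above can be sketched in Python. This is only an illustrative sketch, not the zimlib implementation: the function name <code>parse_cluster</code> is hypothetical, the 4-byte offsets are assumed to be little-endian, offsets are assumed to count from the first byte after the compression type, and compressed clusters are assumed to be in a container that Python's <code>lzma</code> module can detect automatically.

```python
import lzma
import struct
from typing import List


def parse_cluster(raw: bytes) -> List[bytes]:
    """Split the raw bytes of one cluster into its blobs (hypothetical helper)."""
    compression = raw[0]
    if compression in (0, 1):
        # 0: default (no compression), 1: none (inherited from Zeno)
        data = raw[1:]
    elif compression == 4:
        # 4: LZMA2; assumes a stream format that lzma.decompress auto-detects
        data = lzma.decompress(raw[1:])
    else:
        raise ValueError(f"unsupported compression type {compression}")

    # The first offset points to the start of the first blob, so dividing it
    # by 4 gives the number of offsets (assumption: little-endian integers).
    (first_offset,) = struct.unpack_from("<I", data, 0)
    count = first_offset // 4
    offsets = struct.unpack_from(f"<{count}I", data, 0)

    # There is one more offset than blobs; the size of a blob is the
    # difference between two consecutive offsets.
    return [data[offsets[i]:offsets[i + 1]] for i in range(count - 1)]
```

Blob number ''k'' of a cluster is then <code>parse_cluster(raw)[k]</code>, where <code>raw</code> holds the bytes found at the file offset taken from the cluster pointer list.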


== Namespaces ==
== Namespaces ==
Namespaces separate different types of data stored in the ZIM File Format.
Namespaces separate different types of directory entries, which might have the same title, stored in the ZIM File Format.


They can be distinguished by prepending the article namespace to the article name in the URL path, e.g. ''http://localhost/A/Articlename''.
They can be distinguished by prepending the article namespace to the article name in the URL path, e.g. ''http://localhost/A/Articlename''.
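
As a small illustration of this URL convention, the namespace can be recovered by splitting the path at its first slash. The helper name <code>split_namespace</code> is hypothetical, and the sketch assumes the namespace is exactly the first path segment, as in the example URL above.

```python
from urllib.parse import urlsplit
from typing import Tuple


def split_namespace(url: str) -> Tuple[str, str]:
    """Return (namespace, article name) taken from the URL path (hypothetical helper)."""
    path = urlsplit(url).path.lstrip("/")
    # Assumption: the first path segment is the namespace, the rest is the name.
    namespace, _, name = path.partition("/")
    return namespace, name
```

For the example URL, this yields the namespace <code>A</code> and the article name <code>Articlename</code>.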
