2009-11-23 Report Developers Meeting 2009-2

From openZIM

The second openZIM Developers Meeting took place from November 20th to 22nd, 2009.

Participants

  1. Tommi Mäkitalo (tntnet)
  2. Emmanuel Engelhart (Kiwix)
  3. Tomasz Finc (Wikimedia Foundation)
  4. Mirko Lindner (Qi Hardware)
  5. Mirko Voigt (OpenWrt)
  6. Pascal Martin (Linterweb)
  7. Guillaume Duhamel (Linterweb)
  8. Manuel Schneider (Wikimedia CH)

Topics

better suitability for small devices

Small devices are low on memory and don't have a powerful CPU. The OpenWrt team discovered a few problems when working with openZIM to display Wikipedia content:

HTML parsing overhead

Using a full-blown HTML parser consumes a lot of resources. The available HTML engines are far more powerful than these small devices need. But since the content is stored as HTML, using one of the available HTML engines is the obvious way to go.

In fact, such small devices really need only very little markup: headlines, bold, italic and anchors/links.

OpenWrt's idea was to use a special, much reduced markup for the content that is also stable (HTML was considered unstable because the standard changes once in a while).

After a long discussion we decided to stick with HTML (so that users on full-blown computers get all features of Wikipedia) but to use a special parser that ignores everything fancy in the markup and renders only the most necessary things. That way the ZIM file would still carry some overhead for small devices in the form of unused (ignored) HTML code, but this should make no difference in efficiency.
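
The reduced-parser idea can be illustrated with a minimal sketch in Python, built on the standard library's `html.parser`. It is not the parser discussed at the meeting: the class name `ReducedRenderer`, the helper `render_reduced`, the event tuples and the exact tag list are assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Markup the reduced renderer keeps (assumption derived from the list above:
# headlines, bold, italic, anchors). All other tags are ignored.
KEPT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "b", "i", "a"}
# Elements whose text content is dropped entirely.
SKIPPED_CONTENT = {"script", "style"}


class ReducedRenderer(HTMLParser):
    """Hypothetical reduced parser: collects (kind, value) events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_CONTENT:
            self._skip_depth += 1
        elif tag in KEPT_TAGS and self._skip_depth == 0:
            href = dict(attrs).get("href") if tag == "a" else None
            self.events.append(("open", tag if href is None else f"a:{href}"))

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in KEPT_TAGS and self._skip_depth == 0:
            self.events.append(("close", tag))

    def handle_data(self, data):
        # Whitespace-only text runs are dropped to keep the sketch short.
        if self._skip_depth == 0 and data.strip():
            self.events.append(("text", data))


def render_reduced(html):
    """Parse HTML and return only the events a small device would render."""
    parser = ReducedRenderer()
    parser.feed(html)
    parser.close()
    return parser.events
```

Tags such as `div`, `p` or `table` produce no events at all, so the unused markup costs space in the file but needs no rendering logic on the device.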

Memory Footprint / Caches

As articles are clustered and stored in bigger compressed chunks, these clusters must not become too big, otherwise the memory available on small devices would be exhausted. The cluster size currently defaults to 1 MB; this is the optimal size, as the compression algorithms themselves use blocks of 1 MB to compress data.

To reduce the memory footprint, the streaming mode of compression libraries offers a nice solution: only those parts of a cluster that are needed have to be kept. In streaming mode, reading starts at the beginning of a compressed data stream, and all data is discarded until the position in the uncompressed data stream is reached where the requested content starts.
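
The streaming approach can be sketched as follows. This is only an illustration in Python using the standard `zlib` module as a stand-in compressor; the function name `read_range` and its parameters are hypothetical, and zimlib itself may work differently.

```python
import zlib


def read_range(compressed, offset, length, chunk_size=4096):
    """Return uncompressed bytes [offset, offset + length) of a zlib stream.

    Output is produced in pieces of at most chunk_size bytes, and everything
    before `offset` is thrown away, so the whole cluster is never held in
    memory in uncompressed form.
    """
    decomp = zlib.decompressobj()
    end = offset + length
    position = 0          # index into the uncompressed stream
    wanted = bytearray()

    def take(piece):
        nonlocal position
        lo = max(offset - position, 0)
        hi = min(end - position, len(piece))
        if lo < hi:
            wanted.extend(piece[lo:hi])
        position += len(piece)

    for start in range(0, len(compressed), chunk_size):
        pending = compressed[start:start + chunk_size]
        while pending and position < end:
            # max_length caps the output; leftover input is kept in
            # unconsumed_tail and fed again on the next iteration.
            take(decomp.decompress(pending, chunk_size))
            pending = decomp.unconsumed_tail
        if position >= end:
            return bytes(wanted)   # stop early, the rest is not needed
    take(decomp.flush())
    return bytes(wanted)
```

The cost is CPU time: the further into the cluster the requested article starts, the more data has to be decompressed and discarded first.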

more flexible MIME type list

In prior versions of ZIM the MIME type is specified by an integer; the list of available MIME types is hard-coded in the zimlib.

To be more flexible in the future, the hard-coded list will be replaced by a list of zero-terminated strings inside the ZIM data file. A mimeListPos field is therefore added to the ZIM header to specify the position of this MIME type list inside the ZIM file.
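
Reading such a list could look roughly like the Python sketch below. The function name `read_mime_list` is hypothetical, and two details are assumptions that the text above does not settle: that the list ends with an empty string, and that the strings are UTF-8.

```python
def read_mime_list(data, mime_list_pos):
    """Read zero-terminated MIME type strings starting at mime_list_pos.

    Assumption: an empty string (a lone zero byte) marks the end of the list.
    """
    mime_types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\x00", pos)  # raises ValueError if unterminated
        if end == pos:                  # empty string: end of list
            break
        mime_types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
    return mime_types
```

An article's MIME type integer would then simply be an index into the returned list.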

addressing articles; title vs. URL

We discussed the possibility that article names (= titles) are not represented by the URL of an article. Kiwix currently does this for a reason, changing all URLs to a kind of short hash key.

The former idea was to use just the URL as the article identifier and to add another field to the directory entry that defines the title of a given article. But since we rely on a working poor man's search on small devices that do not use fulltext search but do a binary search on the article index, we decided to add another index. So each article will be referenced twice: in the titlePtrList (formerly indexPtrList) and in the new urlPtrList. Both lists contain the same entries, once ordered by title and once by URL.

This means that the ZIM file header gets another field urlPtrPos to reference the start of the urlPtrList.
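
The effect of keeping two pointer lists over the same entries can be shown with a small in-memory sketch in Python. The names `Entry`, `build_ptr_list` and `find` are hypothetical; in a real ZIM file the lists hold file offsets rather than list indices.

```python
from collections import namedtuple

# Hypothetical stand-in for a directory entry.
Entry = namedtuple("Entry", "url title")


def build_ptr_list(entries, key):
    """Return entry indices ordered by key (by title or by URL)."""
    return sorted(range(len(entries)), key=lambda i: key(entries[i]))


def find(entries, ptr_list, key, wanted):
    """Binary search through ptr_list; return the entry index or None."""
    lo, hi = 0, len(ptr_list)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(entries[ptr_list[mid]]) < wanted:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(ptr_list) and key(entries[ptr_list[lo]]) == wanted:
        return ptr_list[lo]
    return None
```

With a `title_ptr_list` and a `url_ptr_list` built this way, a reader can look up an article by title or by URL in logarithmic time without a fulltext index, at the price of one extra pointer per article.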

global metadata

For libraries and publishing ZIM files it is important to add some metadata to them, to declare who the publisher is, where the content is derived from, when the file was originally created etc.

For this we decided to use a new namespace M and to put each metadata declaration into its own article. The standard (mandatory) metadata is a subset of the Dublin Core metadata.

A description can be found at Metadata.

article metadata

Similar to global metadata, metadata for individual articles can be included. Devices and special readers will use only the actual article content, which is already stored in the A namespace. If needed, the ZIM creator can add individual metadata as a kind of template for each article in the B namespace, under the same name as the article. When the reader application initializes the ZIM interface in the zimlib, it can choose whether to retrieve the pure article content from namespace A or the processed content of namespace B, which includes the article content. This keeps maximum flexibility for the reader application.

fulltext search

integer encoding

Currently the integer compression from the Zeno file format, QUnicode, is used in ZIM as well. For the new indices we want to look into alternatives that follow a standard; the UTF-8 encoding could be a good choice.

The details, as well as the places where it is used, have to be settled during development. In general, all integer compression in ZIM will use the same method to be consistent.
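
To show what a UTF-8-style integer encoding looks like, here is a sketch in Python. It follows the original UTF-8 bit layout with up to six bytes (values below 2^31); it is not the encoding ZIM settled on, which was still undecided at this point, and the names `encode_uint` and `decode_uint` are hypothetical.

```python
def encode_uint(n):
    """Encode 0 <= n < 2**31 using the original UTF-8 bit layout (1-6 bytes)."""
    if n < 0 or n >= 1 << 31:
        raise ValueError("value out of range")
    if n < 0x80:
        return bytes([n])
    # Upper bounds for 2, 3, 4, 5 and 6 byte sequences.
    limits = [0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for extra, limit in enumerate(limits, start=1):
        if n < limit:
            break
    tail = []
    for _ in range(extra):            # continuation bytes: 10xxxxxx
        tail.append(0x80 | (n & 0x3F))
        n >>= 6
    lead_prefix = (0xFF << (7 - extra)) & 0xFF  # extra + 1 leading one bits
    return bytes([lead_prefix | n] + tail[::-1])


def decode_uint(data, pos=0):
    """Decode one integer at pos; return (value, position after it)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1
    extra, mask = 0, 0x40
    while first & mask:               # count continuation bytes
        extra += 1
        mask >>= 1
    if extra == 0 or extra > 5 or len(data) < pos + 1 + extra:
        raise ValueError("invalid or truncated sequence")
    value = first & (mask - 1)
    for byte in data[pos + 1:pos + 1 + extra]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value, pos + 1 + extra
```

Small values, which dominate in pointer and offset deltas, take a single byte, and the lead byte alone tells the reader how many bytes follow.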

lzma compression

A new compression method using the LZMA algorithm will be introduced. This has been planned for a long time, but until recently no library was available. Now lzma-utils and xz-utils are available; they are still in development, but we want to give them a try now.

We expect a much better compression ratio and faster decompression, which would improve usability on small devices, as LZMA focuses on easy decompression.

future planning

Developers Meetings

As in-person meetings are vital for the project, we think that at least two meetings per year would be helpful. The next meeting should take place around April 2010.

As always, the location and organisation are open to everyone's ideas; the planning page is already open at Developers Meeting/2010-1.

For this meeting the plan was to rent an apartment big enough to accommodate all participants, with internet access, a meeting room and a kitchen for cheap on-site catering. It was great that the number of participants increased that much, but we were not able to find a suitable apartment for that many people. As an alternative we wanted to rent a conference venue with full service. Even though we got good offers, it was still too much to fit into our budget. So we stuck with the cheap self-made solution.

Marketing

As the target group of LinuxTag overlaps considerably with ours but is not the one we are aiming at, we are unsure whether we should participate in LinuxTag 2010. We will decide when the time comes, as this mainly depends on volunteers offering support to run a booth.

Peering with other groups that have a similar mission is considered to be more fruitful. If possible we would like to see openZIM presented at SkoleLinux, Linux4Africa, OLPC and of course Wikimania. If someone has contacts to these groups or is willing to participate in conferences and give talks about openZIM, please go ahead and get in touch with us.

Budget

As described above, we are planning to increase our budget to be more flexible in how we organise the Developers Meetings. As the team and the expectations of the project grow, we need to professionalise our organisation. Especially as we are all volunteers, we have to reduce all work that does not directly support the development of ZIM and openZIM software.

The plan is to budget for two Developers Meetings plus participation in two other conferences, based on our experience and the offers we got this year. These calculations will be the basis for the budget we want to request from Wikimedia CH.

The budget will also include costs for the openzim.org domain and server hosting, but these make up only a very small percentage (about 250 EUR per year).

project lead

Manuel offered his position as project lead to whoever might be interested in taking it on, especially as he feels committed to the project and has given some talks, but does not actively work on the development.

The team decided to keep Manuel as project lead, with the mandate to keep giving talks, taking care of marketing and maintaining contacts between openZIM and other projects.