Difference between revisions of "2009-11-23 Report Developers Meeting 2009-2"

Revision as of 20:19, 17 January 2010

4,979 bytes added , 20:19, 17 January 2010

→‎article metadata

Manuel Schneider

@@ Line 13: / Line 13: @@
 == Topics ==
 === better suitability for small devices ===
+Small devices are low on memory and don't have a powerful CPU. The OpenWrt team discovered a few problems when working with openZIM to display Wikipedia content:
+;HTML parsing overhead
+Using a full-blown HTML parser uses up a lot of resources. The available HTML engines are much more powerful than is needed on these small devices. But as content is stored in HTML format, using one of the available HTML engines is the logical way to go.
+In fact, on such small devices only very little markup is really needed: headlines, bold, italic and anchors/links.
+The idea of OpenWrt was to use a special markup for the content which is stable (HTML was considered unstable, as the standard changes once in a while) and much more reduced.
+After a long discussion we came up with the solution to stick with HTML (to give all features of Wikipedia to users on full-blown computers) but to use a special parser that ignores everything fancy in the markup and only renders the most necessary things. That way we would still have some overhead in the ZIM file for small devices due to unused (ignored) HTML code, but there would be no difference in efficiency.
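The "special parser" idea can be illustrated with a minimal sketch in Python. It is not the parser planned for zimlib. It assumes that a reduced renderer keeps only headline, bold, italic and anchor tags, drops every other tag while keeping its text, and discards the content of `script` and `style` elements. The names `MinimalRenderer` and `simplify` are hypothetical.

```python
import html
from html.parser import HTMLParser

# Tags this sketch keeps: headlines, bold, italic and anchors/links.
KEPT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "b", "i", "a"}
# Elements whose text content is dropped entirely (an assumption of this sketch).
SKIPPED_CONTENT = {"script", "style"}


class MinimalRenderer(HTMLParser):
    """Hypothetical reduced renderer: passes a few tags through, ignores the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_CONTENT:
            self._skip_depth += 1
        elif tag in KEPT_TAGS and not self._skip_depth:
            href = dict(attrs).get("href") if tag == "a" else None
            if href is not None:
                # Only the link target survives; all other attributes are ignored.
                self.parts.append('<a href="%s">' % html.escape(href, quote=True))
            else:
                self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTENT:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in KEPT_TAGS and not self._skip_depth:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip_depth:
            # Text was unescaped by the parser, so escape it again for output.
            self.parts.append(html.escape(data, quote=False))


def simplify(html_text):
    """Return html_text reduced to the small tag set above."""
    renderer = MinimalRenderer()
    renderer.feed(html_text)
    renderer.close()
    return "".join(renderer.parts)


sample = ('<div class="x"><h2>Title</h2><p>Some <b>bold</b> and '
          '<span style="c">plain</span> <a href="A/Foo" class="y">link</a></p>'
          '<script>var x=1;</script></div>')
print(simplify(sample))
```

The ignored markup still has to be scanned, which matches the remark above that some overhead remains in the ZIM file, but nothing beyond the small tag set has to be laid out or rendered.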
+;Memory Footprint / Caches
+As articles are clustered and stored in bigger compressed chunks, these clusters must not become too big, otherwise the memory available on small devices would be exhausted. The default cluster size is currently 1 MB; this is the optimal size, as the compression algorithms themselves use blocks of 1 MB to compress data.
+To reduce the memory footprint, the streaming mode of compression libraries offers a nice solution to read only those parts of a cluster that are needed. In streaming mode, reading starts from the beginning of a compressed data stream and all data is discarded until the position in the uncompressed data stream is reached where the requested content starts.
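The streaming approach can be sketched as follows. This is an illustration only, not zimlib code. It assumes Python's standard `lzma` module as the compression library (any library with a streaming decompressor works in the same way), and the function name `read_range` and the 4 KiB chunk size are hypothetical choices.

```python
import io
import lzma


def read_range(stream, offset, length, chunk_size=4096):
    """Return `length` uncompressed bytes starting at uncompressed `offset`.

    The compressed stream is decoded from its beginning in small pieces.
    Output before `offset` is discarded, and decoding stops as soon as the
    requested range is complete, so at most `chunk_size` bytes of
    uncompressed data are held at a time (plus the requested range).
    """
    decomp = lzma.LZMADecompressor()
    end = offset + length
    position = 0          # index of the next byte in the uncompressed stream
    wanted = bytearray()
    while position < end and not decomp.eof:
        if decomp.needs_input:
            data = stream.read(chunk_size)
            if not data:
                break     # compressed input ended early
        else:
            data = b""    # decoder still has buffered output to hand over
        out = decomp.decompress(data, max_length=chunk_size)
        # Keep only the part of this piece that overlaps [offset, end).
        lo = max(offset - position, 0)
        hi = min(end - position, len(out))
        if lo < hi:
            wanted += out[lo:hi]
        position += len(out)
    return bytes(wanted)


# Usage: a made-up "cluster" of 1000 tiny articles.
cluster = b"".join(b"article %d\n" % i for i in range(1000))
compressed = lzma.compress(cluster)
print(read_range(io.BytesIO(compressed), 5000, 20))
```

The trade-off is CPU time for memory: content near the end of a cluster costs a decode of everything before it, which is another reason to keep clusters small.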
 === more flexible MIME type list ===
+In prior versions of ZIM the MIME type is specified by an integer; the list of available MIME types is hard-coded in the zimlib.
+To be more flexible in the future, the hard-coded list will be replaced by a list of zero-terminated strings inside the ZIM data file. Therefore a mimeListPos field is added to the ZIM header to specify the position of this MIME type list inside the ZIM file.
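Reading such a list could look like the sketch below. The text above only states that the entries are zero-terminated strings located at mimeListPos. That an empty string marks the end of the list, that the strings are UTF-8, and the function name `read_mime_list` are assumptions made for illustration.

```python
def read_mime_list(data, mime_list_pos):
    """Collect zero-terminated MIME type strings starting at mime_list_pos.

    Assumption of this sketch: an empty string (two consecutive zero bytes)
    ends the list.
    """
    types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\0", pos)   # raises ValueError if no terminator exists
        if end == pos:                 # empty string: end of list
            break
        types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
    return types


# Usage with a made-up file image: 8 bytes of "header", then the list.
blob = b"HEADER.." + b"text/html\0image/png\0\0rest of file"
mime_types = read_mime_list(blob, 8)
print(mime_types)
```

A directory entry would then still carry a small integer, which is interpreted as an index into this per-file list instead of an index into a table compiled into the library.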
 === addressing articles; title vs. URL ===
+We had a discussion about the fact that there may be cases where the article names (= titles) are not represented by the URL of an article. Kiwix is currently doing that for a reason, changing all URLs to a kind of short hash key.
+The former idea was to just use the URL as the article identifier and add another field in the directory entry to define the title of a given article. But as we rely on a working poor man's search on small devices that do not use fulltext search but do a binary search on the article index, we decided to add another index. So each article will be referenced twice, in the titlePtrList (formerly indexPtrList) and in the new urlPtrList. Each list contains the same entries, but once ordered by title and once by URL.
+This means that the ZIM file header gets another field urlPtrPos to reference the start of the urlPtrList.
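The effect of the two lists can be shown with a small in-memory model. This is a sketch only: the real lists hold file positions of directory entries, whereas here they hold indices into a Python list, and `DirEntry`, `lower_bound`, `find_exact` and `find_by_prefix` are hypothetical names.

```python
from collections import namedtuple

DirEntry = namedtuple("DirEntry", ["url", "title"])

# Directory entries in arbitrary storage order; URLs are short hash-like keys.
entries = [
    DirEntry("e91", "Berlin"),
    DirEntry("07c", "Zurich"),
    DirEntry("a3f", "Bern"),
    DirEntry("5d2", "Amsterdam"),
]

# Both lists reference the same entries, once ordered by title, once by URL.
title_ptr_list = sorted(range(len(entries)), key=lambda i: entries[i].title)
url_ptr_list = sorted(range(len(entries)), key=lambda i: entries[i].url)


def lower_bound(ptr_list, field, wanted):
    """Binary search: first position whose entry's `field` is >= wanted."""
    lo, hi = 0, len(ptr_list)
    while lo < hi:
        mid = (lo + hi) // 2
        if getattr(entries[ptr_list[mid]], field) < wanted:
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_exact(ptr_list, field, wanted):
    """Return the entry whose `field` equals wanted, or None."""
    pos = lower_bound(ptr_list, field, wanted)
    if pos < len(ptr_list):
        entry = entries[ptr_list[pos]]
        if getattr(entry, field) == wanted:
            return entry
    return None


def find_by_prefix(prefix):
    """Poor man's search: all titles starting with `prefix`, in title order."""
    pos = lower_bound(title_ptr_list, "title", prefix)
    result = []
    while pos < len(title_ptr_list):
        entry = entries[title_ptr_list[pos]]
        if not entry.title.startswith(prefix):
            break
        result.append(entry.title)
        pos += 1
    return result


print(find_exact(url_ptr_list, "url", "a3f"))
print(find_by_prefix("Ber"))
```

With only a URL-ordered list, the prefix search on titles would require a linear scan over all directory entries, which is what the second index avoids.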
 === global metadata ===
+For libraries and publishing ZIM files it is important to add some metadata to them, to declare who the publisher is, where the content is derived from, when the file was originally created etc.
+For this we decided to use a new namespace M and put the metadata into single articles for each declaration. The metadata used as a standard (mandatory metadata) is a subset of the Dublin Core metadata.
+A description can be found at [[Metadata]].
 === article metadata ===
+Similar to global metadata, metadata for individual articles can be included. Devices and special readers will only use the actual article content, which is already stored in the A namespace. If needed, the ZIM creator can add individual metadata as a kind of template for each article in the B namespace, under the same name as the article. When the reader application initializes the ZIM interface in the zimlib, it can choose whether it wants to retrieve the pure article content from namespace A or the processed content of namespace B including the article content. This way maximum flexibility for the reader application is kept.
 === fulltext search ===
 === integer encoding ===
+Currently the integer compression from the [[Zeno File Format]], QUnicode, is used in ZIM as well. For the new indexes we want to look into alternatives that follow a standard; the UTF-8 encoding could be a good choice.
+The details, as well as the places where it is used, have to be settled during development. In general, all integer compression in ZIM will use the same method to be consistent.
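To make the UTF-8 option concrete, here is a sketch of how integers could be stored with the UTF-8 byte layout. It is not a decided ZIM format. It assumes the original 1 to 6 byte form of UTF-8, which covers values below 2^31, and the function names are hypothetical.

```python
# Exclusive upper bounds for values that fit in 2, 3, 4, 5 and 6 bytes.
_LIMITS = [(0x800, 2), (0x10000, 3), (0x200000, 4), (0x4000000, 5), (0x80000000, 6)]


def encode_utf8_int(value):
    """Encode 0 <= value < 2**31 using the UTF-8 byte layout."""
    if value < 0 or value >= 0x80000000:
        raise ValueError("value out of range for this sketch")
    if value < 0x80:
        return bytes([value])            # single byte, like ASCII
    for limit, size in _LIMITS:
        if value < limit:
            break
    out = bytearray(size)
    for i in range(size - 1, 0, -1):     # continuation bytes: 10xxxxxx
        out[i] = 0x80 | (value & 0x3F)
        value >>= 6
    lead = (0xFF << (8 - size)) & 0xFF   # `size` leading one bits, then a zero
    out[0] = lead | value
    return bytes(out)


def decode_utf8_int(data, pos=0):
    """Decode one integer at data[pos]; return (value, position after it)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1
    size, mask = 0, 0x80
    while first & mask:                  # count leading one bits
        size += 1
        mask >>= 1
    if size < 2 or size > 6 or pos + size > len(data):
        raise ValueError("invalid or truncated sequence")
    value = first & (mask - 1)           # payload bits of the lead byte
    for byte in data[pos + 1:pos + size]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value, pos + size


# Usage: small numbers take one byte, larger ones grow as needed.
for number in (5, 300, 70000):
    print(number, encode_utf8_int(number).hex())
```

For values that are valid Unicode code points the output is identical to that of an ordinary UTF-8 encoder, so existing, well-tested routines could be reused on many platforms.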
 === lzma compression ===
+A new compression method using the LZMA algorithm will be introduced. This has been planned for a long time, but for a long time there was no library. Now ''lzma-utils'' and ''xz-utils'' are available; they are still in development, but we want to give them a try now.
+We expect a much better compression ratio and faster decompression, which would improve usability on small devices, as LZMA focuses on easy decompression.
 === future planning ===
@@ Line 52: / Line 87: @@
 The team decided to keep Manuel as project lead, with the mandate to keep on giving talks, taking care of marketing and maintaining contacts between openZIM and other projects.
+[[Category:Press_Releases]]
