Zimdiff

From openZIM
Revision as of 18:10, 16 June 2013 by Kiran mathew 1993 (talk | contribs) (Added details of the proposed zimdiff tool.)

zimdiff is a proposed tool for generating incremental updates to large ZIM files. It will be written using the zimlib library and released under the terms of the GPLv2 license. Note that zimdiff is currently under development; its progress is tracked on the project's Bugzilla page.

This page discusses the details of the zimdiff tool.

The zimdiff tool will be used to generate a diff_file between two normal ZIM files. Let's call them start_file and end_file.

diff_file format

A diff_file will be a normal ZIM file with some additional data. It must contain all the data needed to satisfy: start_file + diff_file = end_file

Actions that need to be performed using the diff_file:
1. add
2. remove
3. update

The diff_file will store all articles that have to be added to the start_file. A list of these articles will be maintained in a metadata article. Another metadata article will contain a list of the articles to be removed from the start_file. For updating an article, there will be two options:
1. Store the new article among the list of articles to be added.
2. Store the diff (generated by a diff algorithm) between the old article and the new article in a separate article in the diff_file. A list of such diff articles will be maintained in metadata.

With this format, the diff_file stores the difference between the start_file and the end_file, and the zimpatch tool can apply it to the start_file to obtain the end_file.
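The relation start_file + diff_file = end_file can be illustrated with a small in-memory sketch. It does not use zimlib: `ArticleMap`, `DiffData` and `applyDiff` are hypothetical names introduced here, a ZIM file is modelled as a simple title-to-content map, and updated articles are assumed to be stored in full (option 1 above) rather than as diffs.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a ZIM file: title -> article content.
using ArticleMap = std::map<std::string, std::string>;

// Hypothetical in-memory stand-in for the contents of a diff_file.
struct DiffData
{
    ArticleMap added;                  // new articles, stored in full
    ArticleMap updated;                // changed articles, stored in full (option 1)
    std::vector<std::string> removed;  // titles to remove from the start_file
};

// Returns start + diff, leaving the start map untouched.
ArticleMap applyDiff(const ArticleMap& start, const DiffData& diff)
{
    ArticleMap result = start;

    // remove: drop every article listed for removal
    for (const std::string& title : diff.removed)
        result.erase(title);

    // update: overwrite the old content with the new content
    for (const auto& entry : diff.updated)
        result[entry.first] = entry.second;

    // add: insert the articles that exist only in the end_file
    for (const auto& entry : diff.added)
        result[entry.first] = entry.second;

    return result;
}
```

Storing updates as diffs (option 2) would replace the plain overwrite in the update loop with a call to whatever patch routine matches the chosen diff algorithm.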

Pseudocode for zimdiff

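#include <string>

// Per-article record used to compare start_file and end_file.
struct articleInfo
{
    std::string Title; // article title, used to match articles between the two files
    int hash;          // hash of the article, compared to detect changes
    int index;         // index of the article within its ZIM file
};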
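The page does not say how the hash field is computed. As one possibility only, the sketch below uses the 32-bit FNV-1a hash over the article content; the function name `contentHash` is hypothetical, and the result is an unsigned 32-bit value rather than the `int` used in articleInfo.

```cpp
#include <cstdint>
#include <string>

// 32-bit FNV-1a over the article content.
// An illustrative choice, not the hash zimdiff is specified to use.
std::uint32_t contentHash(const std::string& content)
{
    std::uint32_t hash = 2166136261u;  // FNV offset basis
    for (unsigned char byte : content)
    {
        hash ^= byte;                  // mix in the next byte
        hash *= 16777619u;             // multiply by the FNV prime (wraps mod 2^32)
    }
    return hash;
}
```

Any hash can collide, so two different articles may occasionally share a hash value; a tool that relies on hashes alone would treat such an article as unchanged.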

1. Start. Open start_file and end_file.
2. Parse start_file and end_file, adding the data of each article to a linked list of articleInfo objects (start_list and end_list).
3. Sort both lists by title (for faster searches later on).
4. For each articleInfo object in start_list, search end_list for an article with the same title.
5.   If no article is found, move the object to delete_list (the list of articles to be deleted).
6.   If an article is found, compare the hashes. If the hashes are the same, the article is unchanged. Delete the entry from both lists.
7.   If the hashes are different, the article has changed. Move the articleInfo object from end_list to update_list, and delete the entry from start_list.
8. Once all articles in start_list have been processed, all objects remaining in end_list are newly added articles. Add them directly to add_list.
9. Start writing the diff_file.
10. Add all articles in add_list to the diff_file. Create a list of these article titles (in XML format) and store it as metadata.
11. Add all articles in update_list to the diff_file. Create a separate article containing a list of the titles of these articles (in XML format) and store it as metadata.
12. Create a list of the titles of the articles in delete_list (in XML format). Store this article as metadata.
13. End.
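Steps 3 to 8 can be sketched in C++ as follows. This is a minimal illustration, not the zimdiff implementation: it uses sorted vectors and a single merge-style pass instead of linked lists with repeated searches, assumes that titles are unique within each file, and introduces the hypothetical names `DiffLists` and `classify`. Reading the ZIM files and writing the diff_file are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct articleInfo
{
    std::string Title;
    int hash;
    int index;
};

// Hypothetical container for the three lists built in steps 4 to 8.
struct DiffLists
{
    std::vector<articleInfo> add_list;
    std::vector<articleInfo> update_list;
    std::vector<articleInfo> delete_list;
};

// Takes both lists by value so the caller's copies are left unsorted and intact.
DiffLists classify(std::vector<articleInfo> start_list,
                   std::vector<articleInfo> end_list)
{
    auto byTitle = [](const articleInfo& a, const articleInfo& b) {
        return a.Title < b.Title;
    };
    // Step 3: sort both lists by title.
    std::sort(start_list.begin(), start_list.end(), byTitle);
    std::sort(end_list.begin(), end_list.end(), byTitle);

    DiffLists out;
    std::size_t s = 0;
    std::size_t e = 0;
    // Steps 4 to 7: walk both sorted lists in parallel.
    while (s < start_list.size() && e < end_list.size())
    {
        const articleInfo& a = start_list[s];
        const articleInfo& b = end_list[e];
        if (a.Title < b.Title)
        {
            out.delete_list.push_back(a);      // only in start_file
            ++s;
        }
        else if (b.Title < a.Title)
        {
            out.add_list.push_back(b);         // only in end_file
            ++e;
        }
        else
        {
            if (a.hash != b.hash)
                out.update_list.push_back(b);  // same title, changed content
            ++s;
            ++e;
        }
    }
    // Leftovers: removed articles from start_list, new articles from end_list (step 8).
    for (; s < start_list.size(); ++s)
        out.delete_list.push_back(start_list[s]);
    for (; e < end_list.size(); ++e)
        out.add_list.push_back(end_list[e]);
    return out;
}
```

Because both lists are sorted, the merge pass visits each entry once, so the cost after sorting is linear in the total number of articles.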