Libzim

From openZIM
Revision as of 19:26, 11 January 2013 by Kelson (talk | contribs)
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

ZIMlib is the library which implements the access to ZIM files. Use libzim in your own software - like reader applications - to make them ZIM-capable without the need having to dig too much into the ZIM file format.

Programming

Introduction

zimlib is written in C++. To use the library, you need the include files of zimlib have to link against libzim. Both are installed when zimlib is built with the normal "./configure; make; make install".

Errors are handled with exceptions. When something goes wrong, zimlib throws an error, which is always derived from std::exception.

All Classes are defined in the namespace zim. Copying is allowed and tried to make as cheap as possible. The library is not thread safe by itself. You have to serialize access to the class yourself.

The main class, which accesses the file is zim::File. It has actually a reference to a implementation, so that copies of the class just references the same file. You open a file by passing the file name to the constuctor as a std::string.

The API tries to resemble the standard C++ library, so that a zim::File works like a container of instances of zim::Article. It has a const_iterator, which is created using zim::File::begin(). Dereferencing the iterator gives the zim::Article. The iterator may be incremented to point to the next article until it reaches zim::File::end(). Iterators pointing to that must not be dereferenced nor incremented.

When the iterator is created using zim::File::beginByTitle(), the articles are ordered by title. Otherwise the url field is used.

Sample
List all articles titles inside a ZIM file
#include <zim/file.h>
#include <zim/fileiterator.h>
#include <iostream>

int main(int argc, char* argv[]) 
{
  try
  {
    zim::File f("wikipedia.zim");
  
    for (zim::File::const_iterator it = f.begin(); it != f.end(); ++it)
    {
      std::cout << "url: " << it->getUrl() << " title: " << it->getTitle() << '
';
    }
  }
  catch (const std::exception& e)
  {
    std::cerr << e.what() << std::endl;
  }
}

You may save that file under the name "zimlist.cpp" and compile using the command: g++ -o zimlist -lzim zimlist.cpp. You get a program, which lists the urls and titles of the file named wikipedia.zim. Of course it is better to pass that name as a parameter in argc/argv. But this should be an easy task for you, so I do not show that.

In subsequent examples I show only the code needed to use the library. The main-function with the error catcher should always be in place.

Finding articles

Articles are addressed either by index or by namespace and url or title. The index is normally not that useful. So let us look how to find a specific article.

zim::File has methods find and findByTitle. Both take 2 parameters. A char for the namespace (which is normally 'A' for articles) and a string, which specifies the url (in find) or the title (in findByTitle). It returns a const_iterator pointing to the lexicographically next article. Be aware, that the returned iterator may point to end(), so you should check that, before dereferencing the iterator. I should add, that both - the url and title - are case sensitive.

Sample
find a article by title and print the content:
zim::File::const_iterator it = file.findByTitle('A', "Wikipedia");
if (it == file.end())
  throw std::runtime_error("article not found");
if (it->isRedirect())
  std::cout << "see: " << it->getRedirectArticle().getTitle() << std::endl;
else
  std::cout << it->getData();

Incrementing the iterator iterates through the file using url or title order, depending, which method created the iterator.

The method zim::Article::getData() returns the actual data as a instance of zim::Blob. This class has a method data(), which returns a pointer (const char*) to the begin of the data and size(), which returns the size of the article. Be aware, that the data is not zero terminated. Zim files can contain binary data like images, which may have zero bytes in the data.

The data is valid as long as the article is valid.

If you really want zero terminated data since you know, that it do not contain zero bytes, you may use the Blob to create a std::string and use the c_str()-method.

To make life easier, a ostream-operator for the zim::Blob is implemented.

Sample
get zero-terminated string out of the article data:
zim::Article article = ...; // get the article somewhere
zim::Blob blob = article.getData();
std::string stringdata = std::string(blob.data(), blob.size());
const char* zptr = stringdata.c_str();  // c_str() guarantees, that the pointer points to zero terminated data

find and findByTitle return the lexicographically next article. If you whan to know, if you really got exactly the article you requested, you may either compare the url (or title) to the parameter or use findx or findxByTitle. Both return a std::pair, with the first element a bool, which is true, if a exact match was found and false otherwise. The second element is the actual iterator, as returned by find or findByTitle. The latter are actually only wrappers around the findx-methods throwing away the flag. If the flag is true, the iterator does point to a article. It is not necessary to check against end() any more.

Sample
find the exact article by url
std::pair<bool, zim::File::const_iterator> it = file.findxByTitle('A', "Wikipedia");
if (!it.first)
  throw std::runtime_error("article not found");
if (it->isRedirect())
  std::cout << "see: " << it->getRedirectArticle().getTitle() << std::endl;
else
  std::cout << it->getData();

using the Search-class

The class zim::Search adds 2 search features to zimlib. Both fill a result object of type zim::Search::Results. This result object is actually a std::vector of a zim::SearchResult. The zim::SearchResult holds a article and a priority. Since copying large result sets is quite expensive, all search methods of this expects a reference to a existing zim::Search::Results, which is filled.

The simpler searches articles by title either by a praefix or by a range of titles. You pass a namespace and either a single std::string and you get all articles, whose title starts with that string or you pass two strings and you get all articles whose title is between these strings (including the passed strings).

The method zim::Search::search implements a full text search. For that it needs the full text index file. The full text index file is also a zim file, so you pass the index file as second paramter of type zim::File. The search-method is then called with the SearchResults-reference and a string, containing a space delimited list of words. The full text search then returns a list of articles where at least one of the words is found. Each article is weighted and the result is sorted by weight starting with the highest weighted article.


Methods

zim::file::const_iterator = zim::file::findx(char ns, const std::string& title)

returns an iterator to the article named title in namespace ns

zim::file::const_iterator = zim::file::findx(const std::string& url)

returns an iterator to the article with the (non-encoded) URL url

std::string = zim::urldecode(const std::string& url)

returns an URL-decoded string for use in findx()

Requirements

If you compile the software from source you need the following libraries installed:

  • libxz-dev
  • autoconf
  • automake
  • libtool