Difference between revisions of "Build your ZIM file"

From openZIM
(mwoffliner docker images moved from dockerhub to github container registry (try 2))
 
(69 intermediate revisions by 14 users not shown)
Older revision:

There is currently only one binary to build a ZIM file, the [[ZIMwriter]]. This binary uses a pre-filled Postgres database with a [http://svn.openzim.org/viewvc.cgi/trunk/zimwriter/db/zim-postgresql.sql?view=markup predefined schema]. For example:

<source lang="bash">
zimwriter -s 1024 --db "postgresql:dbname=mydb" my_zim_file
</source>

In the future, we want to provide a binary able to work with other inputs (for example, Wikimedia Foundation XML dumps or an HTML directory).

Two tools are currently able to create/fill this database.

== buildZimFileFromDirectory.pl ==
This [http://kiwix.svn.sourceforge.net/viewvc/kiwix/dumping_tools/scripts/buildZimFileFromDirectory.pl?view=markup script] is part of the [http://www.kiwix.org/index.php/Tools Kiwix tools] and builds a ZIM file from an HTML directory containing all necessary resources.

You need to:
# Check out the dumping tools: svn co http://kiwix.svn.sourceforge.net/viewvc/kiwix/dumping_tools/
# Install all necessary Perl modules
# Run the script as follows: ./buildZimFileFromDirectory.pl --htmlPath=./html [--indexerPath=./zimindexer] [--zimFilePath=articles.zim]

== [[Wiki2html]] ==
...

Latest revision:

[[File:Wikipedia-Book-creator.png|right|thumb|The ''[http://en.wikipedia.org/wiki/Special:Book Wikipedia Book Creator]'' is the easiest way to create custom ZIM files from Wikipedia]]
There are many possibilities to obtain your own ZIM file.

There are many reasons why you might want your own ZIM: the content is not already ZIMed, you want a special selection of the content, support for a new language, a custom UI inside the ZIM, ...

Solutions range from straightforward (but obviously limited) ones to complex (but mostly unlimited) ones.

== Foolish style ==
Not everyone has tech skills, and even when you have them, it is usually important to keep things as simple and straightforward as possible.

=== Request a ZIM file ===
If your content matches openZIM [[Content team#Publishing|publishing policies]], you may ask the Kiwix team to create a ZIM file for you.

The main limitation is that you have no control over when the ZIM will become available.
 
Kiwix does its best to create ZIMs in a timely manner, but because this is a free service, resources are limited.
 
It is also possible to pay Kiwix to create ZIMs, and in that case the service will of course be much quicker and more responsive.
 
To request such a ZIM, simply follow the process described in the [https://github.com/openzim/zim-requests/ zim-requests GitHub repository].
=== YouZimit ===
[https://zimit.kiwix.org zimit.kiwix.org] is a website where you can ask an automated system to create a ZIM of any website.
 
Once the ZIM is produced, a download link will be sent to your email address.
 
This automated system relies on the Zimit scraper, which is a very versatile solution to scrape any website.
 
Unfortunately, the Zimit scraper has some limitations, at least for now.
 
First, Zimit 1.x relies on a technology named Service Workers, which limits the readers capable of displaying the ZIM it produces (only kiwix-android, kiwix-serve and kiwix-js; note that kiwix-desktop on Windows and Linux can start a kiwix-serve for you).
 
Zimit 2.x (planned for 2024) should overcome this limitation, but it is not here yet.
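
When a reader cannot display such a ZIM directly, serving it over HTTP with kiwix-serve and opening it in a regular web browser is one way around the problem. The following is a minimal sketch, assuming kiwix-serve is installed and on the PATH. The function name, the default port and the KIWIX_SERVE override variable are illustrative choices, not part of any Kiwix tool.

```bash
#!/usr/bin/env bash
# Sketch: serve a ZIM file over HTTP so that a web browser can read it.
# Assumption: kiwix-serve is installed; KIWIX_SERVE is a hypothetical
# variable that only exists here to let you point at another binary.

serve_zim() {
  local zim_file="$1" port="${2:-8080}"

  # Fail early with a clear message instead of letting the server complain.
  if [ ! -f "$zim_file" ]; then
    echo "ZIM file not found: $zim_file" >&2
    return 1
  fi

  # Once running, browse to http://localhost:<port>/
  "${KIWIX_SERVE:-kiwix-serve}" --port="$port" "$zim_file"
}

# Usage (placeholder file name):
#   serve_zim ./my_website.zim 8080
```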
 
Second, Zimit is a scraper, and as such many websites will quickly identify it as a bot and refuse to serve any content or serve rubbish (e.g. a captcha).
 
Note also that, for fair usage, Youzim.it imposes some limits: your request will be stopped after 2 hours of processing or if the content retrieved exceeds 4 GB.
 
If your use case fits within all these limitations, it is clearly the quickest way to get a ZIM (even if the processing capacity is limited and your job might end up in a waiting queue for a few hours).
 
Note that, for now, advanced use of [https://zimit.kiwix.org zimit.kiwix.org] requires some technical skills and expert knowledge to configure the advanced options. This process should be enhanced in 2024 to provide more explanations and guide the user through the configuration.
 
== Ops style ==
If you have some "Ops" skills, mainly around using command-line tools, you can benefit from a range of existing tools to create your own ZIM on your own machine.
 
=== wget-2-zim ===
[https://github.com/ballerburg9005/wget-2-zim wget-2-zim] is a simple bash script based on wget, with some nifty tricks, that can be used to archive websites on the internet. It has only a very limited ability to deal with JavaScript (compared to Zimit, for instance) but works very well for websites composed mostly of HTML/CSS files. Unlike the Zimit 1.x scraper, it does not require Service Workers to read the resulting ZIM, which is hence compatible with all ZIM readers.
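
To give an idea of what a wget-based approach involves, here is a minimal sketch of mirroring a mostly static site into a local directory with standard wget options. It is not the code of wget-2-zim itself, which does considerably more. The function name, the default output directory and the WGET override variable are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: mirror a mostly static website into a local directory.
# Not the wget-2-zim implementation; only standard GNU wget options.
# WGET is a hypothetical override variable for the binary to call.

mirror_site() {
  local url="$1" out_dir="${2:-./mirror}"

  # --mirror            recursive download with timestamping
  # --page-requisites   also fetch images, CSS and scripts used by pages
  # --convert-links     rewrite links so the copy works offline
  # --adjust-extension  save HTML pages with an .html suffix
  # --no-parent         never climb above the starting URL
  "${WGET:-wget}" \
    --mirror \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --no-parent \
    --directory-prefix="$out_dir" \
    "$url"
}

# Usage (placeholder URL):
#   mirror_site https://example.org/ ./mirror
```

The resulting directory is plain HTML/CSS, which is the kind of input a directory-to-ZIM tool can then package.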
 
=== openZIM scrapers ===
Tools created and used by the openZIM/Kiwix teams are open source and hosted in the openZIM and Kiwix organizations on GitHub.
 
This means that every tool used to create the ZIMs available online in the Kiwix library is readily available for anyone to use.
 
==== MWoffliner ====
 
MWoffliner is a tool that "dumps" a Wikimedia project (Wikipedia, Wiktionary, ...) to local storage. It should also work for any [https://mediawiki.org MediaWiki] instance. It goes through all articles of the project (or a selection, if specified) and writes HTML/pictures to your local filesystem, either as plain HTML/JS/CSS/... files or in a ZIM file.
 
It is distributed via [https://www.npmjs.com/package/mwoffliner npm] and [https://github.com/openzim/mwoffliner/pkgs/container/mwoffliner Docker].
 
If you are a developer, you can download it directly from its [https://github.com/openzim/mwoffliner git repository].
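
As an illustration, a Docker-based run might look like the sketch below. It assumes Docker is installed and uses the container image linked above. The wiki URL, email address and output directory are placeholders, and the option names should be checked against <code>mwoffliner --help</code> for the version you run. The DOCKER override variable is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: run MWoffliner from its container image.
# Assumptions: Docker is available; option names match your MWoffliner
# version (verify with `mwoffliner --help`); DOCKER is a hypothetical
# override variable for the container runtime binary.

run_mwoffliner() {
  local mw_url="$1" admin_email="$2" out_dir="${3:-$PWD/out}"

  # The host directory is mounted as /out inside the container, and
  # MWoffliner is told to write its output there.
  "${DOCKER:-docker}" run --rm \
    -v "$out_dir:/out" \
    ghcr.io/openzim/mwoffliner \
    mwoffliner --mwUrl="$mw_url" --adminEmail="$admin_email" --outputDirectory=/out
}

# Usage (placeholder values):
#   run_mwoffliner https://en.wikipedia.org/ me@example.org "$PWD/out"
```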
 
=== zimwriterfs ===
zimwriterfs is a console tool that creates ZIM files from a locally stored directory containing "self-sufficient" HTML content (with pictures, JavaScript, stylesheets). The result will contain all the files of the local directory, compressed and merged into the ZIM file. Nothing more, nothing less. For now, zimwriterfs only works on POSIX-compatible systems. You simply need to compile it and run it. The software does not need a lot of resources, but creating a large ZIM file can take a while.
Instructions on how to prepare and use zimwriterfs are available at [[zimwriterfs_instructions]]; the source code lives in the [https://github.com/openzim/zimwriterfs zimwriterfs source code repository].
 
A virtual machine with zimwriterfs is provided [http://download.kiwix.org/dev/ZIMmaker.ova here].
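
The sketch below shows what a zimwriterfs run might look like, with a small pre-flight check that the directory is self-sufficient. It assumes a recent zimwriterfs release; option names have varied between releases (older ones took a favicon option instead of an illustration), so check <code>zimwriterfs --help</code>. The file names, metadata values and the ZIMWRITERFS override variable are placeholders chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch: package a self-sufficient HTML directory into a ZIM file.
# Assumptions: a recent zimwriterfs is installed; the directory holds
# index.html (welcome page) and illustration.png; all metadata values
# are placeholders; ZIMWRITERFS is a hypothetical override variable.

build_zim() {
  local html_dir="$1" zim_file="$2" f

  # The welcome page and illustration are given relative to the HTML
  # directory, so make sure both exist before starting.
  for f in index.html illustration.png; do
    if [ ! -f "$html_dir/$f" ]; then
      echo "Missing required file: $html_dir/$f" >&2
      return 1
    fi
  done

  "${ZIMWRITERFS:-zimwriterfs}" \
    --welcome=index.html \
    --illustration=illustration.png \
    --language=eng \
    --name=my_content \
    --title="My content" \
    --description="Offline copy of my content" \
    --creator="Me" \
    --publisher="Me" \
    "$html_dir" "$zim_file"
}

# Usage (placeholder paths):
#   build_zim ./my_html_directory my_content.zim
```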
 
== Devs style ==
If you have development skills, you can create your own tool to create a ZIM from your content. Such a tool is called a scraper, even though most of them do not "scrape" a website but use specific techniques such as APIs or exported databases.
 
The libzim library (the openZIM implementation of the ZIM specification, written in C++, to read and write ZIM files) has bindings available for several programming languages: Python, Node.js, Java.
 
Since most openZIM scrapers are written in Python, there is a python-scraperlib library providing higher-level functions to simplify common scraper tasks. There is also a dedicated [[How-to create a Python scraper]] page.
 
== Older tools ==
There are some older tools that might interest you. They are however not maintained anymore, sometimes even archived, and probably do not work anymore without at least some tinkering.
 
=== Zimbalaka ===
The following description is based on the notes published by the original author of Zimbalaka, as they're no longer available on the site they were published on. An archived copy is available on archive.org https://web.archive.org/web/20150531004251/http://www.arunmozhi.in:80/blog/zimbalaka-an-openzim-creator/#content
 
Zimbalaka is designed as a web-hosted tool that creates Wikipedia ZIM files from a selection of articles.
 
It accepts two types of inputs: a list of pages or a Wikipedia category. Then Zimbalaka downloads those pages, removes all the clutter such as: sidebars, toolbox, edit links, etc., and provides a cleaned version as a ZIM file for download. It can be opened in Kiwix, etc.
 
The ZIM is created with a simple welcome page with all the pages as a list of links.
 
Zimbalaka has multilingual and multi-site support. That is, you can create a ZIM file from pages of any language of the 280+ existing Wikipedias, and also from sites like WikiBooks, Wiktionary, Wikiversity and such. You can even input any custom URL like (<nowiki>http://sub.domain.com/</nowiki>); Zimbalaka would add (/wiki/Page_title) to it and download the pages.
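
The URL construction described above can be sketched as follows. This is an illustration of the idea rather than Zimbalaka's actual code: the function name is hypothetical, and replacing spaces with underscores is assumed here because that is the usual MediaWiki convention for page titles.

```bash
#!/usr/bin/env bash
# Sketch: build a page URL from a custom base URL and a page title.
# Not Zimbalaka's code; page_url is a hypothetical helper.

page_url() {
  local base="${1%/}"      # drop one trailing slash, if present
  local title="${2// /_}"  # MediaWiki titles use underscores for spaces
  printf '%s/wiki/%s\n' "$base" "$title"
}

page_url "http://sub.domain.com/" "Page title"
# prints http://sub.domain.com/wiki/Page_title
```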
 
==== Pain points ====
A small pain point is that Zimbalaka also strips the external references that occur at the end of Wikipedia articles, as the original author didn’t consider them useful in an offline environment.
 
You cannot add a custom Welcome page in the zim file. Not a very big priority. The current file does its work of listing all the pages.
 
You cannot include pages from multiple sites as a single zim file. The workaround is to create multiple files or use a tool called zimwriterfs, which has to be compiled from source (this is used by zimbalaka behind the scenes).
 
==== Developers ====
This tool is written using Flask (a simple Python web framework) for the backend and Bootstrap for the frontend, and uses the compiled zimwriterfs binary as the workhorse. The zimming tasks are run by Celery, which is managed by supervisord. All the coordination and message passing happens via Redis.
 
[https://github.com/tecoholic/Zimbalaka Here is the source code].
 
=== zimwriterdb ===
[[zimwriterdb]] is part of the openZIM project. This binary uses a pre-filled Postgres database and creates the corresponding ZIM file; the schema for the database is linked on the main zimwriterdb page.
 
=== Wiki2html ===
[[Wiki2html]] can be used to prepare static HTML files from a running Mediawiki instance.
 
=== zimmer ===
The [https://github.com/vss-devel/zimmer zimmer] package creates a ZIM dump from a MediaWiki-based wiki. This package is relatively easy to install and supports both old and new versions of MediaWiki. It is an alternative of sorts to MWoffliner.
 
The package consists of two Node.js scripts:
* ''wikizimmer.js'' -- creates static HTML files from the wiki's articles. It requires public access to both the normal web interface and the wiki's API. Unlike MWoffliner, this script does not require Redis.
* ''zimmer.js'' -- creates a ZIM file from the static HTML files (without requiring libzim).


== See also ==
* [[Publish your ZIM File]]
* [[Bindings]]
* [[Readers]]

Latest revision as of 04:13, 19 August 2024
