Wiki2html

Dumping is REALLY time consuming! Depending on the wikipedia you want to prepare, this can take DAYS to WEEKS!

All benchmark results I present here were obtained on an Intel Core 2 Quad Q6600 overclocked to 3GHz.

Synopsis

You will import the wikipedia database snapshot into a local, correctly configured and patched mediawiki, and then dump everything onto your harddrive as a postgresql data dump containing optimized and stripped-down html.

install prerequisites

sudo apt-get install apache2 php5 php5-mysql mysql-server php5-xsl php5-tidy php5-cli subversion gij bzip2

or

yum install httpd php php-mysql mysql-server mysql-client php-xml php-tidy php-cli subversion java-1.5.0-gcj bzip2

apache2 is optional and only needed if you want to install via the web interface or want to check whether your data import looks correct.

get a local mediawiki running

check out the latest mediawiki into whatever folder your webserver of choice publishes and install everything you need to set mediawiki up on localhost.

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3 /var/www

delete the extensions dir and import the official extensions:

rm -rf /var/www/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions /var/www/extensions

optional: configure /etc/apache2/sites-enabled/000-default so that the mediawiki web setup loads when you access localhost. Go to http://localhost and finish the mediawiki install via the web installer. To be able to copy and paste the rest of this walkthrough, use the root account for mysql. You only have to fill in the values marked red. When everything works, proceed to the next step.
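
A minimal sketch of what that default site could look like, assuming apache 2.2 and the /var/www checkout from above (adapt the paths to your distribution's layout):

<VirtualHost *:80>
    ServerName localhost
    DocumentRoot /var/www
    <Directory /var/www>
        Options FollowSymLinks
        AllowOverride All
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>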

an easier setup is to configure mediawiki manually:

echo "CREATE DATABASE wikidb DEFAULT CHARACTER SET binary;" | mysql -u root

then import the table structure

mysql -u root wikidb < wikidb.sql

and put LocalSettings.${LANG}.php in place, then grant the wiki user the privileges mediawiki needs:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE TEMPORARY TABLES ON `wikidb`.* TO 'wikiuser'@'%';
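
The statement can be fed to mysql just like the CREATE DATABASE line above. If the wikiuser account does not exist yet, mysql 5.0 creates it on the fly when a password is supplied ('secret' below is only a placeholder for whatever password your LocalSettings uses):

echo "GRANT SELECT, INSERT, UPDATE, DELETE, CREATE TEMPORARY TABLES ON wikidb.* TO 'wikiuser'@'%' IDENTIFIED BY 'secret';" | mysql -u root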

configure/modify mediawiki

append to your LocalSettings.php

$wgLanguageCode = "${LANG}";
ini_set( 'memory_limit', 80 * 1024 * 1024 );
require_once( $IP.'/extensions/ParserFunctions/ParserFunctions.php' );
require_once( $IP.'/extensions/Poem/Poem.php' );
require_once( $IP.'/extensions/wikihiero/wikihiero.php' );
require_once( $IP.'/extensions/Cite/Cite.php' );
$wgUseTidy = true;
$wgExtraNamespaces[100] = "Portal"; #also to be changed according to your language
$wgSitename = "Wikipedia";

Edit AdminSettings.php and set mysql user and password so that you can run the maintenance scripts:

cp AdminSettings.sample AdminSettings.php
vim AdminSettings.php
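
The two settings to fill in are the database admin credentials for the maintenance scripts; with the root/no-password setup used throughout this walkthrough that boils down to:

$wgDBadminuser     = 'root';
$wgDBadminpassword = '';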

Patch the DumpHTML extension to produce correct output with MediawikiPatch:

patch -p0 < mediawikipatch.diff

You may also enable embedded LaTeX formulas as base64-encoded PNG images. Just follow these instructions: EnablingLatex

import wikipedia into your mediawiki install

get the template for huge databases

gunzip -c /usr/share/doc/mysql-server-5.0/examples/my-huge.cnf.gz > /etc/mysql/my.cnf

additionally set the following in /etc/mysql/my.cnf

[...]
[mysqld]
[...]
max_allowed_packet=16M
[...]
#log-bin=mysql-bin

and restart mysql-server
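
On Debian/Ubuntu that is typically

sudo /etc/init.d/mysql restart

while on Red Hat based systems the init script is usually called mysqld.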

check out the available dumps for your language at http://download.wikimedia.org/${WLANG}wiki/ (${WLANG} being de, en, fr and so on), then set the appropriate language and the desired dump timestamp as variables:

export WLANG=<insert your language code here>
export WDATE=<insert the desired timestamp YYYYMMDD>
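
For example, for the German wikipedia and a dump from early February 2009 this would look like the following (the timestamp is only an example, use whatever date the download page actually lists):

export WLANG=de
export WDATE=20090201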

clean existing tables:

echo "DELETE FROM page;DELETE FROM revision;DELETE FROM text;" | mysql -u root wikidb

add interwiki links

wget -O - http://download.wikimedia.org/${WLANG}wiki/${WDATE}/${WLANG}wiki-${WDATE}-interwiki.sql.gz | gzip -d | sed -ne '/^INSERT INTO/p' > ${WLANG}wiki-${WDATE}-interwiki.sql
mysql -u root wikidb < ${WLANG}wiki-${WDATE}-interwiki.sql

download and import database dump

wget http://download.wikimedia.org/${WLANG}wiki/${WDATE}/${WLANG}wiki-${WDATE}-pages-articles.xml.bz2
bunzip2 ${WLANG}wiki-${WDATE}-pages-articles.xml.bz2
wget http://download.wikimedia.org/tools/mwdumper.jar
java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 ${WLANG}wiki-${WDATE}-pages-articles.xml | mysql -u root wikidb
enwiki 52h
dewiki 10h
frwiki 7h
nlwiki 3h
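
If disk space is tight you can also skip the separate bunzip2 step and stream the compressed dump straight into mwdumper, which can read the XML from stdin when no file is given; a sketch with the same effect as the two-step variant above:

bzcat ${WLANG}wiki-${WDATE}-pages-articles.xml.bz2 | java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 | mysql -u root wikidb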

add category links

wget -O - http://download.wikimedia.org/${WLANG}wiki/${WDATE}/${WLANG}wiki-${WDATE}-categorylinks.sql.gz | gzip -d | sed -ne '/^INSERT INTO/p' > ${WLANG}wiki-${WDATE}-categorylinks.sql
mysql -u root wikidb < ${WLANG}wiki-${WDATE}-categorylinks.sql
enwiki 32h
dewiki 1.5h
frwiki 2.5h
jawiki 1h
nlwiki 0.25h

if you installed and configured apache, you can now access http://localhost and check whether everything is set up as desired.
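
Without apache a quick plausibility check also works on the command line, e.g. counting the imported pages directly in the database:

echo "SELECT COUNT(*) FROM page;" | mysql -u root wikidb -sN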

dump it all

get the maximum page id so you can estimate how best to split the work over your cores

echo "SELECT MAX(page_id) FROM page" | mysql -u root wikidb -sN

with a multicore setup you can dump with multiple parallel processes by giving each one its own start and end id (a small sketch of how to compute these follows the split list below). be aware that the first articles take longer than the later ones because they are generally bigger.

The following splits were found to be useful:

enwiki 1/32 4/32 11/32 16/32
dewiki 2/16 4/16 5/16 5/16
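
A small sketch of how such a split translates into actual dumpHTML calls, using the enwiki fractions above (MAXID, SUM, START and END are just throwaway shell variables; adjust the fractions and the dump folder to your setup, the dumpHTML invocation itself is explained below):

# print one dumpHTML command per split so each can be started on its own core
MAXID=$(echo "SELECT MAX(page_id) FROM page" | mysql -u root wikidb -sN)
START=1
SUM=0
for PART in 1 4 11 16; do
    SUM=$(( SUM + PART ))
    END=$(( MAXID * SUM / 32 ))
    echo "php extensions/DumpHTML/dumpHTML.php -d /folder/to/dump -s $START -e $END --interlang"
    START=$(( END + 1 ))
done

Each printed command can then be started in its own shell or screen session, one per core.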

how long it takes depends very much on your hardware. for example, my Core 2 Quad Q6600 @ 3GHz is overall four times faster at dumping Mokopedia than my old Athlon 64 X2 3600+ when using all four as opposed to two cores.

Dumping is also independent of the speed of the harddisk: even when dumping with a quadcore, the bottleneck is still the processor. So there is practically no speed loss when processes are run in parallel on every core.

php extensions/DumpHTML/dumpHTML.php -d /folder/to/dump -s <startid> -e <endid> --interlang
enwiki 193h
dewiki 14h
frwiki 16h
jawiki 8h
nlwiki 6h

create categories

php extensions/DumpHTML/dumpHTML.php -d /folder/to/dump --categories --interlang
enwiki 28h
dewiki 3h
frwiki 6h
jawiki 3h
nlwiki 1h

Appendix

for debian you might want to disable the database check on every boot; this can take ages with the german or english wikipedia. just comment out check_for_crashed_tables; in /etc/mysql/debian-start
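
Something along these lines should do it (a sketch; have a look at the file first, the exact content differs between mysql packages):

sed -i 's/check_for_crashed_tables;/#&/' /etc/mysql/debian-start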