Revision as of 13:31, 23 May 2023

Zimit is a tool for creating a ZIM file of "any" Web site.

Context

openZIM provides many dedicated scraper software solutions for specific sources of content: TED, Wikipedia (MediaWiki), Project Gutenberg, ... These are great solutions for producing quality ZIM files, but developing and maintaining each of them is costly.

Zimit is our approach to scraping a "random" Web site and getting an acceptable snapshot that can be used offline.

One important point is that site-specific embedded JavaScript code, in particular code to play videos, continues to work.

Principle

The principles of Zimit are:

  • Crawl the remote WebSite to retrieve all the necessary content
  • Save all the retrieved content in WARC file(s)
  • Convert the WARC file(s) to one ZIM file (this implies embedding a reader in the ZIM file, so the result is a kind of offline Web App)
  • Read the ZIM file in any Kiwix reader
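As a sketch, the whole pipeline above can be run in one go through the Zimit Docker image; the image name and flags below follow the zimit README around the time of this revision and may differ between versions:

```shell
# Crawl the site, store the WARC files, and convert them to a ZIM in one run.
# (Usage sketch; check the zimit README for the flags of your version.)
docker run -v $PWD/output:/output ghcr.io/openzim/zimit zimit \
    --url https://example.com/ \
    --name example
```

The resulting ZIM file in the output directory can then be opened in any Kiwix reader.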

Player

  • The service worker (SW) is installed on the welcome page. If any page is loaded while the SW is not yet installed, a redirection to the homepage happens to load the SW, and the browser then automatically comes back to the original page. To achieve that, each page's HEAD node is modified at warc2zim conversion time to insert the appropriate piece of JavaScript.
  • In the reader Wabac.js, there is only one part specific to the ZIM content structure, and it is "RemoteWARCProxy". This part knows how to retrieve content from the specific ZIM storage backend. The rest of the code is unchanged.
  • Regarding URL rewriting itself, we have two kinds which are both data-driven:
    • The static URL rewriting which is done with Wombat
    • The Fuzzy matching which is done within the ServiceWorker
  • The URL rewriting is done at two levels:
    • When JavaScript code calls specific browser APIs, these calls are overridden and ultimately go through Wombat
    • When a URL is fetched, it goes through the service worker, which does the fuzzy matching and the URL rewriting.
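The SW bootstrap described above can be sketched as the kind of snippet warc2zim could inject into each page's HEAD. All names here (the function, the `redirect` parameter, the welcome-page path) are invented for illustration, not warc2zim's actual code:

```javascript
// Hypothetical sketch of the injected head snippet (names are illustrative).

// Build the welcome-page URL with the original page attached, so the
// visitor can be sent back once the service worker is installed.
function buildBootstrapUrl(homeUrl, originalUrl) {
  const u = new URL(homeUrl);
  u.searchParams.set('redirect', originalUrl);
  return u.toString();
}

// In a browser: if no service worker controls this page yet, detour via
// the homepage (which registers the SW), then come back automatically.
if (typeof window !== 'undefined' && 'serviceWorker' in navigator) {
  if (!navigator.serviceWorker.controller) {
    window.location.href = buildBootstrapUrl(
      window.location.origin + '/index.html', // assumed welcome-page path
      window.location.href
    );
  }
}
```

The check on `navigator.serviceWorker.controller` is how a page can tell whether a service worker already controls it.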
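The data-driven fuzzy matching can be sketched as a table of rewrite rules that the service worker applies before looking a URL up in the archive. The rules and names below are invented for the example; they are not pywb's or Wabac.js's actual rule set:

```javascript
// Illustrative fuzzy-matching rules: each rule canonicalizes a family of
// URLs that differ only in volatile parts (cache busters, session ids, ...).
const fuzzyRules = [
  // Drop cache-busting query parameters like ?_=1699999999 or &ts=42
  { match: /[?&](?:_|cb|ts)=\d+/g, replace: '' },
  // Drop a session-id path suffix (purely hypothetical pattern)
  { match: /;jsessionid=[0-9A-Za-z]+/g, replace: '' },
];

// Canonicalize a URL so a live request can be matched against the URL
// that was stored at crawl time.
function fuzzyCanonicalize(url) {
  let out = url;
  for (const { match, replace } of fuzzyRules) {
    out = out.replace(match, replace);
  }
  return out;
}

// A service worker would try the exact URL first, then the canonical form:
//   archive.get(url) ?? archive.get(fuzzyCanonicalize(url))
```

Because the rules are plain data, new URL families can be supported without touching the lookup code itself.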

Source code

  • Zimit (https://github.com/openzim/zimit), the packaging within a Docker image of both Browsertrix and warc2zim
  • Zimit frontend (https://github.com/openzim/zimit-frontend), the Web UI used for the Zimit SaaS solution youzim.it (https://youzim.it)

Questions

Kelson

  • How well maintained is the Python server pywb? Who uses it?
  • Do we have other places, on top of "RemoteWARCProxy", where we have JavaScript code dedicated to Kiwix in Wabac/Wombat?
  • Is URL rewriting really data-driven? Same question for fuzzy matching?