Difference between revisions of "Zimit"

From openZIM
Jump to navigation Jump to search
Line 20: Line 20:


* the SW is installed on the welcome page. If any page is loaded and the SW still not loaded, a redirection to the homepage will happen to load the SW and then automatically come back to the original page. Do achieve to do that, each page HEAD node is modify to insert the appropriate piece of Javascript at the time of the warc 2 zim conversion.
* the SW is installed on the welcome page. If any page is loaded and the SW still not loaded, a redirection to the homepage will happen to load the SW and then automatically come back to the original page. Do achieve to do that, each page HEAD node is modify to insert the appropriate piece of Javascript at the time of the warc 2 zim conversion.
* In the reader, in Wabac.js, there is a specific part related to ZIM content structure and this is in "RemoteWARCProxy", for the rest the code is the same as before.


== Source code ==
== Source code ==

Revision as of 12:54, 22 May 2023

Zimit is a tool allowing to create a ZIM file of "any" Web site.

Context

openZIM provides many scrapers software solutions for dedicated source of content like: TED, Wikipedia (Mediawiki, Project Gutenberg, ...). This is a great solution to provide quality ZIM files, but developing and maintaining each of them is costly.

Zimit is our approach to allow to scrape "random" Web site and get an acceptable snapshot to be used offline.

One important point is that specific javascript embeded pieces code, in particular to read videos, continues to work.

Principle

Rhe principles of Zimit are:

  • Crawl the remote WebSite to retrieve all the necessary content
  • Save all the retrieved content in WARC file(s)
  • Convert WARC file(s) to one ZIM file (this implies embedding a reader in the WARC file, so this is a kind of offline Web App)
  • Read the ZIM file in any Kiwix reader

Player

  • the SW is installed on the welcome page. If any page is loaded and the SW still not loaded, a redirection to the homepage will happen to load the SW and then automatically come back to the original page. Do achieve to do that, each page HEAD node is modify to insert the appropriate piece of Javascript at the time of the warc 2 zim conversion.
  • In the reader, in Wabac.js, there is a specific part related to ZIM content structure and this is in "RemoteWARCProxy", for the rest the code is the same as before.

Source code