Difference between revisions of "Zimit"

6,974 bytes added ,  16:11, 23 May 2023
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file
* [https://github.com/webrecorder/wabac.js Wabac.js] is the ServiceWorker based reader for the content
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim
* [https://github.com/openzim/zimit-frontend Zimit frontend], which is the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]
== Current implementation workflow (to be confirmed) ==
=== At creation time ===
* Browsertrix creates a WARC file (how exactly is not detailed here).
* warc2zim converts the WARC file into a ZIM file. To do so, it:
** Loops over all records in the WARC file.
** For each record:
*** Extracts the URL: "urlkey" if present, else "WARC-Target-URI".
*** Adds a `H/<url>` entry containing the headers of the record.
*** Adds an `A/<url>` entry containing the content (payload) of the record, unless the record is a revisit. If the content is HTML, it also inserts a small JS script which redirects to index.html if the SW is not loaded.
** Adds the wabac.js replayer (which also "contains" wombat).
** Adds a "front page" (index.html) which loads the wabac SW when opened.
** Adds a "top frame" page with an iframe and a small script (mainly in charge of syncing history and icons).
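The per-record part of the steps above can be sketched as follows. This is only an illustration, not the warc2zim code: the <code>Record</code> class, the <code>convert</code> function and the redirect snippet are hypothetical, a record is reduced to a flat dict of headers plus a payload, and the replayer, front page and top frame entries are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical, simplified stand-in for a WARC record."""
    headers: dict
    payload: bytes = b""
    urlkey: Optional[str] = None


# Placeholder for the small script that redirects to index.html
# when the service worker is not loaded (content is an assumption).
REDIRECT_SNIPPET = b"<script>/* redirect to index.html if no SW */</script>"


def record_url(record):
    # "urlkey" if present, else "WARC-Target-URI"
    return record.urlkey or record.headers["WARC-Target-URI"]


def convert(records):
    """Return a dict mapping ZIM paths (H/<url>, A/<url>) to bytes."""
    entries = {}
    for record in records:
        url = record_url(record)
        # H/<url>: the headers of the record (serialisation is an assumption)
        lines = [f"{name}: {value}" for name, value in record.headers.items()]
        entries[f"H/{url}"] = "\r\n".join(lines).encode("utf-8")
        # A/<url>: the payload, skipped for revisit records
        if record.headers.get("WARC-Type") != "revisit":
            payload = record.payload
            if "text/html" in record.headers.get("Content-Type", ""):
                payload = REDIRECT_SNIPPET + payload
            entries[f"A/{url}"] = payload
    return entries
```

A revisit record therefore only produces a <code>H/</code> entry, which is what the reader later relies on to find the target record.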
=== At reading ===
* The user goes to a page. If the SW is not loaded, the inserted script redirects to index.html, which loads the SW, registers itself as a new collection (using the "top frame" as top page) and redirects to the requested page once the collection is added.
* The SW handles the URL. It:
** Finds the right collection (based on the book name).
** Calls coll.handleRequest, which:
*** Calls `getReplayResponse`, which:
**** Calls store.getResource(), which:
***** Requests `H/URL` and, if it is not found, generates "fuzzy URLs" and requests `H/fuzzyurl` for each of them. It stops as soon as it finds a `H/url`. If no header is found, it returns null.
***** If the header is a revisit, resolves it (by doing another request to `H/target_url`).
***** At the end, gets the payload by requesting `A/final_url`.
***** Builds an ArchiveResponse with the header and the payload.
*** Inserts a JS script loading wombat in the HTML content.
*** Rewrites the ArchiveResponse content.
*** Merges headers from the ArchiveResponse into the SW response ("range", "date", "set-cookie", ...).
*** Returns the response to the iframe.
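The lookup done by store.getResource() can be sketched as below. This is a sketch under assumptions, not the wabac.js code: the ZIM file is modelled as a dict, headers are dicts, the single fuzzy rule (dropping the query string) is hypothetical, and using the target record's header after resolving a revisit is a guess.

```python
def fuzzy_urls(url):
    """Hypothetical fuzzy rule: retry without the query string.

    The real rules live in the wabac.js fuzzy matcher.
    """
    base = url.split("?", 1)[0]
    if base != url:
        yield base


def get_resource(zim, url):
    """Return (header, payload) for url, or None if no header is found."""
    header = None
    found_url = None
    # Try the exact URL first, then each fuzzy URL; stop at the first hit.
    for candidate in [url, *fuzzy_urls(url)]:
        header = zim.get(f"H/{candidate}")
        if header is not None:
            found_url = candidate
            break
    if header is None:
        return None
    # A revisit points to another record: resolve it with a second H/ lookup.
    if header.get("WARC-Type") == "revisit":
        found_url = header["WARC-Refers-To-Target-URI"]
        header = zim.get(f"H/{found_url}")
        if header is None:
            return None
    # Finally fetch the payload stored under A/<final url>.
    payload = zim.get(f"A/{found_url}")
    return header, payload
```

Each failed candidate costs one request to the ZIM file, so the number of fuzzy rules directly bounds the number of lookups for a missing page.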
   
Wombat is loaded in all pages as a web worker. JS code is wrapped in a wombat context which rewrites outgoing URLs (fetch, location changes, ...) before the request itself is made.
=== Comparison with pywb ===
The workflow of pywb (with a WARC archive) is almost the same, with a small simplification: the rewriting and the fuzzy matching are done by the server itself, without a service worker.
* The user goes to a specific URL (helped by the frontend UI).
* pywb gets the URL and searches for the record (potentially with fuzzy matching).
* Once it has the record, it rewrites the payload and returns a response (merging the record's headers into the response).
Rewriting the payload is the same as what is done in the SW (replace HTML/CSS links and insert the wombat loader).
At the end, all links are relative (or point to the server).
=== Rewriting urls ===
See documentation at https://pywb.readthedocs.io/en/latest/manual/rewriter.html
All(?) the rewriting has the following form:
<code>abs_url</code> -> <code><server_host>/<collection>/<timestamp><modifier>/abs_url</code>.
* <code><collection></code> is the name of the "set of records" (one WARC? several?). In our case, it is the book name.
* <code><timestamp></code> is necessary as a collection may contain records from different scrapes. In our case we have one scrape per book (and so per collection).
* <code><modifier></code> is how we should rewrite the content:
** <code>id_</code> means no modification (identical).
** <code>mp_</code> means main page. As the modification is based on the content type, <code>mp_</code> can be applied to all types of content.
** <code>js_</code> and <code>cs_</code> force a modification as JS or CSS even if the content type is something else (HTML).
** <code>im_</code>, <code>oe_</code>, <code>if_</code>, <code>fr_</code> are historical modifiers, same as <code>mp_</code>.
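As an illustration of the URL scheme above, a minimal helper could look like this. The function name <code>rewrite_url</code> and the example host, collection and timestamp are made up for the example; only the URL layout and the list of modifiers come from this page.

```python
# Modifiers listed above; anything else is rejected in this sketch.
MODIFIERS = {"id_", "mp_", "js_", "cs_", "im_", "oe_", "if_", "fr_"}


def rewrite_url(abs_url, server_host, collection, timestamp, modifier="mp_"):
    """Build <server_host>/<collection>/<timestamp><modifier>/<abs_url>."""
    if modifier not in MODIFIERS:
        raise ValueError(f"unknown modifier: {modifier}")
    return f"{server_host}/{collection}/{timestamp}{modifier}/{abs_url}"


rewritten = rewrite_url(
    "https://example.com/style.css",
    server_host="http://localhost:8080",  # example value
    collection="mybook",                  # example book name
    timestamp="20230523161100",           # example timestamp
    modifier="cs_",
)
# rewritten == "http://localhost:8080/mybook/20230523161100cs_/https://example.com/style.css"
```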
=== Rewriting the content ===
* CSS rewriter: rewrites links.
* JS rewriter: rewrites a few links but mostly wraps the code in a "wombat context".
* JSONP rewriter: may rewrite the content based on the request's query string (!!!!!).
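To show why the JSONP case depends on the request, here is a rough sketch of such a rewrite. It is not the pywb implementation: it assumes the callback name is passed in a <code>callback</code> query parameter and that the stored body starts with the callback name followed by an opening parenthesis; <code>rewrite_jsonp</code> is a hypothetical name.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumed shape of a JSONP body: "<callback name>(" at the very start.
JSONP_BODY = re.compile(r"^\s*([\w.$]+)\s*\(")


def rewrite_jsonp(body, request_url):
    """Replace the recorded callback name with the one from the request."""
    params = parse_qs(urlsplit(request_url).query)
    callbacks = params.get("callback")
    if not callbacks:
        return body  # nothing to rewrite without a callback parameter
    match = JSONP_BODY.match(body)
    if not match:
        return body  # body does not look like JSONP
    return body[:match.start(1)] + callbacks[0] + body[match.end(1):]
```

The recorded body was generated for the callback name used at scrape time, while the page at reading time may ask for a different (often random) name, so the correct output cannot be computed without the request URL.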
== Proposed solution ==
=== At creation ===
Use the pywb rewriter module to statically rewrite the content (record payload) at ZIM creation time.
A few things can be done statically:
* <code><timestamp></code>: we could remove it (or we know it).
* <code><modifier></code>: depends on the content type, which we know.
* <code><url></code>: is in the record's header.
A few things may not be possible to do statically:
* <code><server_host></code>: depends on the production environment (host name, root prefix).
* <code><collection></code>: depends on the ZIM filename (we may change this to base ourselves on the zimid?).
* <code><requested_url></code>: in case of a "revisit", pywb and wabac return the content of another record. Is the content rewritten based on the requested URL or on the record URL? In the same way, in case of fuzzy matching, the requested URL is different from the record URL.
* JSONP needs access to the "callback" query string value of the request.
We could do the static rewriting by setting placeholders (<code>${RW_SERVER_HOST}</code>, <code>${RW_URL}</code>, ...) for things that need to be rewritten dynamically.
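The dynamic half of this idea could be as small as the sketch below. The placeholder syntax follows the examples above; <code>fill_placeholders</code> and <code>RW_COLLECTION</code> are hypothetical, since the final list of placeholders is not defined yet.

```python
import re

# Only ${RW_...} names are treated as placeholders, so that other "${...}"
# sequences (e.g. JS template literals in the payload) are left untouched.
PLACEHOLDER = re.compile(r"\$\{(RW_[A-Z_]+)\}")


def fill_placeholders(content, values):
    """Replace known placeholders; unknown ones are kept as-is."""
    def replace(match):
        return values.get(match.group(1), match.group(0))
    return PLACEHOLDER.sub(replace, content)


stored = '<a href="${RW_SERVER_HOST}/${RW_COLLECTION}/https://example.com/">x</a>'
served = fill_placeholders(stored, {
    "RW_SERVER_HOST": "http://localhost:8080",  # known only at serving time
    "RW_COLLECTION": "mybook",                  # hypothetical placeholder name
})
# served == '<a href="http://localhost:8080/mybook/https://example.com/">x</a>'
```

A real implementation would also need an escaping rule for payloads that legitimately contain the placeholder text.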
=== At reading ===
* Make libzim/libkiwix understand WARC headers (or should we define our own format and rewrite the WARC headers into it?).
* Make libzim/libkiwix do fuzzy matching using header info and fuzzy matching rules (defined in https://github.com/webrecorder/pywb/blob/main/pywb/rules.yaml and implemented in https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js and https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/index/fuzzymatcher.py).
* Once a payload is found, dynamically replace the placeholders.
* Return the content.
==== General workflow on kiwix-serve (WIP): ====
For a given request to <code>/content/<book_name>/<url></code>:
# Search for the ZIM file corresponding to <code><book_name></code>.
# Search for <code>C/<url></code>.
## If found, answer with the content of <code>C/url</code> (with dynamic rewrite). If <code>H/url</code> exists, set the HTTP response headers from <code>H/url</code>'s headers.
## If not found, search for <code>H/url</code>, as it may be a revisit.
### If found, replace <code>url</code> by the revisit target and repeat step 2.
# If step 2 gives no answer:
## If a fuzzy rules definition is present in the ZIM file (<code>W/fuzzy_rules</code>?), generate fuzzy URLs and repeat step 2 with each of them.
# If no fuzzy URL matches, answer 404.
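A possible reading of this workflow, written as a sketch. Everything here is an assumption made for illustration: the ZIM file is a dict, <code>H/url</code> entries are dicts with a hypothetical <code>revisit_target</code> key, <code>W/fuzzy_rules</code> is a list of (regex, replacement) pairs, and the dynamic rewrite of the content is left out.

```python
import re


def lookup(zim, url, max_hops=5):
    """Step 2: C/url first, then H/url in case it is a revisit."""
    for _ in range(max_hops):  # guard against revisit loops
        content = zim.get(f"C/{url}")
        headers = zim.get(f"H/{url}")
        if content is not None:
            return content, headers or {}
        if headers is None or "revisit_target" not in headers:
            return None
        url = headers["revisit_target"]
    return None


def serve(zim, url):
    """Return (status, headers, content) for a requested url."""
    answer = lookup(zim, url)
    if answer is None:
        # Step 3: try each fuzzy URL, if the ZIM file ships fuzzy rules.
        for pattern, replacement in zim.get("W/fuzzy_rules", []):
            fuzzy_url = re.sub(pattern, replacement, url)
            if fuzzy_url != url:
                answer = lookup(zim, fuzzy_url)
                if answer is not None:
                    break
    if answer is None:
        return 404, {}, b""  # step 4
    content, headers = answer
    return 200, headers, content
```

With a ZIM file that has neither <code>H</code> entries nor <code>W/fuzzy_rules</code>, <code>serve</code> reduces to a plain <code>C/url</code> lookup.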
 
This workflow should be compatible with existing ZIM files (which have neither <code>H</code> entries nor <code>W/fuzzy_rules</code>).
Searching for <code>C/url</code> first avoids having to add a <code>H/url</code> entry for the common case, even for warc2zim files.
It also allows potential fuzzy matching for other ZIM files (from specific scrapers).
It should be pretty "easy" to implement if we define well:
* The possible placeholders (<code>${RW_SERVER_HOST}</code>, ...) and their values.
* The format of the <code>H/url</code> header entry (just a subset of the headers to apply?).
* The fuzzy rules (how to generate fuzzy URLs from the data-driven fuzzy_rules). https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js seems to be a good "specification".


== Questions ==