Difference between revisions of "Zimit"

4,251 bytes added ,  12:31, 24 May 2023
 
(9 intermediate revisions by 2 users not shown)
Line 53: Line 53:
=== At reading ===


* The user goes to a page. If the SW is not loaded, an inserted script redirects to index.html, which loads the SW, registers itself as a new collection (using "top frame" as top page) and redirects to the requested page once the collection is added.
* The SW handles the URL. It does the following:
** Finds the right collection (based on the book name)
** Calls [https://github.com/webrecorder/wabac.js/blob/main/src/collection.js#L56 coll.handleRequest], which:
*** calls <code>[https://github.com/webrecorder/wabac.js/blob/main/src/collection.js#L275 getReplayResponse]</code>, which:
**** calls [https://github.com/webrecorder/wabac.js/blob/main/src/remotewarcproxy.js#L21 store.getResource()], which:
***** Requests <code>H/url</code>. If it is not found, generates "fuzzy urls" and requests <code>H/fuzzyurl</code> for each of them, stopping at the first <code>H/(fuzzy)url</code> found. If no header is found, returns null.
***** If the header is a revisit, resolves it (by doing another request to <code>H/target_url</code>)
***** At the end, gets the payload by requesting <code>A/final_url</code>
***** Builds an <code>ArchiveResponse</code> with header and payload
*** inserts a js script loading wombat into the html content.
*** rewrites the ArchiveResponse content.
*** merges headers from the ArchiveResponse into the SW response (<code>range</code>, <code>date</code>, <code>set-cookie</code>, ...)
*** returns the response to the requester
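
The lookup done by <code>store.getResource()</code> can be sketched as follows. This is only an illustration of the steps listed above, not the wabac.js implementation: the <code>entries</code> dict, the <code>revisit-target</code> header key and the <code>fuzzy_urls</code> callback are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ArchiveResponse:
    """Hypothetical stand-in for the ArchiveResponse built by wabac.js."""
    url: str
    headers: dict
    payload: bytes


def get_resource(
    url: str,
    entries: dict,
    fuzzy_urls: Callable[[str], Iterable[str]],
) -> Optional[ArchiveResponse]:
    """Resolve `url` against a dict mapping "H/..." and "A/..." paths to content."""
    # Try the exact url first, then each fuzzy candidate; stop at the first hit.
    found = None
    for candidate in [url, *fuzzy_urls(url)]:
        if f"H/{candidate}" in entries:
            found = candidate
            break
    if found is None:
        return None  # no header found

    headers = entries[f"H/{found}"]
    final_url = found
    # Assumption: a revisit header names its target under "revisit-target".
    # Resolve it with one more H/ request, as described in the list above.
    if "revisit-target" in headers:
        final_url = headers["revisit-target"]
        headers = entries[f"H/{final_url}"]

    # The payload always comes from the final (possibly revisited) url.
    payload = entries[f"A/{final_url}"]
    return ArchiveResponse(url=final_url, headers=headers, payload=payload)
```

With a fuzzy rule that simply drops the query string, a request for <code>example.com/a?_=123</code> would fall back to the <code>H/example.com/a</code> entry and return the payload stored at <code>A/example.com/a</code>.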


   
   
Line 72: Line 72:
Wombat is loaded in all pages as a web worker. JS code is wrapped in a wombat context which rewrites outgoing urls (fetch, location changes, ...) before doing the request itself.


=== Comparison with pywb. ===
The workflow of pywb (with a WARC archive) is almost the same, with one small simplification: the rewriting and the fuzzy matching are done by the server itself, without a service worker.


Line 102: Line 102:


JS rewriter: rewrites a few links but mostly wraps the code in a "wombat context".
HTML rewriter: rewrites html and uses the CSS/JS rewriters as sub-rewriters for <code><style></code>/<code><script></code> tags

JSONP rewriter: may rewrite the content based on the request's querystring (!!!!!)
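
To illustrate why the JSONP case depends on the request, here is a minimal sketch of such a rewriter. It assumes the callback name is passed in a <code>callback</code> querystring parameter and that the archived body starts with the recorded callback name; the function name and the regular expression are made up for the example and are not taken from pywb or wabac.js.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Leading identifier followed by "(", e.g. `jQuery123_456(`.
JSONP_PREFIX = re.compile(r"^\s*([\w.$]+)\s*\(")


def rewrite_jsonp(content: str, request_url: str) -> str:
    """Replace the recorded callback name with the one in the request."""
    callbacks = parse_qs(urlsplit(request_url).query).get("callback")
    if not callbacks:
        return content  # nothing in the request to rewrite with
    match = JSONP_PREFIX.match(content)
    if match is None:
        return content  # body does not look like JSONP
    # Swap only the callback identifier, keep the rest of the body intact.
    return content[: match.start(1)] + callbacks[0] + content[match.end(1):]
```

The same stored body therefore produces a different response for each request, which is what makes this rewriting hard to do statically.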
Line 119: Line 121:


* <code><server_host></code>: depends on the production environment (host name, root prefix)
* <code><collection></code>: depends on the zim filename (we may change this to be based on the zimid?)
* <code><requested_url></code>: in case of a "revisit", pywb and wabac return the content of another record. Do they rewrite the content based on the requested url or on the record url? In the same way, in case of fuzzy matching, the request url is different from the record url.
* jsonp needs access to the "callback" querystring value of the request.

We could do the static rewriting by setting placeholders (<code>${RW_SERVER_HOST}</code>, <code>${RW_URL}</code>, ...) for things that need to be rewritten dynamically.
Wombat initialization would be inserted in the html page at this step. Wombat itself would be used exactly the same way we use it now (catching url changes/requests coming from js and rewriting them to "local" urls).
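
A minimal sketch of the placeholder idea, assuming the statically rewritten content is stored with <code>${RW_SERVER_HOST}</code> and <code>${RW_URL}</code> markers that the reader fills in at serving time. Python's <code>string.Template</code> happens to use the same <code>${NAME}</code> syntax; the stored snippet and the function name are invented for the example.

```python
from string import Template

# Content as it could be stored in the zim file after static rewriting.
STORED = 'var info = {prefix: "${RW_SERVER_HOST}/", url: "${RW_URL}", price: "$5"};'


def fill_placeholders(content: str, server_host: str, url: str) -> str:
    """Substitute the dynamic values; unknown `$...` sequences are left as-is."""
    return Template(content).safe_substitute(RW_SERVER_HOST=server_host, RW_URL=url)


print(fill_placeholders(STORED, "http://localhost:8080", "https://example.com/page"))
```

<code>safe_substitute</code> leaves unrelated <code>$</code> sequences alone but still turns <code>$$</code> into <code>$</code>, so a real implementation would probably need a stricter placeholder syntax or an escaping step at creation time.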


=== At reading ===
Line 149: Line 152:
This workflow should be compatible with existing zim files (no <code>H</code> nor <code>W/fuzzy_rules</code>).

Searching by <code>C/url</code> first avoids having to store a <code>H/url</code> for the common case, even for warc2zim files.

This allows potential fuzzy matching for other zim files (specific scrapers)
Line 157: Line 160:
* The possible placeholders (<code>${RW_SERVER_HOST}</code>, ...) and their values
* The header <code>H/url</code> format (just a subset of headers to apply?)
* The fuzzy rules (how to generate fuzzy urls from the data-driven fuzzy_rules). https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js seems to be a good "specification"
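
As a starting point for that discussion, here is one way data-driven fuzzy rules could be applied. The rule format (a <code>match</code> regular expression and a <code>replace</code> string) and the two example rules are hypothetical; they are not the format used by wabac.js.

```python
import re
from typing import Iterator

# Hypothetical rules, as they could be loaded from W/fuzzy_rules.
FUZZY_RULES = [
    {"match": r"[?&]_=\d+$", "replace": ""},  # drop a trailing cache-buster
    {"match": r"\?.*$", "replace": ""},       # drop the whole query string
]


def fuzzy_urls(url: str, rules: list) -> Iterator[str]:
    """Yield each distinct candidate url produced by a matching rule."""
    seen = {url}
    for rule in rules:
        candidate, count = re.subn(rule["match"], rule["replace"], url)
        # Skip rules that did not match or produced an already known url.
        if count and candidate not in seen:
            seen.add(candidate)
            yield candidate
```

Rules are tried in order, so the most specific rule should come first: the reader stops at the first candidate for which a header exists.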
 
== Notes/Questions: ==
 
* Revisit and redirect are different: a redirect makes kiwix-serve return a 302 to the target; a revisit makes kiwix-serve answer a 2xx with the content of the revisit target.
* We may store <code>H</code> revisits as redirect entries in the zim file anyway.
* Restrict <code>H/url</code> lookup to entries with a specific mimetype (there's no standard, we can set an <code>X-HTTP-Headers</code>)
* Maybe keep a switch (using a ''private'' tag?) to toggle content rewriting, as there is no reason to run it on ZIMs that don't need it.
* I can't think of any use (other than debugging) for exposing the fuzzy rules. Not having them in C would be another reason to allow pylibzim to access the X NS. Right now the only way is via ID, which is sort of a hack.
* Are we keeping the ''modifier'' prefix? You mention it at creation time but not afterwards. I understand it's '''mostly''' Content-Type based and used to toggle rewriting. I understand what's written above as: we'll conditionally rewrite some stuff but use the Content-Type instead. Correct?
* What will our entry paths look like? Full URL? <code>/<nowiki>https://developer.mozilla.org/en-US/</nowiki></code>? Current warc2zim stores a canonicalized version without scheme in the ZIM but the content and the SW use full URLs.
* We'll need to reconstruct the URL by concatenating any query parameter sent to the reader/kiwix-serve. We should be aware that this could be challenging on some websites, as a website could generate both <code>/home?article_id=32&lang=fr</code> and <code>/home?lang=fr&article_id=32</code>: in a normal dynamic server context these are the same, but in our static one they are not. The SW probably took care of that; we should look into how it was implemented.
* We won't have any chrome nor iframe anymore. MainPage would be the start URL.
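
One possible way to handle the query parameter ordering issue mentioned above is to canonicalize the url (both at creation time and at reading time) by sorting its parameters. This is a sketch of that idea, not a description of what the SW does; the function name is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return `url` with its query parameters sorted by name."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Stable sort on the name only, so repeated parameters keep their order.
    params.sort(key=lambda pair: pair[0])
    return urlunsplit(parts._replace(query=urlencode(params)))


print(canonical_url("/home?lang=fr&article_id=32"))
```

Both <code>/home?article_id=32&lang=fr</code> and <code>/home?lang=fr&article_id=32</code> then map to the same entry path. Note that <code>urlencode</code> re-encodes the values, so the canonical form may differ in escaping from the url found in the content.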


== Questions ==
Line 165: Line 180:
* Is URL rewriting really data-driven? Same question for fuzzy matching?
* Can we easily use Wombat without the rest of Wabac?
=== Matthieu ===
* What information is needed to rewrite html/css/js content? To what extent is it linked to the current request? I have identified the <code>callback</code> querystring. Others?
* Do we rewrite content using the url of the record or the requested url?
* pywb can work framed or frameless (https://pywb.readthedocs.io/en/latest/manual/configuring.html#framed-vs-frameless-replay). We are using a framed system with the SW. Why? Is it necessary?
* The pywb rewriter (https://github.com/webrecorder/pywb/tree/main/pywb/rewrite) and the wabac.js rewriter (https://github.com/webrecorder/wabac.js/tree/main/src/rewrite) seem to do the same things. What are the differences (apart from the implementation language)?
* Same question for the pywb fuzzymatcher (https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/index/fuzzymatcher.py) and the wabac fuzzymatcher (https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js)
=== renaud ===
* What makes the SW mandatory for replay? What is the constraint that requires it?
* If not restricted to the browser alone (i.e. with kiwix-serve or any kiwix reader serving as a dynamic backend), what key information does wombat require? Just the serving URL? Is the timestamp important?
* Fuzzy matching rules are found in wabac, wombat, pywb and warc2zim. Is this redundancy or are there multiple layers?
* What's the extent of wombat's role? How far does it go and how required is it?
* What are “prefix queries”? “prefix search”?
* How does the replayer cache system work? What's its main purpose? Can it be turned off?
* What's the difference between a ''page'' (as in pages.jsonl) and a <code>text/html</code> entry? Status code only?
* Is there a WARC testing suite with various use and corner cases?