Difference between revisions of "Zimit"

2,578 bytes added ,  12:31, 24 May 2023
 
(6 intermediate revisions by 2 users not shown)
Line 53: Line 53:
=== At reading ===
=== At reading ===


* User goes to a page. If SW is not loaded, inserte script redirect to index.html, which load SW and  register itself as new collection (using "top frame" as top page) and redirect to request page once collection is added.
* User goes to a page. If SW is not loaded, inserted script redirect to index.html, which load SW and  register itself as new collection (using "top frame" as top page) and redirect to request page once collection is added.
* SW handle the URL, it does:
* SW handle the URL, it does:
** Find the rigth collection (base on book name)
** Find the right collection (base on book name)
** make [https://github.com/webrecorder/wabac.js/blob/main/src/collection.js#L56 coll.handleRequest]
** make [https://github.com/webrecorder/wabac.js/blob/main/src/collection.js#L56 coll.handleRequest]
*** does `[https://github.com/webrecorder/wabac.js/blob/main/src/collection.js#L275 getReplayResponse]`
*** does `[https://github.com/webrecorder/wabac.js/blob/main/src/collection.js#L275 getReplayResponse]`
Line 125: Line 125:
* jsonp need access to the "callback" querystring value of the request.
* jsonp need access to the "callback" querystring value of the request.


We could do the static rewriting by setting placeholder (<code>${RW_SERVER_HOST}</code>, <code>${RW_URL}</code>, ...) for things that needs to be rewritten dynamically.


We could do the static rewriting by setting placeholder (<code>${RW_SERVER_HOST}</code>, <code>${RW_URL}</code>, ...) for things that needs to be rewritten dynamically.
Wombat initialization would be inserted in the html page at this step. Wombat itself will be used exactly the same way we use it now (catching url changes/requests coming from js and rewriting them to "local" urls).
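A minimal sketch of the placeholder idea, assuming the statically rewritten content is stored with <code>${RW_...}</code> markers and that the reader fills them in at serving time. The function name <code>fill_placeholders</code>, the regular expression and the example values are hypothetical, not part of warc2zim or kiwix-serve.

```python
import re

# Only placeholders of the form ${RW_SOMETHING} are touched; any other
# "${...}" (for instance a JS template literal) is left as it is.
PLACEHOLDER = re.compile(r"\$\{(RW_[A-Z_]+)\}")


def fill_placeholders(content: str, values: dict) -> str:
    """Replace known ${RW_*} placeholders with request-time values."""

    def replace(match):
        name = match.group(1)
        # Unknown RW_* names are kept untouched rather than dropped.
        return values.get(name, match.group(0))

    return PLACEHOLDER.sub(replace, content)


# Hypothetical stored content, as it could look after static rewriting.
stored = (
    '<a href="${RW_SERVER_HOST}/example.org/page">link</a> '
    '<script>var here = "${RW_URL}"; var tpl = `${other}`;</script>'
)

served = fill_placeholders(
    stored,
    {
        "RW_SERVER_HOST": "http://localhost:8080/book",  # assumed serving prefix
        "RW_URL": "https://example.org/",  # assumed original URL of the entry
    },
)
print(served)
```

Restricting the substitution to an <code>RW_</code> prefix is a design choice of this sketch: it keeps the dynamic step cheap and avoids touching <code>$</code> sequences that legitimately belong to the archived content.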


=== At reading ===
=== At reading ===
Line 161: Line 162:
* The fuzzy rules (how to generate fuzzy url from the data driven fuzzy_rules). https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js seems to be a good "specification"
* The fuzzy rules (how to generate fuzzy url from the data driven fuzzy_rules). https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js seems to be a good "specification"


== Notes: ==
== Notes/Questions: ==


* Revisit and redirect are different: redirect make kiwix-serve return a 302 to the target. revisit make kiwix-serve answer a 2xx with the content of the target revisit.
* Revisit and redirect are different: redirect make kiwix-serve return a 302 to the target. revisit make kiwix-serve answer a 2xx with the content of the target revisit.
* We may anyway store <code>H</code> revisit as redirect entry in the zim file.
* We may anyway store <code>H</code> revisit as redirect entry in the zim file.
*Restrict <code>H/url</code> lookup for entries with a specific mimetype (there's no standard, we can set an <code>X-HTTP-Headers</code>)
*Maybe keep a switch (using a ''private'' tag ?) to toggle content rewriting as there is no reason to run that on ZIMs that don't need it.
*I can't think of any use (other than debugging) for exposing the fuzzy rules. Not having them in C would be another reason to allow pylibzim to access the X NS. Right now the only way is via ID, which is sort of a hack.
*Are we keeping the ''modifier'' prefix ? You mention it at creation time but don't afterwards. I understand it's '''mostly''' Content-Type based and used to toggle rewriting. I understand what's written above as: we'll conditionally rewrite some stuff but use the Content-Type instead. Correct?
*What will our entry paths look like? Full URL? <code>/<nowiki>https://developer.mozilla.org/en-US/</nowiki></code> ? Current warc2zim stores a canonicalized version without scheme on ZIM but the content and SW uses full URLs.
*We'll need to reconstruct the URL by concatenating any query parameter sent to the reader/kiwix-serve. We should be aware that this could be challenging on some websites, as a website could generate both <code>/home?article_id=32&lang=fr</code> and <code>/home?lang=fr&article_id=32</code>: in a normal dynamic server context these are the same, but in our static one they are not. The SW probably took care of that; we should look into how it was implemented.
*We won't have any chrome nor iframe anymore. MainPage would be the start URL.
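As an illustration of the query-parameter ordering concern in the notes above, here is a minimal sketch that maps both orderings to a single lookup path by sorting the parameters. This is an assumption about one possible approach, not a description of how the SW or wabac.js handles it; the function name <code>canonical_path</code> is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_path(url: str) -> str:
    """Return path + query with query parameters in sorted order."""
    parts = urlsplit(url)
    # keep_blank_values=True so that "?a=&b=1" does not silently lose "a".
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # Scheme, host and fragment are dropped: only path and query are kept.
    return urlunsplit(("", "", parts.path, urlencode(params), ""))


first = canonical_path("/home?article_id=32&lang=fr")
second = canonical_path("/home?lang=fr&article_id=32")
print(first, second)
```

Both calls yield the same string, so the same entry would be found whichever ordering the page emits. Sorting also reorders repeated keys by value, which could matter on sites where parameter order is significant, so this would need checking against real captures.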


== Questions ==
== Questions ==
Line 177: Line 185:
* What are the information needed to rewrite html/css/js content ? At which point it is linked to the current request ?  I have identified <code>callback</code> querystring. Other ?
* What are the information needed to rewrite html/css/js content ? At which point it is linked to the current request ?  I have identified <code>callback</code> querystring. Other ?
* Do we rewrite content using the url of the record or the requested url ?
* Do we rewrite content using the url of the record or the requested url ?
* pywb can work with framed or frameless (https://pywb.readthedocs.io/en/latest/manual/configuring.html#framed-vs-frameless-replay). We are using a framed system with SW. Why ? Is it necessary ?
* pywb can work framed or frameless (https://pywb.readthedocs.io/en/latest/manual/configuring.html#framed-vs-frameless-replay). We are using a framed system with SW. Why ? Is it necessary ?
*pywb rewriter (https://github.com/webrecorder/pywb/tree/main/pywb/rewrite) and wabac.js rewrite (https://github.com/webrecorder/wabac.js/tree/main/src/rewrite) seems to do the same things. What are the differences (apart from implementation languages)  ?
*pywb rewriter (https://github.com/webrecorder/pywb/tree/main/pywb/rewrite) and wabac.js rewrite (https://github.com/webrecorder/wabac.js/tree/main/src/rewrite) seems to do the same things. What are the differences (apart from implementation languages)  ?
*Same question for pywb fuzzymatcher (https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/index/fuzzymatcher.py) and wabac fuzzymatcher (https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js)
*Same question for pywb fuzzymatcher (https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/index/fuzzymatcher.py) and wabac fuzzymatcher (https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js)
=== renaud ===
* What makes the SW mandatory to replay? What is the constraint that requires it?
* If not restricted to the sole browser (i.e. kiwix-serve or any kiwix reader serving as a dynamic backend), what key information does wombat require? Just the serving URL? Is the timestamp important?
* Fuzzy Matching rules are found in wabac, wombat, pywb and warc2zim. Is this redundancy or are there multiple layers?
* What's the extent of wombat's role? How far does it go and how required is it?
* What are “prefix queries”? “prefix search”?
* How does the replayer cache system work? What's its main purpose? Can it be turned off?
*What's the difference between a ''page'' (as in pages.jsonl) and a `text/html` entry? Status Code only?
* Is there a WARC testing suite with various use and corner cases ?