Difference between revisions of "Content team"

Fix formatting of recipe periodicity
(Fix formatting of recipe periodicity)
 
(12 intermediate revisions by 3 users not shown)
Line 68: Line 68:
* Any recipe should first run successfully in dev before being put in production
* Hardware resources should be saved
** Handling of server-side errors
*** For HTML content, HTTP 4xx and HTTP 5xx responses should either lead to a scraper error (exit), or the content could be replaced by a placeholder:
*** Explaining that this is a server-side error and not a scraper error
*** Sharing a few details about the nature of the error
*** Explaining that this might be temporary
*** Ideally linking to our ticketing system for further information. This implies that the list of tolerated errors is clearly documented in the code.
*** A low tolerance, expressed both as a percentage of the total number of pages AND as a fixed value, should be hardcoded in the scraper
*** The list of errors should be shared at the end of the scraping process
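
The tolerance rule above could be sketched as follows. This is only an illustration, not code from an existing scraper: the <code>ErrorBudget</code> class, its field names and the default limits (1% and 50 pages) are assumptions, and "percentage AND fixed value" is read here as "the scrape aborts as soon as either limit is exceeded".

```python
from dataclasses import dataclass, field


@dataclass
class ErrorBudget:
    """Tracks tolerated server-side errors during a scrape (hypothetical helper)."""

    total_pages: int
    max_error_ratio: float = 0.01  # assumed limit: at most 1% of all pages
    max_error_count: int = 50      # assumed limit: and never more than 50 pages
    errors: list = field(default_factory=list)

    def exceeded(self) -> bool:
        """True when the percentage limit or the fixed limit is exceeded."""
        ratio_limit = self.max_error_ratio * self.total_pages
        return len(self.errors) > min(ratio_limit, self.max_error_count)

    def record(self, url: str, status: int) -> None:
        """Record an HTTP 4xx/5xx response; abort once the budget is spent."""
        if not 400 <= status <= 599:
            raise ValueError(f"{status} is not an HTTP error status")
        self.errors.append((url, status))
        if self.exceeded():
            raise RuntimeError(
                f"too many server-side errors: {len(self.errors)} "
                f"for {self.total_pages} pages"
            )

    def report(self) -> str:
        """Summary of all recorded errors, to share at the end of the scrape."""
        lines = [f"{status} {url}" for url, status in self.errors]
        return f"{len(self.errors)} tolerated error(s)\n" + "\n".join(lines)
```

A scraper would call <code>record()</code> each time it substitutes a placeholder for a failed page, and print <code>report()</code> once the scraping process ends.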


=== Library Management ===
Line 95: Line 103:
=== Scraping ===


==== Change a recipe/ZIM warehouse path ====
Changing the warehouse path of a recipe, once a first ZIM has already been produced, is not a negligible action. It has an impact on the library and on the [https://imager.kiwix.org Kiwix Hotspot Imager]. Therefore, actions must be coordinated.

==== Create a Youtube recipe ====
 
Follow these steps to create a new recipe that scrapes videos from a Youtube channel/username or from one or more playlists.
 
It’s recommended to clone an existing Youtube recipe.
 
* In "Content settings":
# Create the recipe name as per [https://github.com/openzim/overview/wiki/Naming-Convention the naming conventions].
# In the Language space, choose the language(s) of the Youtube page you are creating the recipe for.
# From Category space, choose (other)
# From warehouse path space, always choose "/.hidden/.dev" the first time, in order to test the resulting ZIM file.
# If the file is tested and all is correct, update the recipe with the proper path "videos". Otherwise, tune the recipe and relaunch a task.
# Make sure the Status is set to Enabled.
# You can choose Periodicity to be monthly or quarterly. Use monthly by default.
 
* In "Task settings":
# In Offliner space choose: Youtube
# In platform space choose Youtube.
# Keep the rest the same with no change.
 
* In "Scraper settings: youtube command flags":
# In Playlist mode: choose (Not Set) if you are doing the recipe for a whole channel.
# If you are doing the recipe for a playlist, choose (Set).
# In Type: choose (Channel) or (Playlist), depending on the file you need.
# In Youtube ID: type the ID of the channel or the playlist.
# For the API Key: there is a list of keys, mostly allocated according to the size of the channel or playlists; ask for the list to choose the appropriate API Key.
# In ZIM Name: enter the recipe name as per [https://github.com/openzim/overview/wiki/Naming-Convention the naming conventions].
# In Title: type the name you want for the output file.
# In Description: type a short description of the ZIM file you need.
# Leave Optimisation Cache URL as it is (cloned from the old recipe).
# Leave the rest of the fields empty or as per the cloned recipe.
# Finally, click on (Update offliner details) at the bottom.
# Review all your entries once again, then go back to the top of the page and click on (Request).
# After about an hour, check whether the recipe failed or succeeded (or check the next day if the source website is large).
# If successful, go to [https://dev.library.kiwix.org/ dev.library.kiwix.org] and check your created file: check the size and check that the file works properly. If the file does not appear, wait a bit, as updates are made every 15 minutes.
# If the file looks good and complete, go back to your recipe and, in warehouse path space, change (/.hidden/.dev) to the proper category for your file content (Wikipedia, Wikihow, … etc).
# Click on (Update offliner details) and then click on (Request) again.
# Finally, check the file in [https://library.kiwix.org/ Kiwix Content Library]. If all is good, do not forget to go back to [https://github.com/openzim/zim-requests/issues the initial ticket], post the link to the output file and close the ticket.
==== Choose proper recipe periodicity ====
'''''This is a draft proposal'''''
 
When we configure a recipe on the Zimfarm, we have to decide on the periodicity at which the recipe will be run.

The following rules should be followed, unless an exception is justified:

* by default, the periodicity is quarterly
* recipes linked to content which is very regularly updated might switch to monthly updates; this is typically the case for all recipes linked to Wikimedia wikis
* recipes known to take a lot of time to complete, to consume a lot of resources, or to be linked to content not regularly updated should be switched to a bi-annual or annual periodicity (at the discretion of the recipe maintainer)
* recipes in DEV (pushing to /.hidden/dev) have a manual periodicity:
** the person setting up the recipe takes care of updating the ZIM when needed; a manual process helps to avoid side effects during testing, such as testers not all testing the same ZIM
** we aim to keep the time during which a recipe is in DEV to a minimum
** we have too many recipes in DEV which are failing and not yet disabled; if their update were automated, they would continuously waste resources
* recipes building ZIMs for a specific customer have a manual periodicity by default, unless we have a clear maintenance contract paying us to update ZIMs at a given interval, or unless the ZIM in question is of general interest (but then we usually do not consider this ZIM to be linked to a specific customer)
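
The rules above can be read as a small decision procedure. The sketch below is only an illustration: the function name, its flags and the returned strings are hypothetical (they are not Zimfarm API values), and the precedence between the "heavy" and "frequently updated" rules is an assumption, since the proposal does not rank them.

```python
def choose_periodicity(
    *,
    in_dev: bool = False,
    customer_specific: bool = False,
    maintenance_contract: bool = False,
    general_interest: bool = False,
    frequently_updated: bool = False,
    heavy_or_rarely_updated: bool = False,
) -> str:
    """Suggest a recipe periodicity from the draft rules (hypothetical helper)."""
    # Recipes still in DEV are always updated by hand.
    if in_dev:
        return "manually"
    # Customer ZIMs stay manual unless a contract or general interest applies.
    if customer_specific and not (maintenance_contract or general_interest):
        return "manually"
    # Assumption: resource cost wins over update frequency when both apply.
    if heavy_or_rarely_updated:
        return "biannually"  # or "annually", at the maintainer's discretion
    if frequently_updated:
        return "monthly"
    return "quarterly"
```

For instance, <code>choose_periodicity(frequently_updated=True)</code> suggests a monthly update for a Wikimedia wiki, while any recipe with <code>in_dev=True</code> stays manual whatever its other properties.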
 
==== Change a recipe/ZIM warehouse path and/or a ZIM name ====
Changing the warehouse path of a recipe, once a first ZIM has already been produced, is not a negligible action. It has an impact on the library and on the [https://imager.kiwix.org Kiwix Hotspot Imager]. Therefore, actions must be coordinated.


It is hence mandatory that, whenever a recipe needs to change its warehouse path, [https://github.com/openzim/zim-requests a ticket is opened at openzim/zim-requests on GitHub] and assigned to @RavanJAltaie, @benoit74 and @rgaudin for proper coordination:
Line 102: Line 164:
# Disable the recipe in Zimfarm (''a priori'' @RavanJAltaie)
# Wait until there are no more in-progress Orders in the Kiwix Hotspot Imager that include those ZIMs (''a priori'' @rgaudin)
# Put Kiwix Hotspot Imager in maintenance (''a priori'' @rgaudin)
# Move existing ZIMs on the file server (''a priori'' @benoit74)
# Trigger Kiwix Hotspot Imager catalog refresh immediately afterwards, so that any Order created from then on uses the new URL (''a priori'' @rgaudin)
# Trigger catalog refresh immediately afterwards, so that any Imager Order / download created from then on uses the new URL (''a priori'' @rgaudin)
# Update the warehouse path in Zimfarm (''a priori'' @RavanJAltaie)
# Remove Kiwix Hotspot Imager from maintenance (''a priori'' @rgaudin)
# Update the warehouse path / ZIM name in Zimfarm (''a priori'' @RavanJAltaie)
# Re-enable the recipe in Zimfarm (''a priori'' @RavanJAltaie)


Line 160: Line 224:


It is hence mandatory that, whenever a recipe/ZIM needs to be deleted, [https://github.com/openzim/zim-requests a ticket is opened at openzim/zim-requests on GitHub] and assigned to both @benoit74 and @rgaudin for proper coordination:
# Wait until there are no more in-progress Orders in the Kiwix Hotspot Imager that include those ZIMs (''a priori'' @rgaudin)
# Add a delete marker on storage (if <code>zim/zimit/my_zim.zim</code> needs to be removed from the catalog, you have to "touch" <code>zim/zimit/my_zim.delete</code>)
# Delete ZIMs on the file server (''a priori'' @benoit74)
# Wait for the library catalog to be regenerated
# Trigger Kiwix Hotspot Imager catalog refresh right after the move, so that any Order created afterwards uses the new URL (''a priori'' @rgaudin)
# Check that there are no more in-progress Orders in the Kiwix Hotspot Imager that include those ZIMs
# Delete the ZIM (and the delete marker) from the file server
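
The delete marker step can be sketched as below. This is a minimal illustration that assumes the storage is reachable as a local filesystem path; the <code>add_delete_marker</code> helper is hypothetical and only mirrors the "touch" convention described in the steps above.

```python
from pathlib import Path


def add_delete_marker(zim_path: Path) -> Path:
    """Create an empty `.delete` marker next to a ZIM file and return its path."""
    if zim_path.suffix != ".zim":
        raise ValueError(f"not a ZIM file: {zim_path}")
    # zim/zimit/my_zim.zim -> zim/zimit/my_zim.delete
    marker = zim_path.with_suffix(".delete")
    marker.touch(exist_ok=True)  # same effect as the `touch` shell command
    return marker
```

The ZIM itself is left in place: according to the steps above, it is only deleted (together with the marker) once the catalog has been regenerated and no in-progress Order uses it.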


''Nota'': Moving a file to the archive has to be considered as a file deletion.


==== Demo a ZIM ====
From time to time, we need to demo a ZIM to a customer before releasing it into the wild. We have a demo instance at https://clients.library.kiwix.org/

Configuration is done through the file at https://github.com/kiwix/operations/blob/main/zim/clients-library/demos.yaml. Should you need to create a new demo, or modify or delete an existing one, simply open a PR with your modifications to this file and ask @rgaudin or @benoit74 for review.

Every ZIM can be referenced either by full path or by path up to the date, in which case the most recent one is automatically selected at each configuration redeployment.

Once merged, this configuration is automatically redeployed every hour, so give it a bit of time to be deployed after your PR is merged.

After that, send the demo URL to our client, e.g. <nowiki>https://demo.library.kiwix.org/</nowiki>'''home/'''my_demo/ ; this URL will be updated every time you modify the configuration or ZIMs get updated.

It is now '''forbidden''' to send a link to https://dev.library.kiwix.org to a customer: this infrastructure is not meant to be highly available and can be shut down at any time without notice.

''Nota:''

* all demos must have an expired_on property, and they are automatically removed at this date
* this infrastructure can serve any ZIM available on our storage (public and hidden ones)
* adding or removing a demo or ZIMs does not make any modification to the ZIMs stored in our storage

=== (draft) Notes from former Youtube workflow (Draft) ===
To create a new recipe for Youtube files:

'''It’s recommended to clone an existing Youtube recipe.'''

* Create the recipe name as per [https://github.com/openzim/overview/wiki/Naming-Convention the naming conventions].
* In the Language space, choose the language of the website you are creating the recipe for.
* From Category space, choose (other)
* From warehouse path space, always choose (/.hidden/.dev) the first time, in order to test the resulting file. If the file is tested and all is correct, update the recipe with the proper path (videos).
* Make sure the Status is set to Enabled.
* You can choose Periodicity to be monthly or quarterly.
* In Offliner space choose: Youtube
* In platform space choose Youtube.
* Keep the rest the same with no change.

'''In Youtube command flags:'''

* In Playlist mode: choose (Not Set) if you are doing the recipe for a whole channel.
* If you are doing the recipe for a playlist, choose (Set).
* In Type: choose (Channel) or (Playlist), depending on the file you need.
* In Youtube ID: type the ID of the channel or the playlist.
* For the API Key: there is a list of keys, mostly allocated according to the size of the channel or playlists; ask for the list to choose the appropriate API Key.
* In Zim Name: enter the recipe name as per [https://github.com/openzim/overview/wiki/Naming-Convention the naming conventions].
* In Title: type the name you want for the output file.
* In Description: type a short description of the ZIM file you need.
* Leave Optimisation Cache URL as it is (cloned from the old recipe).
* Leave the rest of the fields empty or as per the cloned recipe.
* Finally, click on (Update offliner details) at the bottom.
* Review all your entries once again, then go back to the top of the page and click on (Request).
* After about an hour, check whether the recipe failed or succeeded (or check the next day if the source website is large).
* If successful, go to [https://dev.library.kiwix.org/ dev.library.kiwix.org] and check your created file: check the size and check that the file works properly. If the file does not appear, wait a bit, as updates are made every 15 minutes.
* If the file looks good and complete, go back to your recipe and, in warehouse path space, change (/.hidden/.dev) to the proper category for your file content (Wikipedia, Wikihow, … etc).
* Click on (Update offliner details) and then click on (Request) again.
* Finally, check the file in https://library.kiwix.org/. If all is good, do not forget to go back to the initial ticket (most likely at zim-requests), post the link to the output file and close the ticket.


== Members ==
* [https://github.com/Popolechien Popolechien], manager in line
* [https://github.com/RavanJAltaie Ravan], content manager
* [https://github.com/benoit74 Benoit74], scrapers lead dev

== See also ==
* [[Content strategy]]