Content team/ZIM Naming Convention
This page was originally located at https://github.com/openzim/overview/wiki/ZIMs-Naming-Convention
Context
- When publishing a ZIM, it's important to pay attention to its metadata, as this is how other people will distinguish it from other content.
- Metadata lists the common and required metadata expected for a ZIM file
- None of them needs to be unique. ZIMs already include an identifier (called ID, a UUID) that is generated automatically during creation. This doesn't diminish the value of the other metadata, though: you still want readers to easily and confidently choose ZIMs based on them.
- We need to ensure that collisions do not happen (typically two different websites leading to the same ZIM Name) and that users understand which source content they are downloading or using.
- Choosing good and appropriate metadata can be difficult, but it's not what this document is about.
This document is about setting a valid `Name` metadata and filename for openZIM-created ZIMs (usually via the Zimfarm).
Why do we care?
- We create thousands of ZIMs every month. Convention is essential to be able to automate some tasks.
- Convention means applying a pattern, so no need to find what to use: simpler, faster.
- We use the `Name` metadata to match Zimfarm-produced ZIMs with *Titles* in the CMS.
- We use the `Name` metadata to set the ZIM filename in most scrapers.
- Many scripts depend on the filenames to maintain the central library: build the XML library, move files to the appropriate folder, evict older files, generate redirects, etc.
- Offspot YAML catalog uses *Human IDs* that are derived from the filenames.
ZIM `Name` Metadata
Format: {project}_{lang}_{selection}
The `_` character is reserved as the separator between the parts.
The parts must only contain alphanumeric characters, `-`, or `.`.
The parts must be all lowercase.
Part | Description | Example |
---|---|---|
`project` | Domain name (or project) [1] | `android.stackexchange.com`, `wikipedia` |
`lang` | ISO-639-1 (2 chars) language code | `en`, `fr`, `zh`, `mul` [2] |
`selection` | A short, slug-like string indicating the selection over the project | `all`, `top`, `football` |
- [1] Domain name by default; project names are exceptions (basically valid only if we have at least a dedicated category for this project). Use the domain name if unsure or, better, ask on Slack. Should the domain name contain characters that are illegal under this convention, it is encoded with Punycode (e.g. using https://www.punycoder.com/).
- [2] `mul` is to be used for multiple-language ZIMs. Note that the ZIM `Language` metadata lists the languages (ISO-639-3) instead of using `mul`.
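The rules above can be checked mechanically. The following is a minimal sketch, not an official openZIM tool: the helper names `is_valid_name` and `encode_domain` are hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits (which seems consistent with the Punycode requirement). Python's built-in `idna` codec is used here as one way to obtain the Punycode form of a domain.

```python
import re

# Assumption: parts are limited to ASCII lowercase letters, digits, "-" and ".".
PART = r"[a-z0-9.-]+"

# {project}_{lang}_{selection}; lang is a 2-letter code or the special "mul".
NAME_RE = re.compile(
    rf"(?P<project>{PART})_(?P<lang>[a-z]{{2}}|mul)_(?P<selection>{PART})"
)


def is_valid_name(name: str) -> bool:
    """Return True if `name` follows the {project}_{lang}_{selection} pattern."""
    return NAME_RE.fullmatch(name) is not None


def encode_domain(domain: str) -> str:
    """Return the ASCII (Punycode-based) form of a domain via the idna codec."""
    return domain.encode("idna").decode("ascii")


print(is_valid_name("android.stackexchange.com_en_all"))  # True
print(is_valid_name("Wikipedia_en_all"))                  # False (uppercase)
print(is_valid_name("wikipedia_eng_all"))                 # False (3-letter lang)
print(encode_domain("münchen.de"))                        # xn--mnchen-3ya.de
```

Note that this only validates the shape of the `lang` part; it does not check that the two letters are an actual ISO-639-1 code.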
ZIM filename
Format: {Name}[_{flavour}]_{period}.zim
The `_` character is reserved as the separator between the parts.
The parts must only contain alphanumeric characters, `-`, or `.`.
The filename must be all lowercase.
Part | Description | Example |
---|---|---|
`Name` | The `Name` metadata described above [1] | `wikipedia_fr_top`, `wikihow_th_all`, `stackoverflow.com_en_all` |
`flavour` | Optional. One of the existing flavours, indicating a modification of the content for size reasons | `mini`, `nopic`, `maxi` |
`period` | The period when the ZIM was created, in `YYYY-MM` (year-month) format | `2019-03`, `2022-12` |
- [1] It doesn't need to be equal to the `Name` metadata, but the same requirements apply.
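As an illustration of the filename format, here is a hedged sketch that splits a filename into its `Name`, `flavour` and `period` parts. The function name `parse_filename` is hypothetical, and the list of flavours is an assumption taken from the examples in the table; the real set of accepted flavours may differ.

```python
import re

PART = r"[a-z0-9.-]+"
# Assumption: only the flavours shown as examples in the table are accepted.
FLAVOURS = ("mini", "nopic", "maxi")
FLAVOUR_ALT = "|".join(FLAVOURS)

# {Name}[_{flavour}]_{period}.zim, where period is YYYY-MM.
FILENAME_RE = re.compile(
    rf"(?P<name>{PART}_(?:[a-z]{{2}}|mul)_{PART})"
    rf"(?:_(?P<flavour>{FLAVOUR_ALT}))?"
    r"_(?P<period>[0-9]{4}-(?:0[1-9]|1[0-2]))"
    r"\.zim"
)


def parse_filename(filename: str):
    """Return a dict with name, flavour and period, or None if invalid."""
    match = FILENAME_RE.fullmatch(filename)
    return match.groupdict() if match else None


print(parse_filename("wikipedia_fr_top_2019-03.zim"))
# {'name': 'wikipedia_fr_top', 'flavour': None, 'period': '2019-03'}
print(parse_filename("stackoverflow.com_en_all_nopic_2022-12.zim"))
# {'name': 'stackoverflow.com_en_all', 'flavour': 'nopic', 'period': '2022-12'}
print(parse_filename("wikipedia_fr_top_2019-13.zim"))
# None (13 is not a valid month)
```

Because `_` cannot appear inside a part, the number of separators alone decides whether a flavour is present, so the parse is unambiguous.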
Zimfarm
Depending on the scraper, setting the `Name` metadata in the Zimfarm can be mandatory (follow the instructions above) or optional. When optional, the scraper usually sets it properly according to the convention. Should it not, open a ticket on the scraper repo and set it manually in the recipe until it is fixed.
Filenames are also optional in the Zimfarm, but the common behavior is to append the period part (e.g. `_2022-01`) after the value of the `Name` metadata. If you customized the `Name`, make sure the filename will remain valid or set it manually.
Important: when setting the filename manually, you are responsible for the whole filename, including the period part. Most scrapers allow inserting a special `{period}` string that will be replaced with the actual year-month value. Ex: `supersite.com_en_all_{period}.zim`.
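To make the `{period}` substitution concrete, here is a small sketch of what such a replacement could look like. It is only an illustration: the function names are hypothetical, and how each scraper actually implements the substitution is not specified here.

```python
from datetime import date


def resolve_filename(template: str, today: date) -> str:
    """Replace the literal "{period}" placeholder with the YYYY-MM period."""
    return template.replace("{period}", today.strftime("%Y-%m"))


def default_filename(name: str, today: date) -> str:
    """Mimic the common default: Name metadata, then the period, then .zim."""
    return resolve_filename(name + "_{period}.zim", today)


print(resolve_filename("supersite.com_en_all_{period}.zim", date(2022, 1, 15)))
# supersite.com_en_all_2022-01.zim
print(default_filename("wikipedia_fr_top", date(2019, 3, 2)))
# wikipedia_fr_top_2019-03.zim
```

Passing the date explicitly, rather than reading the clock inside the function, keeps the result reproducible when a recipe is tested.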