Difference between revisions of "How-to create a Python scraper"
(Add information about evidences needed)
(Split up best practices into sections with recommendations about different facets)
Revision as of 21:18, 23 July 2024
This page is a high-level outline of what to consider when creating a new scraper that produces ZIM files usable in Kiwix or other compatible readers.
Developing the scraper
- Decide what resource you want to create a scraper for.
- If the resource is a website, check to see if https://zimit.kiwix.org/ works.
- Make sure none of the existing scrapers work for your use case.
- Decide how you want to implement the scraper, put together a proposal, and submit a request in the zim-requests repository so the community can give you feedback and create a repository for you if needed (Example). Some questions you might want to answer in the request are:
  - Information about the resource you want to scrape.
    - In particular, evidence that the future scraper will be in line with openZIM's content purpose and publishing policies.
      - Sometimes this is obvious; sometimes it needs some explanation, especially regarding content licensing.
      - If the scraper is not in line with these requirements, this is not a show-stopper at all, but we will probably not host it under the openzim umbrella and will probably dedicate less effort to supporting you (you are still free to create a scraper; all of our software stack is free software).
  - Why create a new scraper versus using one that already exists (or extending the capabilities of a closely related existing scraper).
  - A rough sketch of your proposal.
- Implement the scraper using the Python bootstrap repository as a basis.
Best practices
This section collects best practices covering different aspects of developing a scraper.
Development
A Python scraper should ideally match other openZIM scrapers by:
- adhering to openZIM's Contribution Guidelines
- implementing openZIM's Python bootstrap, conventions and policies
- being hosted on GitHub under the openzim organization
- using utilities from python-scraperlib (zimscraperlib on PyPI) to create the ZIM and provide a consistent experience
Scraper User Experience
The scraper will mostly be used by individuals who curate their own ZIM collections and by Zimfarm maintainers. To help them, the scraper should:
- be configurable with CLI flags, especially for ZIM metadata (title, description, tags, ...) and filename
- validate all flags and metadata as early as possible to avoid spending time fetching online resources only to find they cannot produce a ZIM
- implement proper logging with various log levels (error, warning, info, debug)
- create logs that are easy to understand e.g. avoiding filenames in logs by using the logging package in zimscraperlib
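The recommendations above can be sketched with the standard library alone. This is a minimal illustration, not the way any existing openZIM scraper does it: the program name `example2zim`, the flag names, and the length limits are all assumptions chosen for the example, and a real scraper would rather use the logging helpers from zimscraperlib than plain `logging`. The point it shows is that every flag is checked before any network access happens.

```python
from __future__ import annotations

import argparse
import logging

logger = logging.getLogger("example2zim")  # hypothetical scraper name

# Assumed limits for illustration; check the openZIM metadata
# conventions for the values your scraper must enforce.
MAX_TITLE_LENGTH = 30
MAX_DESCRIPTION_LENGTH = 80


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI flags (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="example2zim")
    parser.add_argument("--title", required=True, help="ZIM title")
    parser.add_argument("--description", required=True, help="ZIM description")
    parser.add_argument("--tags", default="", help="semicolon-separated tags")
    parser.add_argument("--zim-file", default="example.zim", help="output filename")
    parser.add_argument("--debug", action="store_true", help="verbose logging")
    return parser


def validate(args: argparse.Namespace) -> list[str]:
    """Return every problem found, so the user can fix them all at once."""
    errors = []
    if not args.title.strip():
        errors.append("--title must not be empty")
    if len(args.title) > MAX_TITLE_LENGTH:
        errors.append(f"--title is longer than {MAX_TITLE_LENGTH} characters")
    if len(args.description) > MAX_DESCRIPTION_LENGTH:
        errors.append(
            f"--description is longer than {MAX_DESCRIPTION_LENGTH} characters"
        )
    if not args.zim_file.endswith(".zim"):
        errors.append("--zim-file must end with .zim")
    return errors


def main(argv: list[str] | None = None) -> int:
    args = build_parser().parse_args(argv)
    logging.basicConfig(
        level=logging.DEBUG if args.debug else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    # Validate before fetching anything online.
    errors = validate(args)
    if errors:
        for error in errors:
            logger.error(error)
        return 1
    logger.info("Flags are valid, starting to fetch resources")
    logger.debug("Parsed flags: %s", vars(args))
    return 0
```

Returning a list of errors rather than failing on the first one means a Zimfarm maintainer who misconfigures a recipe sees all the problems in a single run.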
Supporting Zimfarm
Scrapers should be good citizens when operating in shared infrastructure such as Zimfarm by:
- re-encoding images and videos so that the final ZIM size is (by default at least) moderate
- caching re-encoded assets on an S3 bucket (we can provide you with a dev bucket on request) to avoid wasting compute time
- avoiding unnecessary filesystem access i.e. prefer to add items to the ZIM on-the-fly rather than staging them on the FS then creating the ZIM
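- consuming as few resources as possible (CPU time, disk I/O, disk space, memory, network, ...)
- implementing a task progress JSON file so Zimfarm can report status (for an example, see https://github.com/openzim/gutenberg/blob/f3440861d45e46a50aeea2e79255ede16aee9121/src/gutenberg2zim/export.py#L293)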
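As an illustration of the last point, a task progress file can be as simple as a small JSON document rewritten as the scraper advances. The sketch below is an assumption-laden example rather than the format Zimfarm mandates: the `done` and `total` keys and the helper names `write_progress` and `run_with_progress` are chosen for the example, so check an existing scraper for the exact keys expected. The file is written to a temporary name and then renamed, so a reader never sees a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def write_progress(stats_path: Path, done: int, total: int) -> None:
    """Atomically write the progress file (key names are assumed)."""
    payload = {"done": done, "total": total}
    # Write next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=stats_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp_name, stats_path)


def run_with_progress(items: list, stats_path: Path) -> None:
    """Process items one by one, reporting progress after each."""
    total = len(items)
    write_progress(stats_path, 0, total)
    for index, _item in enumerate(items, start=1):
        # ... fetch, re-encode and add the item to the ZIM here ...
        write_progress(stats_path, index, total)
```

For scrapers with many small items, writing the file after every item may itself cost noticeable disk I/O; updating it every few items or every few seconds is a reasonable compromise.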
How to develop a nice UI to run inside the ZIM
The original scrapers use Jinja2 to render HTML files dynamically and add them to the ZIM. We are currently migrating to another approach where the UI running inside the ZIM is a Vue.js project. We are not yet certain which approach is best. Vue.js makes it possible to quickly build very dynamic interfaces in a clean way, whereas the Jinja2 approach usually relied on "crappy" JS based on jQuery and the like. However, Vue.js probably supports a more limited set of browsers and comes with a steeper learning curve for contributing to scrapers. The freeCodeCamp scraper already uses the Vue.js approach. The YouTube scraper is currently migrating to it. The Kolibri scraper has begun to migrate, but that work is still stuck in a v2 branch.
Additional resources
- Contributing guidelines: https://github.com/openzim/overview/wiki/Contributing
- Python bootstrap library: https://github.com/openzim/_python-bootstrap
- Kiwix #scrapers Slack channel for questions: https://app.slack.com/client/T41CNGNKF/C013Z8Q7067