Difference between revisions of "How-to create a Python scraper"

Add information about evidences needed
(Added additional resources and developing the scraper How-To section)
(Add information about evidences needed)
Line 5: Line 5:
# Decide what resource you want to create a scraper for.
## If the resource is a website, first check whether https://zimit.kiwix.org/ works for it.
## Make sure none [https://github.com/openzim?q=scraper&type=all&language=&sort= of the existing scrapers] work for your use-case.
# Decide how you want to implement the scraper, put together a proposal, and [https://github.com/openzim/zim-requests/issues/new/choose submit a request in the <code>zim-requests</code> repository] so the community can give you feedback and create a repository for you if needed ([https://github.com/openzim/zim-requests/issues/1086#issuecomment-2210235271 Example]). Some questions you might want to answer in the request are:
#* Information about the resource you want to scrape.
#** In particular, evidence that the future scraper will be in line with the openZIM content [[Content team#Purpose|purpose]] and [[Content team#Publishing|publishing policies]].
#*** Sometimes this is obvious; sometimes it needs some explanation, especially regarding content licensing.
#*** If the scraper is not in line with these requirements, this is not a show-stopper, but we will probably not host it under the openZIM umbrella and will probably dedicate less effort to supporting you. You are still free to create the scraper: our whole software stack is free software.
#* Why create a new scraper rather than use one that already exists (or extend the capabilities of a closely related existing scraper).
#* A rough sketch of your proposal.
# Implement the scraper [https://github.com/openzim/_python-bootstrap/blob/main/docs/Policy.md using the Python bootstrap repository as a basis].
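
The check for existing scrapers in step 1 can also be done programmatically. The following is only a minimal sketch, assuming the public GitHub REST search endpoint and the same <code>scraper</code> keyword used in the link above; the function names (<code>build_search_url</code>, <code>summarize</code>, <code>list_existing_scrapers</code>) are illustrative and not part of any openZIM tooling.

```python
import json
import urllib.parse
import urllib.request

# Public GitHub REST endpoint for repository search.
GITHUB_SEARCH_API = "https://api.github.com/search/repositories"


def build_search_url(org="openzim", keyword="scraper"):
    """Build the search URL for repositories in `org` matching `keyword`."""
    query = urllib.parse.urlencode({"q": f"{keyword} org:{org}", "per_page": 100})
    return f"{GITHUB_SEARCH_API}?{query}"


def summarize(payload):
    """Reduce a search response to (full_name, description) pairs."""
    return [
        (item["full_name"], item.get("description") or "")
        for item in payload.get("items", [])
    ]


def list_existing_scrapers(org="openzim", keyword="scraper"):
    """Query GitHub (network access required) and list candidate scrapers."""
    request = urllib.request.Request(
        build_search_url(org, keyword),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return summarize(json.load(response))


if __name__ == "__main__":
    # Hypothetical usage: print each candidate so it can be reviewed by hand.
    for name, description in list_existing_scrapers():
        print(f"{name}: {description}")
```

A keyword search is only a first filter: repository names and descriptions still need to be read by hand to judge whether an existing scraper covers the use-case.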
Line 15: Line 18:
A Python scraper should ideally:


- adhere to openZIM's [https://github.com/openzim/overview/wiki/Contributing Contribution Guidelines] and implement openZIM's [https://github.com/openzim/_python-bootstrap/docs/Policy.md Python bootstrap, conventions and policies]


- be hosted on GitHub under the openzim [https://github.com/openzim/ organization] (we can create a repo for you there on request)


- use [https://github.com/openzim/python-scraperlib/ python-scraperlib] ([https://pypi.org/project/zimscraperlib/ zimscraperlib] on PyPI) to create the ZIM; it also provides many useful utilities